## Abstract

Brains must represent the outside world so that animals survive and thrive. In early sensory systems, neural populations have diverse receptive fields structured to detect important features in inputs, yet significant variability has been ignored in classical models of sensory neurons. We model neuronal receptive fields as random, variable samples from parameterized distributions and demonstrate this model in two sensory modalities using data from insect mechanosensors and mammalian primary visual cortex. Our approach leads to a significant theoretical connection between the foundational concepts of receptive fields and random features, a leading theory for understanding artificial neural networks. The modeled neurons perform a randomized wavelet transform on inputs, which removes high frequency noise and boosts the signal. Further, these random feature neurons enable learning from fewer training samples and with smaller networks in artificial tasks. This structured random model of receptive fields provides a unifying, mathematically tractable framework to understand sensory encodings across both spatial and temporal domains.

## 1 Introduction

It has long been argued that the brain uses a large population of neurons to represent the world [98, 31, 81, 89]. In this view, sensory stimuli are encoded by the responses of the population, which are then used by downstream areas for diverse tasks, including learning, decision-making, and movement control. These sensory areas have different neurons responding to differing stimuli while also providing a measure of redundancy. However, we still lack a clear understanding of what response properties are well-suited for different sensory modalities.

One way to approach sensory encoding is by understanding how a neuron would respond to arbitrary stimuli. Experimentally, we typically present many stimuli to the animal, measure the responses of sensory neurons, then attempt to estimate some kind of model for how the neurons respond to an arbitrary stimulus. A common assumption is that the neuron computes a linear filter of the stimulus, which then drives spiking through a nonlinear spike-generating mechanism. Mathematically, this assumption can be summarized as the number of measured spikes for a stimulus **x** being equal to *σ*(**w**^{T}**x**) for a weight vector **w** and nonlinearity *σ*. Here, the weights **w** define the filtering properties of the neuron, also known as its *receptive field* [84]. This model is known as a *linear-nonlinear* (LN) model [18], and it is also the most common form of artificial neuron in artificial neural networks (ANNs). LN models have been used extensively to describe the firing of diverse neurons in various sensory modalities of vertebrates and invertebrates. In the mammalian visual system, LN models have been used to characterize retinal ganglion cells [80], lateral geniculate neurons [20], and simple cells in primary visual cortex (V1) [45]. They have also been used to characterize auditory sensory neurons in the avian midbrain [47] and somatosensory neurons in the cortex [79]. In insects, they have been used to understand the response properties of visual interneurons [77], mechanosensory neurons involved in proprioception [29, 70], and auditory neurons during courtship behavior [21].

Given the stimulus presented and neural response data, one can then estimate the receptive fields of a population of neurons. Simple visual receptive fields have classically been understood as similar to wavelets with particular spatial frequency and angular selectivity [45]. In mechanosensory areas, receptive fields are selective to temporal frequency over a short time window [29]. Commonly, parametric modeling (Gabor wavelets [89]) or smoothing (regularization, etc. [67]) is used to produce “clean” receptive fields. Yet, the data alone show noisy receptive fields that are perhaps best modeled using a random distribution [9]. As we will show, modeling receptive fields as random samples produces realistic receptive fields that reflect both the structure and noisiness seen in experimental data. More importantly, this perspective creates significant theoretical connections between foundational ideas from neuroscience and artificial intelligence. This connection helps us understand why receptive fields have the structures that they do and how this structure relates to the kinds of stimuli that are relevant to the animal.

Modeling the filtering properties of a population of LN neurons as samples from a random distribution leads to the study of networks with random weights [76, 15, 53]. In machine learning (ML), such networks are known as *random feature networks* (RFNs) [12, 41, 72, 54]. The study of RFNs has rapidly gained popularity in recent years, in large part because it offers a theoretically tractable way to study the learning properties of ANNs where the weights are tuned using data [3, 2, 17]. When the RFN contains many neurons, it can approximate functions that live in a well-understood function space. This function space is called a *reproducing kernel Hilbert space* (RKHS), and it depends on the network details, in particular the weight (i.e., receptive field) distribution [60, 93, 73]. Learning can then be framed as approximating functions in this space from limited data.

Several recent works highlight the RFN theory’s usefulness for understanding learning in neural systems. Bordelon, Canatar, and Pehlevan, in a series of papers, have shown that neural codes allow learning from few examples when the spectral properties of their second-order statistics align with the spectral properties of the task [10, 11, 14]. When applied to V1, they found that the neural code is aligned with tasks that depend on low spatial frequency components. Harris constructed an RFN model of sparse networks found in associative centers like the cerebellum and insect mushroom body and showed that these areas may behave like additive kernels [36], an architecture also considered by Hashemi et al. [37]. These classes of kernels are beneficial for learning in high dimensions because they can learn from fewer examples and remain resilient to input noise or adversarial perturbation. Xie et al. investigated the relationship between the fraction of active neurons in a model of the cerebellum—controlled by neuron thresholds—and generalization performance for learning movement trajectories [94]. In the vast majority of network studies with random weights, these weights **w** are drawn from a Gaussian distribution with independent entries. This sampling is equivalent to a fully *unstructured* receptive field, which looks like white noise.

Closely related to our work, a previous study of ANNs showed that directly learning structured receptive fields could improve image classification in deep networks [43]. Their receptive fields were parametrized as a sum of Gaussian derivatives up to fourth order. This led to better performance against rival architectures in low data regimes.

In this paper, we study the effect of having *structured yet random* receptive fields and how they lead to informative sensory encodings. Specifically, we consider receptive fields generated by a Gaussian process (GP), which can be thought of as drawing the weights * w* from a Gaussian distribution with a particular covariance matrix. We show that networks with such random weights project the input to a new basis and filter out particular components. This theory introduces realistic structure of receptive fields into random feature models which are crucial to our current understanding of artificial networks. Next, we show that receptive field datasets from two disparate sensory systems, mechanosensory neurons on insect wings and V1 cortical neurons from mice and monkeys, are well-modeled by GPs with covariance functions that have wavelet eigenbases. Given the success of modeling these data with the GP, we apply these weight distributions in RFNs that are used in synthetic learning tasks. We find that these structured weights improve learning by reducing the number of training examples and the size of the network needed to learn the task. Thus, structured random weights offer a realistic generative model of the receptive fields in multiple sensory areas, which we understand as performing a random change of basis. This change of basis enables the network to represent the most important properties of the stimulus, which we demonstrate to be useful for learning.

## 2 Results

We construct a generative model for the receptive fields of sensory neurons and use it for the weights of an ANN. We refer to such a network as a *structured* random feature network. In Section 2.1, we review the basics of random feature networks, the details and rationale behind our generative model, and the process by which we generate hidden weights. Our main theory result is that networks with such weights transform the inputs into a new basis and filter out particular components, thus bridging sensory neuroscience and the theory of neural networks. In Section 2.2, we show that neurons in two receptive field datasets—insect mechanosensory neurons and mammalian V1 cortical neurons—are well-described by our generative model. There is a close resemblance between the second-order statistics, sampled receptive fields, and their principal components for both data and model. Finally, in Section 2.3 we show the performance of structured random feature networks on several synthetic learning tasks. The hidden weights from our generative model allow the network to learn from fewer training examples and smaller network sizes.

### 2.1 Theoretical analysis

We consider receptive fields generated by GPs in order to connect this foundational concept from sensory neuroscience to the theory of random features in artificial neural networks. GPs can be thought of as samples from a Gaussian distribution with a particular covariance matrix, and we initialize the hidden weights of RFNs using these GPs. We show that using a GP causes the network to project the input into a new basis and filter out particular components. The basis itself is determined by the covariance matrix of the Gaussian, and is useful for removing irrelevant and noisy components from the input. We use these results to study the space of functions that RFNs containing many neurons can learn by connecting our construction to the theory of kernel methods.

#### 2.1.1 Random feature networks

We start by introducing the main learning algorithm and the neuronal model of our work, the RFN. Consider a two-layer, feedforward ANN. Traditionally, all the weights are initialized randomly and learned through backpropagation by minimizing some loss objective. In sharp contrast, RFNs have their hidden layer weights sampled randomly from some distribution and fixed. Each hidden unit computes a random feature of the input, and only the output layer weights are trained (Fig. 1). Note that the weights are randomly drawn but the neuron’s response is a deterministic function of the input given the weights.

Mathematically, we have the hidden layer activations and output given by

**h** = *σ*(**Wx**),  ŷ = *β*_{0} + **β**^{T}**h**,  (1)

where **x** is the stimulus, **h** are the hidden neuron responses, and ŷ is the predicted output. We use a rectified linear (ReLU) nonlinearity, *σ*(*x*) = max(0, *x*), applied entrywise in (1). The hidden layer weights **W** are drawn randomly and fixed. Only the readout weights *β*_{0} and **β** are trained in RFNs.

In our RFN experiments, we train the readout weights and offset using a support vector machine (SVM) classifier with squared hinge loss and *ℓ*^{2} penalty, with regularization strength tuned in the range [10^{−3}, 10^{3}] by 5-fold cross-validation. Our RFNs do not include a threshold for the hidden neurons, although this could help in certain contexts [94].
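As a concrete sketch (not the authors' code), the following minimal example builds a classical RFN on a toy, linearly separable task: hidden ReLU weights are sampled once from a spherical Gaussian and frozen, and only a linear SVM readout with squared hinge loss is trained. The task, layer sizes, and regularization strength are arbitrary placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
d, n_hidden, n_samples = 20, 200, 400

# Toy binary task: two Gaussian classes with shifted means (illustrative only).
X = rng.normal(size=(n_samples, d))
y = rng.integers(0, 2, size=n_samples)
X[y == 1] += 1.0

# Classical RFN: hidden weights drawn once from a spherical Gaussian and fixed.
W = rng.normal(size=(n_hidden, d))
H = np.maximum(0.0, X @ W.T)  # ReLU random features, one column per hidden unit

# Only the readout is trained (here a linear SVM with squared hinge loss).
clf = LinearSVC(C=1.0, loss="squared_hinge", max_iter=5000).fit(H, y)
print(f"train accuracy: {clf.score(H, y):.2f}")
```

With structured weights, only the line sampling `W` would change; the readout training is identical.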

In the vast majority of studies with RFNs, each neuron’s weights are initialized i.i.d. from a spherical Gaussian distribution, **w** ∼ 𝒩(**0**, **I**_{d}). We will call networks built this way *classical unstructured* RFNs (Fig. 1). We propose a variation where hidden weights are initialized **w** ∼ 𝒩(**0**, **C**), where **C** is a positive semidefinite covariance matrix. We call such networks *structured* RFNs (Fig. 1), to mean that the weights are random with a specified covariance. To compare unstructured and structured weights on equal footing, we normalize the covariance matrices so that Tr(**C**) = Tr(**I**_{d}) = *d*, which ensures that the mean square amplitude of the weights is the same in both cases.
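The trace normalization can be sketched as follows; the covariance here is an arbitrary positive semidefinite placeholder rather than one of the fitted receptive field models, and the small Cholesky jitter is only a numerical convenience.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64

# Build an arbitrary positive semidefinite covariance (placeholder structure).
A = rng.normal(size=(d, d))
C = A @ A.T

# Normalize so that Tr(C) = d, putting structured and unstructured
# weights on the same overall scale as w ~ N(0, I_d).
C *= d / np.trace(C)

# Sample structured weights w ~ N(0, C) for many hidden units.
L = np.linalg.cholesky(C + 1e-10 * np.eye(d))
W = (L @ rng.normal(size=(d, 10000))).T  # 10000 weight vectors, one per row

# Mean square amplitude of the weights is ~1, matching the spherical case.
print(np.mean(W**2))
```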

#### 2.1.2 Receptive fields modeled by linear weights

Sensory neurons respond preferentially to specific features of their inputs. This stimulus selectivity is often summarized as a neuron’s receptive field, which describes which features of the sensory space elicit responses when stimulated [84]. Mathematically, receptive fields are modeled as a linear filter in the stimulus space. Linear filters are also an integral component of the widely used LN model of sensory processing [18]. According to this model, the firing rate of a neuron is a nonlinear function applied to the projection of the stimulus onto the low-dimensional subspace of the linear filter.

A linear filter model of receptive fields can explain responses of individual neurons to diverse stimuli. It has been used to describe disparate sensory systems like visual, auditory, and somatosensory systems of diverse species including birds, mammals, and insects [80, 77, 47, 21, 79]. If the stimuli are uncorrelated, the filters can be estimated by computing the spike triggered average (STA), the average stimulus that elicited a spike for the neuron. When the stimuli are correlated, the STA filter is whitened by the inverse of the stimulus covariance matrix [66]. Often these STAs are denoised by fitting a parametric function to the STA [18], such as Gabor wavelets for simple cells in V1 [45].
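To make the STA estimate concrete, here is a minimal simulation of a hypothetical LN neuron driven by white (uncorrelated) stimuli; the sinusoidal filter, ReLU nonlinearity, and Poisson spike counts are illustrative assumptions, not fit to any dataset.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_stim = 50, 20000

# A hypothetical LN neuron: linear filter + ReLU rate + Poisson spike counts.
w_true = np.sin(2 * np.pi * 3 * np.arange(d) / d)
stimuli = rng.normal(size=(n_stim, d))      # uncorrelated (white) stimuli
rate = np.maximum(0.0, stimuli @ w_true)    # LN firing rate
spikes = rng.poisson(rate)

# Spike-triggered average: spike-count-weighted mean stimulus.
sta = spikes @ stimuli / spikes.sum()

# For white stimuli the STA is proportional to the true linear filter.
corr = np.corrcoef(sta, w_true)[0, 1]
print(f"correlation with true filter: {corr:.2f}")
```

For correlated stimuli, the same estimate would first be whitened by the inverse stimulus covariance, as described above.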

We model the receptive field of a neuron *i* as its weight vector **w**_{i} and its nonlinear function as *σ*. Instead of fitting a parametric function, we construct covariance functions so that each realization of the resulting Gaussian process resembles a biological receptive field (Fig. 1).

#### 2.1.3 Structured weights project and filter input into the covariance eigenbasis

We generate network weights from Gaussian processes (GP) whose covariance functions are inspired by the receptive fields of sensory neurons in the brain. By definition, a GP is a stochastic process where finite observations follow a Gaussian distribution [74]. We find that networks with such weights project inputs into a new basis and filter out irrelevant components. In Section 2.3, we will see that this adds an inductive bias to classical RFNs for tasks with naturalistic inputs and improves learning.

We view our weight vector **w** as the finite-dimensional discretization of a continuous function *w*(*t*), which is a sample from a GP. The continuous function has domain *T*, a compact subset of ℝ^{D}, and we assume that *T* is discretized using a grid of *d* equally spaced points {*t*_{1}, …, *t*_{d}} ⊂ *T*, so that *w*_{i} = *w*(*t*_{i}). Let the input be a real-valued function *x*(*t*) over the same domain *T*, which could represent a finite timeseries (*D* = 1), an image of luminance on the retina (*D* = 2), or more complicated spatiotemporal sets like a movie (*D* = 3). In the continuous setting, the *d*-dimensional *ℓ*^{2} inner product gets replaced by the *L*^{2}(*T*) inner product 〈*w*, *x*〉 = ∫_{t∈T} *w*(*t*)*x*(*t*) *dt*.

Every GP is fully specified by its mean and covariance function *C*(*t*, *t′*). We will always assume that the mean is zero and study different covariance functions. By the Kosambi-Karhunen-Loève theorem [49], each realization of a zero-mean GP has a random series representation

*w*(*t*) = Σ_{i} *λ*_{i} *z*_{i} *ϕ*_{i}(*t*)  (2)

in terms of standard Gaussian random variables *z*_{i} ∼ 𝒩(0, 1), functions *ϕ*_{i}(*t*), and weights *λ*_{i} ≥ 0. The pairs (*λ*_{i}^{2}, *ϕ*_{i}) are eigenvalue, eigenfunction pairs of the covariance operator, which is the continuous analog of the covariance matrix **C**. If *C*(*t*, *t′*) is positive definite, as opposed to just semidefinite, all *λ*_{i} > 0 and these eigenfunctions *ϕ*_{i} form a complete basis for *L*^{2}(*T*). Using (2), the inner product between a stimulus and a neuron’s weights is

〈*w*, *x*〉 = Σ_{i} *λ*_{i} *z*_{i} 〈*ϕ*_{i}, *x*〉.  (3)

Equation (3) shows that the structured weights compute a *projection* of the input *x* onto each eigenfunction 〈*ϕ*_{i}, *x*〉 and reweight or *filter* by the eigenvalue *λ*_{i} before taking the *ℓ*^{2} inner product with the random Gaussian weights *z*_{i}.

It is illuminating to see what these continuous equations look like in the *d*-dimensional discrete setting. Samples from the finite-dimensional GP, **w** ∼ 𝒩(**0**, **C**), are used as the hidden weights in RFNs. First, the GP series representation (2) becomes **w** = **ΦΛz**, where **Λ** and **Φ** are matrices of eigenvalues and eigenvectors, and **z** ∼ 𝒩(**0**, **I**_{d}) is a Gaussian random vector. By the definition of the covariance matrix, **C** = 𝔼[**ww**^{T}], which is equal to **ΦΛ**^{2}**Φ**^{T} after a few steps of linear algebra. Finally, (3) is analogous to **w**^{T}**x** = **z**^{T}**ΛΦ**^{T}**x**. Since **Φ** is an orthogonal matrix, **Φ**^{T}**x** is equivalent to a change of basis, and the diagonal matrix **Λ** shrinks or expands certain directions to perform filtering. This can be summarized in the following theorem:

**Theorem 1** (Basis change formula). *Assume* **w** ∼ 𝒩(**0**, **C**) *with* **C** = **ΦΛ**^{2}**Φ**^{T} *its eigenvalue decomposition. For* **x** ∈ ℝ^{d}, *define the filtered input*

**x̃** = **ΛΦ**^{T}**x**.  (4)

*Then* **w**^{T}**x** = **z**^{T}**x̃** *for* **z** ∼ 𝒩(**0**, **I**_{d}).

Theorem 1 says that projecting an input onto a structured weight vector is the same as first filtering that input in the GP eigenbasis and doing a random projection onto a spherical random Gaussian. The form of the GP eigenbasis is determined by the choice of the covariance function. If the covariance function is compatible with the input structure, the hidden weights filter out any irrelevant features or noise in the stimuli while amplifying the descriptive features. This inductive bias facilitates inference on the stimuli by any downstream predictor. Because the spherical Gaussian distribution is the canonical choice for unstructured RFNs, there is a simple way to evaluate the effective kernel of structured RFNs in terms of the filtered inputs **x̃** (see Appendix A.1).
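The basis change formula is easy to verify numerically. In this sketch the covariance is an arbitrary positive semidefinite matrix rather than a fitted receptive field model:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 32

# Structured covariance C = Phi Lambda^2 Phi^T (any PSD matrix works here).
A = rng.normal(size=(d, d))
C = A @ A.T
evals, Phi = np.linalg.eigh(C)
Lam = np.diag(np.sqrt(np.clip(evals, 0.0, None)))

# Sample w = Phi Lam z with z spherical Gaussian, so that w ~ N(0, C).
z = rng.normal(size=d)
w = Phi @ Lam @ z

# Basis change: w^T x equals z^T x_tilde, where x_tilde = Lam Phi^T x.
x = rng.normal(size=d)
x_tilde = Lam @ Phi.T @ x
assert np.allclose(w @ x, z @ x_tilde)
```

The filtered input `x_tilde` is exactly the quantity that a downstream spherical random projection sees, which is what makes the effective-kernel calculation possible.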

Our expression for the structured kernel provides a concrete connection to the kernel theory of learning using nonlinear neural networks. For readers interested in such kernel theories, a full example of how these work is given in Appendix A.2. There we show that there can be an exponential reduction in the number of samples needed to learn frequency detection using a structured versus unstructured basis.

### 2.2 Examples of random yet structured receptive fields

Our goal is to model the weights of artificial neurons in a way that is inspired by biological neurons’ receptive fields. Structured RFNs sample hidden weights from GPs with structured covariance, so we construct covariance functions that make the generated weights resemble neuronal receptive fields. We start with a toy example of a stationary GP with a well-understood Fourier eigenbasis and show how the receptive fields generated from this GP are selective to frequencies in timeseries signals. Then, we construct locally stationary covariance models of the insect mechanosensory and V1 neuron receptive fields. These models are shown to be a good match for experimental data.

#### 2.2.1 Warm-up: frequency selectivity from stationary covariance

To illustrate some results from our theoretical analysis, we start with a toy example of temporal receptive fields that are selective to particular frequencies. This example may be familiar to readers comfortable with Fourier series and basic signal processing. Let the input be a finite continuous timeseries *x*(*t*) over the interval *T* = [0, *L*]. We use the covariance function

*C*(*t*, *t′*) = Σ_{k} *λ*_{k}^{2} cos(*ω*_{k}(*t* − *t′*)),  (5)

where *ω*_{k} = 2*πk*/*L* is the *k*th natural frequency and *λ*_{k}^{2} are the weight coefficients. The covariance function (5) is *stationary*, which means that it only depends on the difference between the timepoints *t* − *t′*. Applying the compound angle formula, we get

*C*(*t*, *t′*) = Σ_{k} *λ*_{k}^{2} [cos(*ω*_{k}*t*) cos(*ω*_{k}*t′*) + sin(*ω*_{k}*t*) sin(*ω*_{k}*t′*)].  (6)

Since the sinusoidal functions cos(*ω*_{k}*t*) and sin(*ω*_{k}*t*) form an orthonormal basis for *L*^{2}(*T*), (6) is the eigendecomposition of the covariance, where the eigenfunctions are sines and cosines with eigenvalues *λ*_{k}^{2}. From (2), we know that structured weights with this covariance form a random series

*w*(*t*) = Σ_{k} *λ*_{k} (*z*_{k} cos(*ω*_{k}*t*) + *z′*_{k} sin(*ω*_{k}*t*)),  (7)

where each *z*_{k}, *z′*_{k} ∼ 𝒩(0, 1) i.i.d. Thus, the receptive fields are made up of sinusoids weighted by *λ*_{k} and the Gaussian variables *z*_{k}, *z′*_{k}.

_{k}Suppose we want receptive fields that only retain specific frequency information of the signal and filter out the rest. Take *λ _{k}* = 0 for any

*k*where

*ω*<

_{k}*f*

_{lo}or

*ω*>

_{k}*f*

_{hi}. We call this a

*bandlimited*spectrum with passband [

*f*

_{lo},

*f*

_{hi}] and bandwidth

*f*

_{hi}−

*f*

_{lo}. As the bandwidth increases, the receptive fields become less smooth since they are made up of a wider range of frequencies. If the

*λ*are all nonzero but decay at a certain rate, this rate controls the smoothness of the resulting GP [91].

_{k}When these receptive fields act on input signals *x*(*t*), they implicitly transform the inputs into the Fourier basis and filter frequencies based on the magnitude of *λ _{k}*. In a bandlimited setting, any frequencies outside the passband are filtered out, which makes the receptive fields selective to a particular range of frequencies and ignore others. On the other hand, classical random features weight all frequencies equally, even though in natural settings high frequency signals are the most corrupted by noise.
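A numerical sketch of this bandlimited construction (grid size and passband are arbitrary choices): we sample the random series with flat in-band weights and check that an out-of-band sinusoid is filtered out while an in-band one is passed.

```python
import numpy as np

rng = np.random.default_rng(4)
L_len, d = 1.0, 256
t = np.linspace(0.0, L_len, d, endpoint=False)

# Keep only natural frequencies omega_k = 2*pi*k/L inside the passband.
f_lo, f_hi = 2 * np.pi * 5, 2 * np.pi * 15  # passband edges (arbitrary here)
ks = np.arange(1, 60)
omegas = 2 * np.pi * ks / L_len
keep = (omegas >= f_lo) & (omegas <= f_hi)

# Random series: in-band sinusoids weighted by Gaussian coefficients.
w = np.zeros(d)
for om in omegas[keep]:
    w += rng.normal() * np.cos(om * t) + rng.normal() * np.sin(om * t)

# By orthogonality of the Fourier basis, an out-of-band sinusoid has
# (numerically) zero inner product with w, unlike an in-band one.
x_out = np.cos(2 * np.pi * 40 * t)
x_in = np.cos(2 * np.pi * 10 * t)
print(abs(w @ x_out), abs(w @ x_in))
```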

#### 2.2.2 Insect mechanosensors

We next consider a particular biological sensor that is sensitive to the time-history of forces. Campaniform sensilla (CS) are dome-shaped mechanoreceptors that detect local stress and strain on the insect exoskeleton [23]. They are embedded in the cuticle and deformation of the cuticle through bending or torsion induces depolarizing currents in the CS by opening mechanosensitive ion channels. The CS encode proprioceptive information useful for body state estimation and movement control during diverse tasks like walking, kicking, jumping, and flying [23].

We will model the receptive fields of CS that are believed to be critical for flight control, namely the ones found at the base of the halteres [97] and on the wings [70] (Fig. 2A). Halteres and wings flap rhythmically during flight, and rotations of the insect’s body induce torsional forces that can be felt on these active sensory structures. The CS detect these small strain forces, thereby encoding angular velocity of the insect body [97]. Experimental results show haltere and wing CS are selective to a broad range of oscillatory frequencies [28, 70], with STAs that are smooth, oscillatory, selective to frequency, and decay over time [29] (Fig. 2B).

We model these temporal receptive fields with a locally stationary GP [33] with bandlimited spectrum. Examples of receptive fields generated from this GP are shown in Fig. 2C. The inputs to the CS are modeled as a finite continuous timeseries *x*(*t*) over the finite interval *T* = [0, *L*]. The neuron weights are generated from a covariance function

*C*(*t*, *t′*) = *e*^{−(t+t′)/γ} Σ_{k} *λ*_{k}^{2} cos(*ω*_{k}(*t* − *t′*)),  (8)

where *ω*_{k} = 2*πk*/*L* is the *k*th natural frequency. As in Section 2.2.1, the frequency selectivity of the weights is accounted for by the parameters *f*_{lo} and *f*_{hi}, with *λ*_{k} = 0 outside the passband. As the bandwidth *f*_{hi} − *f*_{lo} increases, the receptive fields are built out of a wider selection of frequencies. This makes the receptive fields less smooth (Fig. 2D). Each field is localized to near *t* = 0, and its decay with *t* is determined by the parameter *γ*. As *γ* increases, the receptive field is selective to larger time windows.

The eigenbasis of the covariance function (8) is similar to a Fourier eigenbasis modulated by a decaying exponential. The eigenbasis is an orthonormal basis for the span of *λ*_{k}*e*^{−t/γ} cos(*ω*_{k}*t*) and *λ*_{k}*e*^{−t/γ} sin(*ω*_{k}*t*), which are a non-orthogonal set of functions in *L*^{2}(*T*). The hidden weights transform timeseries inputs into this eigenbasis and discard components outside the passband frequencies [*f*_{lo}, *f*_{hi}].
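The following sketch samples receptive fields of this locally stationary form by eigendecomposing the covariance. The exponential-envelope-times-bandlimited-cosine expression in the code, and all parameter values, are illustrative stand-ins rather than the fitted model.

```python
import numpy as np

rng = np.random.default_rng(5)
L_len, d, gamma = 0.04, 200, 0.012  # 40 ms window; gamma roughly 12 ms
t = np.linspace(0.0, L_len, d, endpoint=False)

# Bandlimited set of natural frequencies (75-200 Hz passband, as a stand-in).
ks = np.arange(1, 100)
freqs = ks / L_len  # in Hz
omegas = 2 * np.pi * freqs[(freqs >= 75) & (freqs <= 200)]

# Locally stationary covariance: decaying envelope times stationary cosine sum.
T1, T2 = np.meshgrid(t, t, indexing="ij")
C = np.exp(-(T1 + T2) / gamma) * np.cos(omegas[:, None, None] * (T1 - T2)).sum(axis=0)
C *= d / np.trace(C)  # normalize Tr(C) = d

# Sample receptive fields w ~ N(0, C) via the eigendecomposition w = Phi Lam z.
evals, Phi = np.linalg.eigh(C)
Lam = np.sqrt(np.clip(evals, 0.0, None))
W = Phi @ (Lam[:, None] * rng.normal(size=(d, 5)))  # five sample RFs
```

Each column of `W` is a smooth, oscillatory receptive field localized near *t* = 0, qualitatively matching the CS STAs.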

We fit the covariance model to receptive field data from 95 CS neurons from wings of the hawkmoth *Manduca sexta* (data from [70]). Briefly, CS receptive fields were estimated as the spike-triggered average (STA) of experimental mechanical stimuli of the wings, where the stimuli were generated as bandpassed white noise (2–300 Hz).

To characterize the receptive fields of this population of CS neurons, we compute the data covariance matrix **C**_{data} by taking the inner product between the receptive fields. We normalize the trace to be the dimension of each receptive field (number of samples), which in this case is 40 kHz × 40 ms = 1600 samples. This normalization sets the overall scale of the covariance matrix. The data covariance matrix shows a tridiagonal structure (Fig. 3A). The main diagonal is positive while the off diagonals are negative. All diagonals decay away from the top left of the matrix.

To fit the covariance model to the data, we optimize the parameters (see Appendix A.3) *f*_{lo}, *f*_{hi}, and *γ*, finding *f*_{lo} = 75 Hz, *f*_{hi} = 200 Hz, and *γ* = 12.17 ms best fit the sensilla data. We do so by minimizing the Frobenius norm of the difference between **C**_{data} and the model (26). The resulting model covariance matrix (Fig. 3B) matches the data covariance matrix (Fig. 3A) remarkably well qualitatively. The normalized Frobenius norm of the difference between **C**_{data} and the model is 0.4386. Examples of biological receptive fields and random samples from this fitted covariance model are shown in the Appendix (Fig. 24). To simulate the effect of a finite number of neurons, we generate 95 weight vectors (equal to the number of neurons recorded) and recompute the model covariance matrix (Fig. 3C). We call this the finite neuron model covariance matrix **C**_{finite}, and it shows the bump and blob-like structures evident in **C**_{data} but not in **C**_{model}. This result suggests that these bumpy structures can be attributed to having a small number of recorded neurons. We hypothesize that these effects would disappear with a larger dataset and **C**_{data} would more closely resemble **C**_{model}.
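The fitting procedure can be sketched as a small grid search minimizing the normalized Frobenius distance. Here `model_cov` and the parameter grids are simplified stand-ins for the paper's model (26) and optimization method, and the demo recovers parameters from synthetic data generated by the model itself.

```python
import numpy as np
from itertools import product

def model_cov(t, f_lo, f_hi, gamma):
    # Stand-in locally stationary covariance with passband [f_lo, f_hi] (Hz).
    ks = np.arange(1, len(t))
    freqs = ks / (t[-1] - t[0])
    omegas = 2 * np.pi * freqs[(freqs >= f_lo) & (freqs <= f_hi)]
    T1, T2 = np.meshgrid(t, t, indexing="ij")
    C = np.exp(-(T1 + T2) / gamma) * np.cos(omegas[:, None, None] * (T1 - T2)).sum(axis=0)
    return C * len(t) / np.trace(C)  # trace normalization

def frob_dist(C_data, C_model):
    # Normalized Frobenius distance between data and model covariances.
    return np.linalg.norm(C_data - C_model) / np.linalg.norm(C_data)

def fit(C_data, t, grids):
    # Exhaustive search over the parameter grids; returns (distance, params).
    return min(
        (frob_dist(C_data, model_cov(t, fl, fh, g)), (fl, fh, g))
        for fl, fh, g in product(*grids)
    )

# Demo on synthetic "data" generated from the model itself.
t = np.linspace(0.0, 0.04, 100, endpoint=False)
C_data = model_cov(t, 75.0, 200.0, 0.012)
dist, params = fit(C_data, t, ([50.0, 75.0], [150.0, 200.0], [0.012, 0.02]))
print(dist, params)
```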

For comparison, we also calculate the Frobenius difference for null models, the unstructured covariance model and the Fourier model (5). For the unstructured model, the Frobenius norm difference is 0.9986 while that of the Fourier model is 0.9123. The sensilla covariance model has a much lower difference (0.4386) compared to the null models, fitting the data more accurately. We show the covariance matrices and sampled receptive fields from these null models in the Appendix A.4.1.

Comparing the eigenvectors and eigenvalues of the data and model covariance matrices, we find that the spectral properties of both **C**_{model} and **C**_{finite} are similar to that of **C**_{data}. The eigenvalue curves of the models match that of the data quite well (Fig. 3E); these curves are directly comparable because each covariance is normalized by its trace, which makes the sum of the eigenvalues unity. Further, all of the data and the model covariance matrices are low-dimensional. The first 10 data eigenvectors explain 97% of the variance, and the top 5 explain 90%. The top 5 eigenvectors of the model and its finite sample match that of the data quite well (Fig. 3D).

#### 2.2.3 Primary visual cortex

We now turn to visually driven neurons from the mammalian primary visual cortex. Primary visual cortex (V1) is the earliest cortical area for processing visual information (Fig. 4A). The neurons in V1 can detect small changes in visual features like orientations, spatial frequencies, contrast, and size.

Here, we model the receptive fields of simple cells in V1, which have clear excitatory and inhibitory regions such that light shone on the excitatory regions increases the cell’s response and vice-versa (Fig. 4B). The shape of the regions determines the orientation selectivity, while their widths determine the frequency selectivity. The receptive fields are centered to a location in the visual field and decay away from it. They integrate visual stimuli within a small region of this center [40]. Gabor functions are widely used as a mathematical model of the receptive fields of simple cells [45].

We model these receptive fields using another locally stationary GP [33] and show examples of generated receptive fields in Fig. 4C. Consider the inputs to the cortical cells to be a continuous two-dimensional image *x*(**t**), where the domain *T* = [0, *L*] × [0, *L′*] and **t** = (*t*_{1}, *t*_{2}) ∈ *T*. Since the image is real-valued, *x*(**t**) is the grayscale contrast or single color channel pixel values. The neuron weights are then generated from a covariance function of the following form:

*C*(**t**, **t′**) = exp(−‖**t** − **t′**‖^{2}/2*f*^{2}) exp(−(‖**t** − **c**‖^{2} + ‖**t′** − **c**‖^{2})/2*s*^{2}).  (9)

The receptive field center is defined by **c**, and the size of the receptive field is determined by the parameter *s*. As *s* increases, the receptive field extends farther from the center **c** (Fig. 4D). Spatial frequency selectivity is accounted for by the bandwidth parameter *f*. As *f* decreases, the spatial frequency of the receptive field goes up, making the weights less smooth (Fig. 4E).

The eigendecomposition of the covariance function (9) leads to an orthonormal basis of single scale *Hermite wavelets* [56, 57]. When **c** = **0**, the wavelet eigenfunctions are Hermite polynomials modulated by a decaying Gaussian:

*ϕ*_{k}(*t*) = *c*_{1} *H*_{k}(*c*_{2}*t*) *e*^{−c_{3}t^{2}},  (10)

where *H*_{k} is the *k*th (physicist’s) Hermite polynomial; eigenfunctions for nonzero centers **c** are just shifted versions of (10). The detailed derivation and values of the constants *c*_{1}, *c*_{2}, *c*_{3} and normalization are in Appendix A.5.
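A sketch of sampling V1-like receptive fields on the 14×36 pixel grid. The squared-exponential kernel with a Gaussian envelope used below is an assumed form consistent with the qualitative description of (9), and the parameter values are arbitrary rather than the fitted ones.

```python
import numpy as np

rng = np.random.default_rng(7)
h, w_px = 14, 36
ys, xs = np.mgrid[0:h, 0:w_px]
pts = np.column_stack([ys.ravel(), xs.ravel()]).astype(float)  # pixel grid

# Assumed covariance: smoothness between pixels at scale f, and a Gaussian
# envelope of size s around the center c (stand-in parameters throughout).
f, s, c = 1.5, 3.0, np.array([7.0, 18.0])
D2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)  # pairwise dist^2
R2 = ((pts - c) ** 2).sum(-1)                            # dist^2 to center
C = np.exp(-D2 / (2 * f**2)) * np.exp(-(R2[:, None] + R2[None, :]) / (2 * s**2))
C *= pts.shape[0] / np.trace(C)  # trace normalization

# Sample a localized, smooth receptive field w ~ N(0, C).
evals, Phi = np.linalg.eigh(C)
Lam = np.sqrt(np.clip(evals, 0.0, None))
rf = (Phi @ (Lam * rng.normal(size=pts.shape[0]))).reshape(h, w_px)
```

The sampled `rf` has significant weights only near the center and decays toward the edges of the visual field, as in Fig. 4C.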

We use (9) to model receptive field data from 8,358 V1 neurons recorded with calcium imaging from transgenic mice expressing GCaMP6s; the mice were headfixed and running on an air-floating ball. We presented 24,357 unique white noise images of 14×36 pixels using the Psychtoolbox [46], where the pixels were white or black with equal probability. Images were upsampled to the resolution of the screens via bilinear interpolation. The stimulus was corrected for eye-movements online using custom code. The responses of 45,026 cells were collected using a two-photon mesoscope [86] and preprocessed using Suite2p [65]. Receptive fields were calculated from the white noise images and the deconvolved calcium responses of the cells using the STA. For the covariance analysis, we picked cells above the signal-to-noise (SNR) threshold of 0.4; this gave us 8,358 cells. The SNR was calculated from a smaller set of 2,435 images that were presented twice using the method from [89]. As a preprocessing step, we moved the center of mass of every receptive field to the center of the visual field.

We compute the data covariance matrix **C**_{data} by taking the inner product between the receptive fields. We normalize the trace to be the dimension of each receptive field, which in this case is (14 × 36) pixels = 504 pixels. The data covariance matrix resembles a tridiagonal matrix. However, the diagonals are non-zero only at equally spaced segments. Additionally, their values decay away from the center of the matrix. We show **C**_{data} zoomed in at the non-zero region around the center of the matrix (Fig. 5A); this corresponds to the 180 × 180 pixel region around the center of the full 504 × 504 pixel matrix. The full covariance matrix is shown in the Appendix A.10 (Fig. 22).

In the covariance model, the number of off-diagonals, the center, and the rate of decay away from the center are determined by the parameters *f*, *c*, and *s*, respectively. The covariance between pixels decays as a function of their distance from the center *c*; this leads to the equally-spaced non-zero segments. The covariance also decays as a function of the distance between pixels, which gives the model its diagonal structure. When the frequency parameter *f* increases, the number of off-diagonals increases: pixels in the generated weights become more correlated, and the weights become spatially smoother. When the size parameter *s* increases, the diagonals decay more slowly away from the center *c*, increasing correlations with the center pixel and leading the significant weights to occupy more of the visual field.

We again optimize the parameters to fit the data (Appendix A.3.2), finding *s* = 1.87 and *f* = 0.70 pixels. We do so by minimizing the Frobenius norm of the difference between **C**_{data} and the model. We do not need to optimize over the center parameter *c*, since we preprocess the data so that all receptive fields are centered at *c* = (7, 18), the center of the 14×36 grid. The resulting model covariance matrix (Fig. 5B) and the data covariance matrix (Fig. 5A) match remarkably well qualitatively. The normalized Frobenius norm of the difference between **C**_{data} and the model is 0.2993. Examples of biological receptive fields and random samples from the fitted covariance model are shown in Fig. 23 in the Appendix. To simulate the effect of a finite number of neurons, we generate 8,358 weights, equal to the number of neurons in our data, to compute **C**_{finite}, shown in Fig. 5C. This finite matrix **C**_{finite} looks even more like **C**_{data}, and it shows that some, but not all, of the negative covariances far from the center result from finite sample size.
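The Frobenius-norm fit can be sketched as a small grid search. The covariance model below is an assumption on our part: a squared-exponential in pixel separation (scale *f*) under a Gaussian envelope around the center *c* (scale *s*), which has the qualitative structure described in the text but may differ from the paper's exact Eq. 9:

```python
import numpy as np
from itertools import product

def model_cov(h, w, s, f, c):
    # Build a trace-normalized covariance over an h x w pixel grid.
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.column_stack([ys.ravel(), xs.ravel()]).astype(float)
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)   # pairwise sq. distances
    env = np.exp(-((pts - c) ** 2).sum(-1) / (2 * s ** 2))    # envelope around center c
    C = np.exp(-d2 / (2 * f ** 2)) * env[:, None] * env[None, :]
    return C * (h * w) / np.trace(C)

h, w = 6, 8                                    # small grid for the sketch
c = np.array([h // 2, w // 2])
C_data = model_cov(h, w, s=2.0, f=1.0, c=c)    # pretend "data" with known parameters

# Grid search minimizing the Frobenius norm of the difference.
best = min(
    (np.linalg.norm(C_data - model_cov(h, w, s, f, c)), s, f)
    for s, f in product([1.0, 2.0, 3.0], [0.5, 1.0, 1.5])
)
```

Here the search recovers the generating parameters because the "data" matrix was itself produced by the model; with real data the minimum is nonzero, as the 0.2993 figure above reflects.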

For comparison, we also calculate the normalized Frobenius difference for two null models: the unstructured covariance model and the translation invariant V1 model (28). In the translation invariant model, we remove the spatially localizing exponential in Eq. 9 and fit only the spatial frequency parameter *f*. For the unstructured model, the Frobenius norm difference is 0.9835, while that of the translation invariant model is 0.9727. The V1 covariance model has a much lower difference (0.2993) and is a better fit to the data. We show the covariance matrices and sampled receptive fields from these null models in Appendix A.4.2.

Similar spectral properties are evident in the eigenvectors and eigenvalues of **C**_{model}, **C**_{finite}, **C**_{data}, and the analytical forms derived in (10) (Fig. 5D,E). As in Section 2.2.2, the covariances are normalized to have unit trace. Note that the analytical eigenfunctions are shown on a finer grid than the model and data because the analysis was performed in continuous space; the differences between the eigenfunctions and eigenvalues of the analytical and model results are due to discretization. Examining the eigenvectors (Fig. 5D), we also see a good match, although there are some rotations and differences in ordering. These 10 eigenvectors explain 68% of the variance in the receptive field data. For reference, the top 80 eigenvectors explain 86% of the variance in the data and all of the variance in the model. The eigenvalue curves of both the models and the analytical forms match those of the data (Fig. 5E) reasonably well, although not as well as for the mechanosensors. In Appendix A.10, we repeat this analysis for receptive fields measured with different stimulus sets in the mouse and a different experimental dataset from non-human primate V1. Our findings are consistent with the results shown above.
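The spectral comparison can be sketched as follows (the matrix here is synthetic; the 68% and 86% figures quoted above come from the real data):

```python
import numpy as np

# Eigendecompose a unit-trace covariance and report cumulative variance explained.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 50))
C = A @ A.T                       # a positive semidefinite "covariance"
C /= np.trace(C)                  # unit trace, as in the text

evals, evecs = np.linalg.eigh(C)  # eigh returns eigenvalues in ascending order
evals = evals[::-1]               # sort descending
var_explained = np.cumsum(evals) / evals.sum()
top10 = var_explained[9]          # fraction explained by the top 10 eigenvectors
```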

### 2.3 Advantages of structured random weights for artificial learning tasks

Our hypothesis is that the neuronal inductive bias from structured receptive fields allows networks to learn with fewer neurons, fewer training examples, and fewer steps of gradient descent for classification tasks with naturalistic inputs. To examine this hypothesis, we compare the performance of structured receptive fields against classical ones on several classification tasks. We find that, for most artificial learning tasks, structured random networks learn more accurately with smaller networks, fewer training examples, and fewer gradient steps.

#### 2.3.1 Frequency detection

CS naturally encode the time-history of strain forces acting on the insect body and sensors inspired by their temporal filtering properties have been shown to accurately classify spatiotemporal data [59]. Inspired by this result, we test sensilla-inspired mechanosensory receptive fields from Section 2.2.2 on a timeseries classification task (Fig. 6A, top). Each example presented to the network is a 100 ms timeseries sampled at 2 kHz so that d = 200, and the goal is to detect whether or not each example contains a sinusoidal signal. The positive examples are sinusoidal signals with *f*_{1} = 50 Hz and corrupted by noise so that their SNR = 1.76 (2.46 dB). The negative examples are Gaussian white noise with matched amplitude to the positive examples. Note that this frequency detection task is not linearly separable because of the random phases in positive and negative examples. See Section A.7 for additional details including the definition of SNR and how cross-validation was used to find the optimal parameters *f*_{lo} = 10 Hz, *f*_{hi} = 60 Hz, and *γ* = 50 ms.
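A sketch of this dataset generation (our simplified reading of the task; the paper's exact SNR definition is in Section A.7):

```python
import numpy as np

# 100 ms at 2 kHz gives d = 200 samples. Positives: noisy 50 Hz sinusoids
# with random phase. Negatives: amplitude-matched white noise.
rng = np.random.default_rng(0)
fs, dur, f1 = 2000, 0.1, 50.0
t = np.arange(int(fs * dur)) / fs

def positive(snr=1.76):
    phase = rng.uniform(0, 2 * np.pi)   # random phase makes the task nonlinear
    sig = np.sin(2 * np.pi * f1 * t + phase)
    noise = rng.standard_normal(t.size)
    # Scale noise so that ||sig||^2 / ||noise||^2 = snr (simplified SNR).
    noise *= np.linalg.norm(sig) / (np.sqrt(snr) * np.linalg.norm(noise))
    return sig + noise

def negative():
    x = rng.standard_normal(t.size)
    return x * np.linalg.norm(positive()) / np.linalg.norm(x)  # amplitude-matched

X = np.stack([positive() for _ in range(10)] + [negative() for _ in range(10)])
y = np.array([1] * 10 + [0] * 10)
```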

For the same number of hidden neurons, the structured RFN significantly outperforms a classical RFN. We show test performance using these tuned parameters in Fig. 6A. Even in this noisy task, the structured RFN achieves 0.5% test error using only 25 hidden neurons, whereas the classical network takes 300 neurons to achieve similar error.
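To make the comparison concrete, here is a minimal RFN sketch with band-limited hidden weights (a qualitative stand-in for sampling from the mechanosensory GP, not the paper's exact procedure) and a least-squares readout on a toy version of the detection task:

```python
import numpy as np

rng = np.random.default_rng(0)
fs, d = 2000, 200
t = np.arange(d) / fs

def structured_weights(m, f_lo=10.0, f_hi=60.0, n_modes=8):
    # Stand-in "structured" weights: sums of random sinusoids in [f_lo, f_hi].
    freqs = rng.uniform(f_lo, f_hi, (m, n_modes))
    phases = rng.uniform(0, 2 * np.pi, (m, n_modes))
    W = np.sin(2 * np.pi * freqs[..., None] * t + phases[..., None]).sum(1)
    return W / np.linalg.norm(W, axis=1, keepdims=True)

def rfn_features(X, W):
    return np.maximum(W @ X.T, 0).T          # ReLU of fixed hidden projections

# Toy task: noisy 50 Hz sinusoids vs. white noise.
X_pos = (np.sin(2 * np.pi * 50 * t + rng.uniform(0, 2 * np.pi, (50, 1)))
         + 0.5 * rng.standard_normal((50, d)))
X_neg = rng.standard_normal((50, d))
X = np.vstack([X_pos, X_neg])
y = np.r_[np.ones(50), -np.ones(50)]

H = rfn_features(X, structured_weights(25))   # only 25 hidden neurons
beta = np.linalg.lstsq(H, y, rcond=None)[0]   # trained readout weights
acc = np.mean(np.sign(H @ beta) == y)
```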

Predictably, the performance suffers when the weights are *incompatible* with the task. We show results when *f*_{lo} = 10 Hz and *f*_{hi} = 40 Hz and the same γ (Fig. 6A). The incompatible RFN performs better than chance (50% error) but much worse than the classical RFN. It takes 300 neurons just to achieve 16.3% test error. The test error does not decrease below this level even with additional hidden neurons.

#### 2.3.2 Frequency XOR task

To challenge the mechanosensor-inspired networks on a more difficult task, we build a frequency Exclusive-OR (XOR) problem (Fig. 6B, top). XOR is a binary function which returns true if and only if its two inputs differ, and false otherwise. XOR is a classical example of a function that is not linearly separable and thus harder to learn. Our inputs are again 100 ms timeseries sampled at 2 kHz. The inputs either contain a pure frequency of *f*_{1} = 50 Hz or *f*_{2} = 80 Hz, mixed frequency signals with both *f*_{1} and *f*_{2}, or white noise. In both the pure and mixed frequency cases, we add noise so that the SNR = 1.76. See A.7 for details. The goal of the task is to output true if the input contains either pure tone and false if the input contains mixed frequencies or is white noise.
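The four input subclasses and their XOR labels can be sketched as follows (noise handling simplified relative to the paper's A.7):

```python
import numpy as np

rng = np.random.default_rng(0)
fs, d, f1, f2 = 2000, 200, 50.0, 80.0
t = np.arange(d) / fs

def tone(f):
    return np.sin(2 * np.pi * f * t + rng.uniform(0, 2 * np.pi))

def make_example(kind):
    # Label is 1 (true) for pure tones, 0 (false) for mixed or noise.
    if kind == "f1":
        x, y = tone(f1), 1
    elif kind == "f2":
        x, y = tone(f2), 1
    elif kind == "mix":
        x, y = tone(f1) + tone(f2), 0
    else:
        x, y = rng.standard_normal(d), 0
    return x + 0.3 * rng.standard_normal(d), y

X, y = zip(*(make_example(k) for k in ["f1", "f2", "mix", "noise"] * 5))
X, y = np.stack(X), np.array(y)
```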

We tune the GP covariance parameters *f*_{lo}, *f*_{hi}, and *γ* from (8) using cross-validation. The cross validation procedure and algorithmic details are identical to that of the frequency detection task in Section 2.3.1. Using cross validation, we find the optimal parameters to be *f*_{lo} = 50 Hz, *f*_{hi} = 90 Hz, and *γ* = 40 ms. For incompatible weights, we take *f*_{lo} = 10 Hz, *f*_{hi} = 60 Hz, and the same *γ*.

The structured RFN significantly outperforms the classical RFN for the same number of hidden neurons. We show network performance using these parameters in Fig. 6B. Classification error of 1% can be achieved with 25 hidden neurons. In sharp contrast, the classical RFN requires 300 hidden neurons just to achieve 6% error. With incompatible weights, the network needs 300 neurons to achieve even 15.1% test error and does not improve with larger network sizes. Out of the four input subclasses, it consistently fails to classify pure 80 Hz sinusoidal signals, which lie outside its passband.

#### 2.3.3 Image classification

We next test the V1-inspired receptive fields from Section 2.2.3 on two standard digit classification tasks, MNIST [51] and KMNIST [19]. The MNIST and KMNIST datasets each contain 70,000 images of handwritten digits. In MNIST, these are the Arabic numerals 0–9, whereas KMNIST has 10 Japanese *hiragana* phonetic characters. Both datasets come split into 60,000 training and 10,000 test examples. With 10 classes, there are 6,000 training examples per class. Every example is a 28× 28 grayscale image with centered characters.

Each hidden weight has its center *c* chosen uniformly at random from all pixels. This ensures that the network's weights uniformly cover the image space and, in fact, means that the network can represent any sum of locally-smooth functions (see Section A.6). We use a network with 1,000 hidden neurons and tune the GP covariance parameters *s* and *f* from (9) using 3-fold cross validation on the MNIST training set. Each parameter ranges from 1 to 20 pixels, and the optimal parameters are found with a grid search. We find the optimal parameters to be *s* = 5 pixels and *f* = 2 pixels. We then refit the optimal model using the entire training set. The parameters from MNIST were used on the KMNIST task without additional tuning.
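Sampling a V1-like hidden weight with a random center can be sketched as below. The covariance is our stand-in with the qualitative structure described in the text (length scale *f*, envelope size *s*); the paper's exact Eq. 9 may differ in detail:

```python
import numpy as np

rng = np.random.default_rng(0)
h, w = 28, 28
pts = np.column_stack(np.unravel_index(np.arange(h * w), (h, w))).astype(float)
d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)   # pairwise sq. distances

def sample_weight(s=5.0, f=2.0):
    c = rng.integers(0, [h, w])                            # random center pixel
    env = np.exp(-((pts - c) ** 2).sum(-1) / (2 * s ** 2))
    C = np.exp(-d2 / (2 * f ** 2)) * env[:, None] * env[None, :]
    # Sample w ~ N(0, C); a small jitter keeps the Cholesky factor stable.
    L = np.linalg.cholesky(C + 1e-6 * np.eye(h * w))
    return L @ rng.standard_normal(h * w)

W = np.stack([sample_weight() for _ in range(4)])
```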

The V1-inspired RFN achieves much lower average classification error than the classical RFN for the same number of hidden neurons. We show learning performance using these parameters on the MNIST task in Fig. 7A. Achieving 6% error on MNIST requires 100 neurons for the structured RFN versus 1,000 neurons for the classical RFN, and the structured RFN achieves 2.5% error with 1,000 neurons. Qualitatively similar results hold for the KMNIST task (Fig. 7B), although the overall errors are larger, reflecting the harder task. Achieving 28% error on KMNIST requires 100 neurons for the structured RFN versus 1,000 neurons for the classical RFN, and the structured RFN achieves 13% error with 1,000 neurons.

Again, network performance suffers when the GP covariance parameters do not match the task. This happens if the size parameter *s* is smaller than the stroke width or the spatial scale *f* does not match the stroke variations in the character. Taking the incompatible parameters *s* = 0.5 and *f* = 0.5 (Fig. 7A, B), the structured RFN performs worse than the classical RFN in both tasks. With 1,000 hidden neurons, it achieves the relatively poor test errors of 8% on MNIST (Fig. 7A) and 33% on KMNIST (Fig. 7B).

#### 2.3.4 Structured weights improve generalization with limited data

Alongside learning with fewer hidden neurons, V1 structured RFNs also learn more accurately from fewer examples. We test few-shot learning using the image classification datasets from Section 2.3.3. The training examples are reduced from 60,000 to 50, or only 5 training examples per class. The test set and GP parameters remain the same.

Structured encodings allow learning with fewer samples than unstructured encodings. We show these few-shot learning results in Fig. 7C and D. The networks' performance saturates past a few hundred hidden neurons. For MNIST, the lowest error achieved by the V1 structured RFN is 27%, versus 33% for the classical RFN and 37% using incompatible weights (Fig. 7C). The structured network achieves 61% error on the KMNIST task, as opposed to 66% for the classical RFN and 67% using incompatible weights (Fig. 7D).

#### 2.3.5 Networks train faster when initialized with structured weights

Now we study the effect of structured weights as an initialization strategy for fully-trained neural networks, where all weights in the network vary. We hypothesized that structured initialization allows networks to learn faster, i.e. that the training loss and test error decrease faster than with unstructured weights. We have shown that the performance of RFNs improves with biologically inspired weight sampling; however, in RFNs (1) only the readout weights *β* are modified with training, and the hidden weights **W** are frozen at their initial values.

We compare the biologically-motivated initialization with a classical initialization in which each weight is drawn with variance inversely proportional to the fan-in, **W**_{ij} ∼ N(0, 2/*d*). This initialization is widely known as the "Kaiming He normal" scheme and is thought to stabilize training dynamics by controlling the magnitude of the gradients [39]. The classical approach ensures that E‖*w*‖^{2} = 2, so for a fair comparison we scale our structured weight covariance matrix to have Tr(**C**) = 2. In our studies with RFNs the trace is equal to *d*, but this weight scale can be absorbed into the readout weights *β* due to the homogeneity of the ReLU.

We again compare structured and unstructured weights on the MNIST and KMNIST tasks, common benchmarks for fully-trained networks. The architecture is a single hidden layer feedforward neural network (Fig. 1) with 1,000 hidden neurons. The cross-entropy loss over the training set is minimized using simple gradient descent (GD) for 3,000 epochs. For a fair comparison, the learning rate is optimized for each network separately. We define the area under the training loss curve as a metric for the speed of learning and perform a grid search over learning rates in the range (10^{−4}, 10^{0}) to minimize this metric, resulting in learning rates of 0.23, 0.14, and 0.14 for the structured, unstructured, and incompatible networks, respectively. All other parameters are the same as in Section 2.3.3.
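The weight-scale matching described above can be checked numerically (a sketch; **C** here is an arbitrary covariance, and the 2/*d* variance is the Kaiming-style choice implied by E‖*w*‖² = 2):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100
A = rng.standard_normal((d, d))
C = A @ A.T                      # some structured covariance
C *= 2.0 / np.trace(C)           # rescale so Tr(C) = 2

# Kaiming-style unstructured draws: w_i ~ N(0, 2/d) gives E||w||^2 = 2.
w_unstruct = rng.standard_normal((5000, d)) * np.sqrt(2.0 / d)

# Structured draws w ~ N(0, C) have E||w||^2 = Tr(C) = 2 as well.
L = np.linalg.cholesky(C + 1e-12 * np.eye(d))
w_struct = (L @ rng.standard_normal((d, 5000))).T
```

Both ensembles now place the hidden weights at the same overall scale, isolating the effect of covariance structure from the effect of magnitude.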

In both the MNIST and KMNIST tasks, the V1-initialized network minimizes the loss function faster than the classically initialized network. For the MNIST task, the V1 network achieves a loss value of 0.05 after 3,000 epochs compared to 0.09 for the other network (Fig. 8A). We see qualitatively similar results for the KMNIST task. At the end of training, the V1-inspired network's loss is 0.08, while the classically initialized network only reaches 0.12 (Fig. 8B). We find that the V1-initialized network performs no better than classical initialization when the covariance parameters do not match the task. With incompatible parameters, the V1-initialized network achieves a loss value of 0.11 on MNIST and 0.15 on KMNIST.

Not only does it minimize the training loss faster, the V1-initialized network also generalizes well and achieves a lower test error at the end of training. For MNIST, it achieves 1.7% test error compared to 3.3% error for the classically initialized network and 3.6% using incompatible weights (Fig. 8C). For KMNIST, we see 9% error compared to 13% error with classical initialization and 15% using incompatible weights (Fig. 8D).

We see similar results across diverse hidden layer widths and learning rates (Fig. 25–28), with the benefits most evident for wider networks and smaller learning rates. Furthermore, the structured weights show similar results when trained for 10,000 epochs (rate 0.1; 1,000 neurons; not shown) and with other optimizers like minibatch Stochastic Gradient Descent (SGD) and ADAM (batch size 256, rate 0.1; 1,000 neurons; not shown). Structured initialization facilitates learning across a wide range of networks.

However, the improvement is not universal: no significant benefit was found when initializing the early convolutional layers of the deep network AlexNet [50] and applying it to the ImageNet dataset [78], as shown in Appendix A.12. The large amount of training data and the fact that only a small fraction of the network was initialized with structured weights could explain this null result. Also, in many of these scenarios the incompatible structured weights reach performance on par with the compatible ones by the end of training, once the poor inductive bias is overcome.

#### 2.3.6 Improving representation with structured random weights

We have shown how structured receptive field weights can improve the performance of RFNs and fully-trained networks on a number of supervised learning tasks. As long as the receptive fields are compatible with the task, performance gains over unstructured features are possible. If they are incompatible, the network performs no better, or even worse, than with classical unstructured weights.

These results can be understood with the theoretical framework of Section 2.1. Structured weights effectively cause the input * x* to undergo a linear transformation into a new representation following Theorem 1. In all of our examples, this new representation is bandlimited due to how we design the covariance function.

By moving to a bandlimited representation, we both filter out noise (the high-frequency components) and reduce dimensionality (coordinates of the transformed input outside the passband are zero). In general, noise and dimensionality both make learning harder.

It is easiest to understand these effects in the frequency detection task. For simplicity, assume we are using the stationary features of our warm-up Section 2.2.1 to do frequency detection. In this task, all of the signal power is contained in the *f*_{1} = 50 Hz frequency, and everything else is due to noise. If the weights are compatible with the task, this means that *w* is a sum of sines and cosines of frequencies *ω*_{k} in some passband which includes *f*_{1}. The narrower we make this bandwidth while still retaining the signal, the higher the SNR of the transformed input becomes, since more noise is filtered out (Appendix A.8).
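A numerical illustration of this argument, using an explicit Fourier passband as a stand-in for the random-feature filtering (Appendix A.8 gives the formal statement):

```python
import numpy as np

rng = np.random.default_rng(0)
fs, d, f1 = 2000, 200, 50.0
t = np.arange(d) / fs
sig = np.sin(2 * np.pi * f1 * t)
x = sig + rng.standard_normal(d)              # noisy 50 Hz signal

# Keep only Fourier modes in a narrow band around f1, zero the rest.
freqs = np.fft.rfftfreq(d, 1 / fs)
keep = (freqs >= 40) & (freqs <= 60)
x_filt = np.fft.irfft(np.where(keep, np.fft.rfft(x), 0), n=d)

def snr(y):
    # Power along the known signal direction vs. residual power.
    s = (y @ sig) / (sig @ sig) * sig
    return (s ** 2).sum() / ((y - s) ** 2).sum()
```

Because the passband contains all of the signal power but only a small fraction of the noise bins, the SNR of the filtered input is strictly higher.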

## 3 Discussion

In this paper, we describe a random generative model for the receptive fields of sensory neurons. Specifically, we model each receptive field as a random filter sampled from a Gaussian process (GP) with covariance structure matched to the statistics of experimental neural data. We show that two kinds of sensory neurons—insect mechanosensory and simple cells in mammalian V1— have receptive fields that are well-described by GPs. In particular, the generated receptive fields, their second-order statistics, and their principal components match with receptive field data. Theoretically, we show that individual neurons perform a randomized transformation and filtering on the inputs. This connection provides a framework for sensory neurons to compute input transformations like Fourier and wavelet transforms in a biologically plausible way.

Our numerical results using these structured random receptive fields show that they offer better learning performance than unstructured receptive fields on several benchmarks. The structured networks achieve higher test performance with fewer neurons and fewer training examples, unless the frequency content of their receptive fields is incompatible with the task. In networks that are fully trained, initializing with structured weights leads to better network performance (as measured by training loss and generalization) in fewer iterations of gradient descent. Structured random features may be understood theoretically as transforming inputs into an informative basis that retains the important information in the stimulus while filtering away irrelevant signals.

### 3.1 Modeling other sensory neurons and modalities

The random feature formulation is a natural extension of the traditional linear-nonlinear (LN) neuron model. This approach may be applied to other brain regions where LN models are successful, for instance sensory areas with primarily feedforward connectivity like somatosensory and auditory regions. The neurons in auditory and somatosensory systems are selective to both spatial and temporal structures in their stimuli [47, 79, 71], and spatial structure emerges in networks trained on artificial tactile tasks [100]. Their receptive fields could be modeled by GPs with spatiotemporal covariance functions [95]; these could be useful for artificial tasks with spatiotemporal stimuli such as movies and multivariate timeseries. Neurons with localized but random temporal responses were found to be compatible with manifold coding in a decision-making task [48]. Our GPs are a complementary approach to traditional sparse coding [63] and efficient coding [5, 16] hypotheses; the connections to these other theories are interesting for future research.

### 3.2 Receptive fields in development

Our generative model offers new directions to explore the biological basis and computational principles behind receptive fields. Development lays a basic architecture that is conserved from animal to animal [90, 88], yet the details of every neural connection cannot be specified [99], leading to some amount of inevitable randomness at least initially [15]. If receptive fields are random with constrained covariance, it is natural to ask how biology implements this. Unsupervised Hebbian dynamics with local inhibition can allow networks to learn principal components of their input [61, 68]. An interesting future direction is how similar learning rules may give rise to overcomplete, nonorthogonal structure similar to what has been studied here. This may prove more biologically plausible than weights that result from task-driven optimization.

The above assumes that receptive field properties actually lie within synaptic weights. For spatial receptive fields, this assumption is plausible [75], but the temporal properties of receptive fields are more likely a result of neurons’ intrinsic dynamics, for which the LN framework is just a model [64, 92, 25]. Heterogeneous physiological (e.g. resonator dynamics) and mechanical (position and shape of mechanosensor relative to body structure) properties combine to give the diverse temporal receptive field structures [7]. Development thus leverages different mechanisms to build structure into receptive field properties of sensory neurons.

### 3.3 Connections to compressive sensing

Random projections have seen extensive use in the field of compressive sensing, where a high-dimensional signal can be recovered from only a few measurements so long as it has a sparse representation [24, 27, 32]. Random compression matrices are known to have optimal properties; however, in many cases structured randomness is more realistic. Recent work has shown that structured random projections with local wiring constraints (in one dimension) are compatible with dictionary learning [26], supporting previous empirical results [6]. Our work shows that structured random receptive fields are equivalent to employing a wavelet dictionary and a dense Gaussian projection.

### 3.4 Machine learning and inductive bias

An important open question for both neuroscience and machine learning is why certain networks, characterized by features such as their architecture, weights, and nonlinearities, are better than others for certain problems. One perspective is that a network is good for a problem if it is biased towards approximating functions that are close to the target, known as an *inductive bias*, which depends on an alignment between the features encoded by neurons and the task at hand [11]. Our approach shows that structured receptive fields are equivalent to a linear transformation of the input that can build in such biases. Furthermore, we can describe the nonlinear properties of the network using the kernel, which varies depending on the receptive field structure. If the target function has a small norm in this RKHS, then there is an inductive bias and it is easier to learn [83, 82]. A small norm in the RKHS means that the target function varies smoothly over the inputs, and smooth functions are easier to learn than rapidly varying ones. In this way, the receptive field structure influences how easily the target function is learned. We conjecture that receptive fields from neural-inspired distributions shape the RKHS geometry so that the target function's norm is smaller than in the RKHS of random white-noise receptive fields. We leave verifying this conjecture in detail to future work.

Networks endowed with principles of neural computation like batch normalization, pooling of inputs, and residual connections have been found to contain inductive biases for certain learning problems [96, 38]. Learning data-dependent kernels is another way to add in inductive bias [85]. We also saw that initializing fully-trained networks from our generative models improved their speed of convergence and generalization compared to unstructured initialization. This result is consistent with known results that initialization has an effect on generalization [4]. The initialization literature has mostly been focused on avoiding exploding/vanishing gradients [39, 34]. Here, we conjecture that the inductive bias in our structured connectivity places the network closer to a good solution in the loss landscape [99].

The random V1-inspired receptive fields that we model can be seen as similar to what happens in a convolutional neural network (CNN) [62], which has similarities and differences compared to brains [52]. A recent study showed that CNNs with a fixed V1-like convolutional layer are more robust to adversarial perturbations of their inputs [22]. In a similar vein to our work, using randomly sampled Gabor receptive fields in the first layer of a deep network was also shown to improve its performance [42]. The wavelet scattering transform is a multi-layer network where wavelet coefficients are passed through nonlinearities, a model which is similar to deep CNNs [55, 13, 1]. Our framework differs as a randomized model and yields wavelets of a single scale, and similar studies of robustness and learning in deep networks with our weights are possible. Adding layers to our model or sampling weights with a variety of spatial frequencies and field sizes would yield random networks that behave similarly to the scattering transform, offering another connection between the brain and CNNs. Directly learning filters in a Hermite wavelet basis led to good performance in ANNs with little data [43], and this idea was extended to multiple scales by [69]. Our structured random features can be seen as an RFN version of those ideas, with supporting evidence that these principles are used in biology.

### 3.5 Limitations and future directions

There are several limitations to the random feature approach. We model neuron responses with scalar firing rates instead of discrete spikes, and we ignore complex neuronal dynamics, neuromodulatory context, and many other details. Like most LN models, the random feature model assumes zero plasticity in the hidden layer neurons. However, associative learning can drive changes in receptive fields of individual neurons in sensory areas like V1 and auditory cortex [35, 30]. Further, our RFN is purely feedforward and cannot account for feedback connections. Recent work suggests that a feedforward architecture lacks sufficient computational power to serve as a detailed input-output model for a network of cortical neurons; it might need additional layers with convolutional filters [8]. It can be difficult to interpret the parameters found from fitting receptive field data and connect them to experimental conditions. Also, the GP model of weights only captures covariance (second moments) and neglects higher-order statistics. It remains to be shown how the theory can yield concrete predictions that can be tested under *in vivo* experimental conditions.

The random feature receptive field model is a randomized extension of the LN neuron model. The LN model fits a parameterized function to each receptive field [18]. In contrast, the random feature framework fits a distribution to an entire population of receptive fields and generates realistic receptive fields from that distribution. A natural question is how they compare. If the goal is to capture individual differences between neuronal receptive fields, one should resort to an LN model where each neuron’s receptive field is fit to data. The random feature model is not as flexible, but it provides a direct connection to random feature theory, and it is mathematically tractable and generative. This connection to kernel learning opens the door to using techniques which are a mainstay in machine learning theory literature, for instance to estimate generalization error and sample complexity [83], in the context of learning in more biologically realistic networks.

We see several future directions for structured random features in connecting computational neuroscience and machine learning. As already stated, the auditory, somatosensory, and tactile regions are good candidates for further study, as are the developmental principles that could give rise to random yet structured receptive field properties. To account for plasticity in the hidden layer, one could analyze the neural tangent kernel (NTK) associated with structured features [44]. These kernels are often used to analyze ANNs trained with gradient descent when the number of hidden neurons is large and the step size is small [2]. To incorporate lateral and feedback connections, the weights could be sampled from GPs with recurrent covariance functions [58]. Our theory may also help explain why CNNs with a fixed V1-like convolutional layer are more robust to adversarial input perturbations [22]: such a layer filters out high-frequency corruptions, and it seems likely that structured random features will be more robust in the same way. It would also be interesting to examine the intermediate-layer weights of fully-trained networks as approximate samples from a GP by studying their covariance structure. Finally, one could develop other covariance functions and further optimize these RFNs for more sophisticated learning tasks, to see whether high performance (lower error, faster training, etc.) on more difficult tasks is possible.

## Author contributions

Conceptualization: K.D.H.; Mathematical Analysis: B.P., K.D.H.; Data Acquisition: M.P.; Writing–original draft: B.P., K.D.H.; Writing–review & editing: B.P., M.P., B.W.B., K.D.H.; Figures: B.P., B.W.B., K.D.H.; Supervision: B.W.B., K.D.H.

## Funding

B.P. was supported by a UW Applied Math Frederic Wan Endowed Fellowship, Terry Keegan Memorial ARCS Endowment Fellowship, and Natural Science Foundation Graduate Research Fellowship Program under Grant No. DGE-1762114. M.P. was supported by the Janelia Research Campus, Howard Hughes Medical Institute. B.W.B. was supported by grants FA9550-19-1-0386 & FA9550-18-1-0114 from the Air Force Office of Scientific Research. K.D.H. was supported by the Washington Research Foundation postdoctoral fellowship and Western Washington University.

## Acknowledgements

We thank Dario Ringach for providing the macaque V1 data and Brandon Pratt for the hawkmoth mechanosensor data. We are grateful to Ali Weber, Steven Peterson, and Owen Levin for useful discussions. We thank Sarah Lindo, Michalis Michaelos, and Carsen Stringer for help with mouse surgeries, calcium imaging, and data processing, respectively.

## A Appendix

### A.1 Function spaces for wide networks with structured receptive fields

RFNs are intimately connected to a popular class of supervised learning algorithms called kernel methods. As the network width grows, the inner product between the feature representations of two inputs *x*, *x′* converges to a reproducing kernel

k(*x*, *x′*) = lim_{m→∞} (1/m) ∑_{j=1}^{m} φ(⟨*w*_{j}, *x*⟩) φ(⟨*w*_{j}, *x′*⟩) = E_{*w*}[φ(⟨*w*, *x*⟩) φ(⟨*w*, *x′*⟩)]. (11)

The kernel defines a reproducing kernel Hilbert space (RKHS) of functions. The explicit forms of the kernels corresponding to classical RFNs are known for several nonlinear activation functions. For example, with the ReLU nonlinearity, no threshold, and unstructured Gaussian weights *w* ∼ N(0, **I**), the kernel is the arc-cosine kernel k_{unstruct}(*x*, *x′*) = (‖*x*‖ ‖*x′*‖ / 2π) (sin θ + (π − θ) cos θ), where θ is the angle between *x* and *x′* [7].

We derive the kernel induced by our RFNs with hidden weights initialized from GPs. In this section we work in the discrete setting, but the continuous version is analogous. By the definition (11), the network equation (1), and the basis change Theorem 1, the kernel for structured features is

k_{struct}(*x*, *x′*) = k_{unstruct}(*x̃*, *x̃′*), (12)

where *x̃* and *x̃′* are the transformed inputs of Theorem 1. Thus, the induced kernels from structured weights can be found in terms of unstructured weight kernels acting on the transformed inputs *x̃* and *x̃′*. Taking ReLU as the nonlinearity, for example, we get k_{struct}(*x*, *x′*) = (‖*x̃*‖ ‖*x̃′*‖ / 2π) (sin θ̃ + (π − θ̃) cos θ̃), where θ̃ is the angle between *x̃* and *x̃′*.

Every RKHS comes with an inner product ⟨·, ·⟩_{H} and norm ‖·‖_{H}. The norm and inner product can be expressed in terms of the eigenvalues and eigenfunctions of the kernel itself, analogous to the eigendecomposition of the covariance function of the GP weights. Although it is beyond the scope of our paper to explain the theory in detail, there are well-established results showing that functions with small ‖·‖_{H}-norm are easier to learn than those with larger norm for a wide variety of kernel-based algorithms [83, 82]. In ridge regression, this effect is again equivalent to projection and filtering in the kernel eigenbasis, i.e. linear filtering in function space. Finally, end-to-end trained networks where the weights **W** are optimized may be studied with the related neural tangent kernel (NTK) when the step size is small [44]. The basis change formula, Theorem 1, and (12) give us a way to understand the RKHS of the structured network in terms of an unstructured network's RKHS acting on the transformed inputs *x̃*.
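The wide-network kernel limit and the ReLU arc-cosine form can be checked empirically (a sketch using the standard arc-cosine kernel of order one):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 200000
x, xp = rng.standard_normal(d), rng.standard_normal(d)

# Monte Carlo estimate of (1/m) sum_j relu(w_j.x) relu(w_j.x') with w ~ N(0, I).
W = rng.standard_normal((m, d))
empirical = np.mean(np.maximum(W @ x, 0) * np.maximum(W @ xp, 0))

# Closed-form arc-cosine kernel of order one for the same weights.
nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
theta = np.arccos(np.clip(x @ xp / (nx * nxp), -1.0, 1.0))
analytic = nx * nxp / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))
```

With `m` this large, the Monte Carlo average sits within a few percent of the closed form; swapping `x` for a transformed input gives the structured kernel of (12).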

#### A.1.1 Kernel eigenfunctions differ with structured weights

The structured RKHS has eigenfunctions which are different from the eigenfunctions of the unstructured RKHS. To see this, it is necessary to introduce a probability measure *μ*(*x*) for the data. Kernel learning is often understood [14] in the orthonormal basis for *L*^{2}(*μ*) given by the eigenfunctions *ψ*_{i} of the integral operator defined by

A natural question to ask is: how do the eigensystems of the structured and unstructured kernels compare? The mapping from *x* to the transformed input induces a pushforward measure, which we will call *ν*, and since the mapping is linear, *ν* is easy to characterize. (Note that if *μ* is multivariate Gaussian, then *ν* is Gaussian with a different covariance.) Thus the integral operator under the data measure is equivalent to integrating the kernel *k*_{unstruct} under the pushforward measure. Because the two measures are different, there is no general relationship that holds between the eigenfunctions. Different measures leading to different kernel eigenfunctions and eigenvalues can explain why structured weights have strong effects on learning, as shown by recent work demonstrating that areas of low input density are learned more slowly [4].
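The linear pushforward is easy to verify numerically: a linear map sends a Gaussian measure to a Gaussian whose covariance is conjugated by the map. A small sketch (the matrices here are arbitrary stand-ins for the basis change and filter):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

# data measure mu: zero-mean Gaussian with a random covariance Sigma
A = rng.standard_normal((d, d))
Sigma = A @ A.T

# a linear transform z = M x (stand-in for the basis change plus filtering)
M = rng.standard_normal((3, d))

x = rng.multivariate_normal(np.zeros(d), Sigma, size=200_000)
z = x @ M.T

# pushforward measure nu is Gaussian with covariance M Sigma M^T
Sigma_nu_empirical = np.cov(z, rowvar=False)
Sigma_nu_theory = M @ Sigma @ M.T
```

The empirical covariance of the transformed samples matches the conjugated covariance up to sampling error.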

### A.2 Kernel theory for frequency detection: a fully worked, simplified example

To demonstrate how the kernel theory can also explain the benefits of structured random features, we developed a simplified frequency detection task. We fully explain the orthonormal basis, target function in that basis, kernels, and how the target function is harder to learn without knowledge of the structure. To our knowledge, this is the first technical result that explains how transforming the input data into an informative, lower-dimensional representation (essentially, preprocessing) leads to improved learning with kernel methods.

#### A.2.1 Data distribution and orthonormal basis

Frequency detection as presented in Appendix A.7.1 is difficult to work with because of the different distributions of *x*_{+} and *x*_{−} samples. Kernel theory requires us to work with basis functions which are orthonormal with respect to the data measure, which in that case would be a mixture. To get around this issue, we simplify the data distribution so that all of our data are Gaussian white noise. The Gaussian measure in 1-d has the natural orthogonal basis of (probabilists') Hermite polynomials *He*_{k}(*x*) [1]. The first few of these are

*He*_{0}(*x*) = 1, *He*_{1}(*x*) = *x*, *He*_{2}(*x*) = *x*^{2} − 1, *He*_{3}(*x*) = *x*^{3} − 3*x*.

We can normalize these as *ψ*_{k}(*x*) = (*k*!)^{−1/2}*He*_{k}(*x*) so that orthonormality under the Gaussian distribution is satisfied. To construct an orthonormal basis in *d* > 1 dimensions, we take the tensor product of Hermite polynomials, which is orthonormal for the multivariate Gaussian due to the separability of the integrals over each dimension.
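Orthonormality of the normalized basis under the standard Gaussian can be verified with Gauss–Hermite quadrature; a sketch using `numpy.polynomial.hermite_e`, which implements the probabilists' convention:

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial, sqrt, pi

kmax = 5
pts, wts = He.hermegauss(50)   # quadrature nodes/weights for the weight e^{-x^2/2}

def psi(k, x):
    """Normalized probabilists' Hermite polynomial psi_k = He_k / sqrt(k!)."""
    coef = np.zeros(k + 1)
    coef[k] = 1.0
    return He.hermeval(x, coef) / sqrt(factorial(k))

# Gram matrix G[j, k] = E_{x ~ N(0,1)}[psi_j(x) psi_k(x)]
G = np.array([[np.sum(wts * psi(j, pts) * psi(k, pts)) / sqrt(2 * pi)
               for k in range(kmax)] for j in range(kmax)])
```

The Gram matrix comes out as the identity (the quadrature is exact for polynomials of this degree), confirming the basis is orthonormal under N(0, 1).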

#### A.2.2 Target function

The labels *y* = *f**(*x*) are chosen to depend only on the amplitude of frequency *f*_{1} in the white noise signal. Writing the input signal in frequency space, with **Φ** the discrete cosine transform (DCT) matrix [87], the target function *f** depends only on the *f*_{1} component. A simplified classification task would then threshold the power at *f*_{1}, where the threshold *θ* is chosen so that 50% of the points are labeled ±1. However, we need to write the target function *f** in the polynomial basis (14), and (16) is a step function which has an infinite polynomial expansion. A much simpler target function still captures some of the behavior of (16): it is negative when the power in frequency *f*_{1} is below *θ* and positive above. We will analyze the quadratic surrogate (17).

#### A.2.3 Kernel decomposition

We consider kernels *k*_{struct}, *k*_{unstruct} of the form

where *k* = (*k*_{1}, *k*_{2},…, *k*_{d}) is a multi-index. We build the *d*-dimensional kernel out of 1-d kernels *κ*_{i}, where the eigenvalue sequence *ρ*_{0}, *ρ*_{1},… is shared across the different *κ*_{i}. Each *κ*_{i} defines an RKHS over functions of a single variable, and their product defines an RKHS which is the tensor product, with eigenvalues given as products of the 1-d eigenvalues. These tensor product polynomial kernels were considered in [3] for multiple kernel learning. We normalize the 1-d kernels (19) to have unit trace; by the separability of the eigenvalues, the *d*-dimensional kernel (18) is then also normalized. Furthermore, we can assume that all of the *ρ*_{i} < 1.

In general, random feature kernels will not be tensor products unless we make very specific choices of weights and nonlinearities. We only use the product structure for convenience with normalization. The form (18) is, on the other hand, quite general and will hold for a variety of random feature kernels.
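A minimal numerical sketch of such a tensor-product Mercer kernel, truncated at a finite order and using an illustrative geometric eigenvalue sequence ρ_k = (1 − ρ)ρ^k so the 1-d trace is approximately 1 (these choices are ours, not the paper's):

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial, sqrt

rng = np.random.default_rng(2)
kmax, rho = 20, 0.3                        # truncation order, decay rate (illustrative)
lam = (1 - rho) * rho ** np.arange(kmax)   # 1-d eigenvalues, sum ~= 1

def psi(k, x):
    """Normalized probabilists' Hermite polynomial psi_k = He_k / sqrt(k!)."""
    c = np.zeros(k + 1)
    c[k] = 1.0
    return He.hermeval(x, c) / sqrt(factorial(k))

def kappa1(x, xp):
    """Truncated 1-d Mercer kernel: sum_k lam_k psi_k(x) psi_k(x')."""
    return sum(lam[k] * psi(k, x) * psi(k, xp) for k in range(kmax))

def kernel(x, xp):
    """Tensor-product kernel: product of the 1-d kernels over dimensions."""
    return np.prod(kappa1(x, xp), axis=-1)

# normalization check: the trace E_{x ~ N(0, I_d)}[kernel(x, x)] should be ~= 1
d, n = 3, 20_000
xs = rng.standard_normal((n, d))
trace_mc = np.mean(kernel(xs, xs))
```

The Monte Carlo trace is close to 1, matching the normalization used in the text (and in footnote 2).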

#### A.2.4 Target function in unstructured and structured networks

Now we analyze the target function in both structured and unstructured spaces. Using the structured random feature map is equivalent to some deterministic remapping, where **Φ** is an orthogonal matrix encompassing the basis change and **Λ** is a filtering matrix. As in Appendix A.8, let us use stationary bandpass features so that **Λ** contains *d′* entries which are equal to 1, with the rest 0, and **Φ** is the DCT. (We use the DCT rather than the DFT to avoid complications dealing with complex variables.) Since the DCT is unitary, the transformed variables and the original variables both follow spherical Gaussian distributions, just in different dimensions. This means that the Hermite polynomial basis (14) is an orthonormal basis for both spaces.

Let the structured RKHS be the space of functions obtained after applying the transformation; it has the structured kernel and its eigenbasis under the pushforward measure *ν*. We take the unstructured RKHS to be the space of functions without ever transforming coordinates, i.e. the RKHS with kernel *κ*(*x, x′*) and eigenbasis *ψ*_{k}(*x*) under data measure *μ*.

The target function (17) is simple to express in the eigenbasis of the structured RKHS:

Any RKHS with eigenvalues *ρ*_{i} and eigenfunctions *ψ*_{i} defines an inner product, where the *L*^{2}(*μ*) inner product is taken with respect to the data measure *μ* [2]. The norm of our target function thus becomes

We will see that working with the less informative RKHS leads to a significantly larger norm.

To express the target function (17) in the eigenbasis of the unstructured RKHS we must use the basis change matrix. Let *u* be the vector corresponding to row *f*_{1} of **Φ**^{T}. Thus,

Reading off the coefficients of the basis terms (recall that 1 = *ψ*_{0}(*x*)), we get that

where the quadratic terms combine to 1 because **Φ** is unitary. Specifically for the DCT-II [87], the relevant components are uniformly bounded for any *f*_{1} ∈ {1,…, *d* − 1}. (For general unit vectors *u*, the analogous quantity is at most 1 by Hölder's inequality.) Thus for DCT-II the norm is

#### A.2.5 Comparison of the learning performance in structured and unstructured RKHS

A very standard but rough bound on the generalization performance of kernel ridge regression or classification can be found by analyzing the Rademacher complexity of the class of linear functions in the RKHS [82, 83]. These bounds control the expected loss over new data in terms of the training loss plus an error term, cf. Theorem 7.39 in [83] for kernel ridge regression. The error term controls the generalization gap between test and training losses and is typically proportional to the RKHS norm of the target (for 0/1 loss) or its square (for square loss).^{2} Thus, a function with small RKHS norm is *easier to learn* in the *precise sense* that it takes a smaller training set size to achieve a given generalization gap.

We have computed the two norms in (23) and (22). Examining those two expressions, we see that the unstructured norm is larger due to two factors. First, there is the splitting of the quadratic target into *x*_{i}^{2} and *x*_{i}*x*_{j} terms, which creates the extra term in the norm. This arises from *nonlinear effects* of the quadratic target function on the different components of the input signal. This nonlinear term is of the same order as the other terms, but (23) will be larger than (22) even for *d′* = *d*. Second, the *dimension reduction* from *d* to *d′* dimensions means that analogous terms in (23) and (22) scale like *c*^{d} and *c*^{d′}, respectively. Note that the projection is a linear operation but has nonlinear consequences for the norm. The norm in the unstructured kernel space is exponentially larger than the norm in the structured kernel space.

Interpreting this exponential norm separation in light of the Rademacher generalization bounds, one would need to train on exponentially more samples with the unstructured kernel than with the structured kernel to achieve the same bound. The majority of this effect is due to the dimension reduction factor, since the nonlinear factor only grows the norm by a constant. However, for target functions with contributions from many higher-order polynomials, the nonlinear factor would have a stronger effect. This is an interesting avenue for future research. More precise estimates of generalization performance are possible using theories such as [10]. However, the exponential gap between the performance of these two kernels is fundamental and not an artifact of shortcomings of the Rademacher analysis.

#### A.2.6 Simulation results support the theory

We ran simulations of the simplified frequency detection task with both kernel methods and random feature networks to check that this simplified task is similar to the task presented in Section 2.3.1 and Appendix A.7.1. These simulations were performed with regression tasks using targets like (17) and classification tasks equivalent to (16). The results are consistent across a broad range of estimators, for both kernel methods and random feature networks.

For regression tasks, we used a tensor product kernel of the form (18) with kernel ridge regression and the target function (17). The constant *a* was chosen so that the labels had standard deviation 1, meaning that a mean square error of 1 is equivalent to chance. For varying *n* logarithmically spaced between 24 and 6,400, we generated training and testing sets of size *n*/2 in *d* = 1,000 dimensions with *f*_{1} = 16. The training set was used to select the ridge parameter from powers of 10 in the range [10^{−3},…, 10^{3}] by 5-fold cross-validation, and error was computed on the test set. Each of these experiments was repeated 20 times. We compare unstructured and structured kernels by either passing in the raw vectors *x* (unstructured) or first performing the DFT and passing in the frequency components around the target frequency (structured). Here, `band` is a bandwidth parameter determining the number of components that are used in the structured kernel. When `band` = 0 we keep only the target frequency component.

The classification tasks were similar. We used the exact target function (16), the range of *n* from 24 to 3,200, the same split into training and test sets, *d* = 100, and *f*_{1} = 16. We classified using a random feature network with 2,000 neurons, ReLU nonlinearity, and zero bias, trained with a linear SVM classifier readout. The SVM regularization strength was also swept over the range [10^{−3},…, 10^{3}] by powers of 10 and selected via 5-fold cross-validation on the training set. The accuracy of the selected estimator was then computed on the test set. We compare both classical and mechanosensory receptive fields with bandwidth parameter `band` equivalent to *f*_{lo} = *f*_{1} − `band` and *f*_{hi} = *f*_{1} + `band` and no decay (*γ* = ∞).

The results are shown in Fig. 9. For either regression or classification, the best performance is achieved by the structured method with the smallest bandwidth: error increases and accuracy degrades with bandwidth. The unstructured kernel and unstructured random feature methods both perform only at chance levels, square error ≈ 1 or accuracy ≈ 0.5. Similar results occur with kernel SVM on the classification task, kernel support vector regression, and kernels which are not tensor product kernels (not shown). Notably, the random feature network's limiting kernel is an arc-cosine kernel, which is not a tensor product kernel.

These simulations show that simplified frequency detection still exhibits the main features of the other tasks we study while remaining amenable to theoretical analysis. The simulation results are not sensitive to details of the experiment and are qualitatively similar when assumptions of the theory are broken. Getting quantitative predictions of training and testing errors from the theory is left to future work.

### A.3 Covariance parameter optimization

Here we describe the details of how the GP covariances were fit to our various datasets.

#### A.3.1 Mechanosensor covariance

We aim to minimize the difference between the matrix generated by the covariance model **C**_{model} and the data **C**_{data}, while keeping *f*_{lo} smaller than *f*_{hi}. For simplicity, we measure the covariance mismatch with the Frobenius norm, solving

We use the trust region algorithm provided by `scipy.optimize.minimize` to solve (26).
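A sketch of this optimization, with the constraint *f*_{lo} ≤ *f*_{hi} passed as a linear constraint. The covariance model below is a simple stationary bandpass stand-in, not the paper's Eq. 8, and the "data" covariance is synthetic:

```python
import numpy as np
from scipy.optimize import minimize, LinearConstraint

dt = 1e-3                                   # 1 ms sampling
taus = np.arange(40) * dt                   # time lags
T = np.abs(taus[:, None] - taus[None, :])   # |t - t'| matrix

def C_model(p):
    """Stand-in bandpass covariance (NOT the paper's Eq. 8): autocorrelation
    of an ideal bandpass process with band [f_lo, f_hi], normalized to unit variance."""
    f_lo, f_hi = p
    c = 2 * f_hi * np.sinc(2 * f_hi * T) - 2 * f_lo * np.sinc(2 * f_lo * T)
    return c / (c[0, 0] + 1e-12)

# synthetic "data" covariance from ground-truth parameters plus small noise
rng = np.random.default_rng(4)
C_data = C_model([75.0, 200.0]) + 1e-3 * rng.standard_normal(T.shape)
C_data = (C_data + C_data.T) / 2

def frob_loss(p):
    return np.linalg.norm(C_model(p) - C_data, ord="fro") ** 2

# f_lo - f_hi <= 0, solved with the trust-region constrained method
res = minimize(frob_loss, x0=[50.0, 250.0], method="trust-constr",
               constraints=[LinearConstraint([[1.0, -1.0]], -np.inf, 0.0)])
f_lo_fit, f_hi_fit = res.x
```

The optimizer reduces the Frobenius mismatch while keeping the band ordering feasible.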

#### A.3.2 V1 covariance

To fit the covariance model to the data, we formulate an optimization problem over the model parameters *s* and *f*, where we minimize the Frobenius norm of the difference between the covariance matrix **C**_{model} and **C**_{data}:

We solve (27) using the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm provided by `scipy.optimize.minimize`.

### A.4 Null receptive field models

We construct null GP models for both mechanosensors and V1 for comparison.

#### A.4.1 Mechanosensor covariance

We compare the data covariance matrix with the unstructured model and the Fourier model (Eq. 5). For the Fourier model, we fit the *f*_{hi} and *f*_{lo} parameters to data by minimizing the Frobenius error in the covariance matrix, finding *f*_{lo} = 75 Hz and *f*_{hi} = 200 Hz. The resulting covariance matrices are shown in Fig. 10 and Fig. 11. Receptive field samples from the null models are shown in Fig. 12. The samples from the Fourier model are smooth but do not decay in time like biological receptive fields.

#### A.4.2 V1 covariance

We compare the data covariance matrix with the unstructured model and the translation invariant version of the V1 model (9). For the translation invariant model, we remove the localizing exponential and only fit the spatial frequency parameter, *f* (finding *f* = 0.73 pixels). The neuron weights are generated from a covariance function of the following form:

The resulting covariance matrices are shown in Fig. 13 and Fig. 14. Receptive field samples from the null models are shown in Fig. 15.

### A.5 Derivation of eigenfunctions of V1 covariance function

The covariance between two pixel locations is given by

Since the covariance function factors into a product of functions over the variables, the multidimensional eigenfunctions *ϕ*_{k}(*t*) and eigenvalues also factor into products of 1-dimensional eigenfunctions and eigenvalues. This holds for *d* > 2 dimensions as well. So we work in 1-d and search for eigenfunctions and eigenvalues such that, with *C*(*t, t′*) = *e*^{−α(t−t′)²} *e*^{−δ(t²+t′²)},

We make the ansatz that *ϕ*_{k}(*t*) = *e*^{−c_{1}t²} *H*_{k}(*c*_{2}*t*), where *H*_{k} is the *k*th Hermite polynomial (physicists' convention) [8] and *c*_{1}, *c*_{2} are constants. With this guess for the eigenfunctions,

Solving for the unknown constants leads to the equations

The last step is to find the normalization constant for the eigenfunctions:

Therefore, we obtain orthonormal eigenfunctions and eigenvalues for the 1-dimensional covariance, and the spectrum decays exponentially.
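The exponential decay can be checked numerically by discretizing the integral operator (with the illustrative choice α = δ = 1; the ratio of successive eigenvalues should be approximately constant, i.e. the spectrum is geometric):

```python
import numpy as np

alpha, delta = 1.0, 1.0
t = np.linspace(-6, 6, 400)
dt_ = t[1] - t[0]

# discretized operator for C(t, t') = exp(-alpha (t-t')^2) exp(-delta (t^2 + t'^2))
T1, T2 = np.meshgrid(t, t, indexing="ij")
C = np.exp(-alpha * (T1 - T2) ** 2) * np.exp(-delta * (T1 ** 2 + T2 ** 2))

# eigenvalues of the integral operator, largest first
evals = np.linalg.eigvalsh(C * dt_)[::-1]
ratios = evals[1:6] / evals[:5]   # successive ratios of the top eigenvalues
```

The top eigenvalue ratios are essentially constant, confirming geometric (exponential) decay of the spectrum.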

### A.6 Distributed receptive field centers imply a sum kernel space

To generate our V1-inspired weights, we first sample a center *c* uniformly at random from the pixels in the image; call this set of pixels *S*. We will now derive the kernel for this weight sampling.

Suppose that all of the weights are sampled with a single center *c*. Then (12) tells us the structured kernel associated with the RFN,

where we have defined the *local basis change*

This local basis change projects into a basis of Hermite wavelets **Φ**_{c} centered at *c* and filters according to the eigenvalues **Λ**_{c}. The reproducing kernel (30) defines an RKHS of functions which take images as their input and produce a real-valued output. The RKHS is a Hilbert space and thus has a norm. Functions with small norm are, informally, smooth functions of the local wavelet coefficients.

In our experiments, we actually sample weights from all centers *c* ∈ *S* with equal probability. Taking the expectation over the centers, the kernel is an average over all of the local kernels (30),

Let the RKHS associated with *k*_{struct}(·,·) be another space of functions that take in images and output a real number. The sum (32) implies that this RKHS is a direct sum of the local RKHSs [83]. This means that any function in it can be written as *f* = ∑_{c∈S}*f*_{c}, with each *f*_{c} in the corresponding local space. The norm of this function comes from taking a minimum over all such decompositions

We can think of functions with small norm, which will be easiest to learn, as sums of smooth functions of local wavelet coefficients.

### A.7 Timeseries data generation

We detail how the two frequency classification tasks from Sections 2.3.1 and 2.3.2 are generated. In both tasks, each example is an *L* ms timeseries sampled at *f* Hz, making each *x* a vector of length *d* = *L* × *f*. Thus in the discrete setting, we only have *d* total frequencies. While the math below shows continuous signals, in our code we generate analogous discrete signals using the discrete Fourier transform basis.

#### A.7.1 Frequency detection

The frequency detection task from Section 2.3.1 is a binary classification task. The positive examples contain a pure sinusoidal signal with frequency *f*_{1} and additive Gaussian noise. The negative examples are just white noise. They are generated in the following way:
where *ω*_{j} = 2*πj*/*L* is the *j*-th natural frequency, *a* is a parameter that sets the SNR, and the coefficients *η*_{j}, *ξ*_{j}, *η*_{f}, *ξ*_{f} are random variables uniformly sampled from the unit circle (which gives each frequency component a random phase).

We define

with *a* ∈ [0, 1]. Larger *a* means a larger contribution of the pure tone and smaller amplitude noise. The generation process ensures that the *L*^{2} energy of both the negative and positive examples is matched and that the SNR is equal to the ratio of energy captured in frequency *f*_{1} to the total energy in all other components.
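The generation process above can be sketched in numpy (illustrative sizes; unit-circle coefficients are implemented as random phases, and both classes are normalized to unit energy):

```python
import numpy as np

rng = np.random.default_rng(5)
d, f1, a = 128, 16, 0.8        # signal length, target frequency index, SNR parameter
t = np.arange(d)
js = np.arange(1, d // 2)      # natural frequency indices
omegas = 2 * np.pi * js / d

def random_phase_sum(n, idx):
    """Sum of cosines at frequency indices idx with iid uniform random phases."""
    ph = rng.uniform(0, 2 * np.pi, (n, len(idx)))
    return np.cos(omegas[idx][None, :, None] * t[None, None, :] + ph[:, :, None]).sum(axis=1)

def unit_energy(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

n = 500
all_idx = np.arange(len(js))
f1_idx = np.array([f1 - 1])    # position of f1 within js

x_neg = unit_energy(random_phase_sum(n, all_idx))                     # pure noise
x_pos = unit_energy(np.sqrt(1 - a**2) * unit_energy(random_phase_sum(n, all_idx))
                    + a * unit_energy(random_phase_sum(n, f1_idx)))   # tone + noise

# fraction of energy at the target frequency for the positive class (~ a^2)
spec = np.fft.rfft(x_pos, axis=1)
frac = np.abs(spec[:, f1]) ** 2 / np.sum(np.abs(spec) ** 2, axis=1)
```

Both classes have matched unit energy, and the positive class concentrates roughly an *a*^{2} fraction of its energy at the target frequency.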

We generate a balanced dataset with 7,000 timeseries signals, which we split into a training set with 5,600 examples and a test set with 1,400 examples. We tuned the GP covariance parameters *f*_{lo}, *f*_{hi}, and *γ* from (8) using 3-fold cross-validation on the training set. We found the optimal parameters for a network with 20 hidden neurons and used them for all hidden layer widths. We tested *f*_{lo} and *f*_{hi} parameters from 10 Hz to 200 Hz at increments of 10 Hz. For *γ*, we tested parameters from 10 ms to 100 ms at increments of 10 ms. Using grid search, we tested all combinations of these parameters. The optimal model was refit using all 5,600 training samples, and the errors we report were measured on the test set.

#### A.7.2 Frequency XOR

We use a similar set up to generate the timeseries for the frequency exclusive-or (XOR) task in Section 2.3.2. The positive examples are either frequency *f*_{1} or *f*_{2} Hz pure sinusoids with additive Gaussian noise. The negative examples are either mixed frequency timeseries (with both *f*_{1} and *f*_{2} Hz signals) or pure Gaussian noise. They are generated in the following way:
for *k* ∈ {1, 2} in the *x*_{+,k}(*t*) function. The constants, random variables, and details of the SNR are identical to the frequency detection section. The datasets we generate have balanced proportions of *x*_{+,1}, *x*_{+,2}, *x*_{−,noise}, and *x*_{−,mixed} signals.

### A.8 SNR amplification via filtering

Let us consider the simple frequency detection task with stationary bandpass features. Since these features are stationary, their eigenvectors are Fourier modes, i.e. **Φ** is the discrete Fourier transform (DFT) matrix. Assume the features encode a bandpass filter, which means that **Λ** = diag(*λ*_{i}), *i* ∈ {0,…, *d* − 1}, with *λ*_{i} = 1 for *i*_{lo} ≤ *i* ≤ *i*_{hi} and 0 otherwise.

In frequency detection, a single frequency component (discrete Fourier mode) contains the signal with energy *a*^{2}. The other *d* − 1 components each have energy (1 − *a*^{2})/(*d* − 1), for a total energy of 1 − *a*^{2} contained in this white noise. After the basis change (4) is applied to any input *x*, the transformed vector will have zeros in all entries outside the passband. (We use the conjugate transpose **Φ*** here rather than the transpose since the DFT matrix is complex; the interpretation is the same.) This makes the new representation effectively *d′*-dimensional, where *d′* = *i*_{hi} − *i*_{lo}.

Now, first assume that the signal is within the passband. The total noise energy in the transformed representation becomes (1 − *a*^{2})(*d′* − 1)/(*d* − 1), since one of the *d′* components is still signal, and no energy is lost in the retained components because **Φ** is unitary. The overall noise is thus shrunk by a factor of (*d′* − 1)/(*d* − 1), so the SNR gets boosted by the reciprocal factor (*d* − 1)/(*d′* − 1). In the limiting case where *d′* = 1, the noise energy is 0 and the SNR is infinite. On the other hand, if the signal lies outside the passband, the SNR is reduced to 0.
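A numerical sketch of this noise-energy calculation (illustrative sizes; the unitary DFT preserves total energy, and the bandpass mask keeps only the passband bins):

```python
import numpy as np

rng = np.random.default_rng(6)
d, a = 256, 0.3
i_lo, i_hi = 32, 48                     # passband indices (assumed to contain the signal)
n = 2000

# noise-only part of the inputs: white noise with total energy 1 - a^2
X = rng.standard_normal((n, d)) * np.sqrt((1 - a**2) / d)
F = np.fft.fft(X, axis=1) / np.sqrt(d)  # unitary DFT: total energy is preserved
noise_before = np.mean(np.sum(np.abs(F) ** 2, axis=1))

mask = np.zeros(d)
mask[i_lo:i_hi + 1] = 1.0               # bandpass: zero everything outside the band
noise_after = np.mean(np.sum(np.abs(F * mask) ** 2, axis=1))

n_keep = i_hi - i_lo + 1                # number of retained bins
snr_boost = noise_before / noise_after  # ~ d / n_keep
```

The retained noise energy shrinks in proportion to the number of bins kept, so the SNR is boosted by roughly the ratio of total to retained dimensions.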

### A.9 Implementation details and code availability

In all experiments with RFNs, the training algorithm is an SVM classifier with squared hinge loss provided by `sklearn.svm.LinearSVC`, with all other parameters set to their defaults. We used `scipy` [17] and `numpy` [9] to construct both classical unstructured and neural-inspired structured weights. For the experiments with fully-trained networks, we used `pytorch` [12]. The cross-entropy loss function was optimized using the full-batch stochastic gradient descent (SGD) optimizer, i.e. gradient descent (GD). Our code is available at https://github.com/BruntonUWBio/structured-random-features.

### A.10 Covariance of V1 neurons with other stimuli

We repeat the covariance analysis from Section 2.2.3 on three additional datasets of V1 neurons. Different stimuli were shown in each dataset to calculate the receptive field.

The first dataset was provided by Ringach et al. from their work on characterizing the spatial structure of simple receptive fields in macaque (*Macaca fascicularis*) V1 [13]. The spikes of 250 neurons were recorded in response to drifting sinusoidal gratings. The receptive fields were calculated from the stimuli and responses using subspace reverse correlation. Because of the bandlimited properties of sinusoidal stimuli, this experiment biases the reconstruction towards smooth receptive fields. The receptive fields were of various sizes: 32 pixels × 32 pixels, 64 pixels × 64 pixels, and 128 pixels × 128 pixels. We resized them to a common dimension of 32 pixels × 32 pixels using local mean averaging. We find the optimal covariance parameters that fit the data to be *s* = 2.41 and *f* = 0.95 pixels. The covariance matrices and eigenfunctions are shown in Fig. 16. Examples of biological receptive fields and random samples from the fitted model are shown in Fig. 17 in the Appendix.

The second dataset contains the responses of 69,957 neurons recorded from the primary visual cortex of mice bred to express GCaMP6s. We presented 5,000 static natural images of 24 × 27 pixels in random order for 3 trials each. We calculated the receptive fields from the natural images and calcium responses of cells using ridge regression with an *ℓ*^{2} penalty set to 0.1, after each image pixel was z-scored across images. We used the average receptive field over all three trials. For the covariance analysis, we picked cells with SNR > 0.4, which gave us 10,782 cells. The optimal covariance parameters that fit the data are *s* = 5.40 and *f* = 1.17 pixels. The covariance matrices and eigenfunctions are shown in Fig. 18. Examples of biological receptive fields and random samples from the fitted model are shown in Fig. 19. Repeating this analysis using receptive fields from individual trials yields identical results.

The third dataset contains the responses of 4,337 neurons also recorded from the primary visual cortex of mice bred to express GCaMP6s. The mice were shown static discrete Hartley transform (DHT, similar to a real-valued discrete Fourier transform) basis functions of size 30 × 80 pixels, and the calcium responses of neurons were recorded. The receptive fields were calculated using ridge regression without any *ℓ*^{2} penalty. Here, we picked cells with SNR > 1 for analysis. We were left with 2,698 cells. The optimal covariance parameters that fit the data are *s* = 10.46 and *f* = 1.20 pixels. The covariance matrices and eigenfunctions are shown in Fig. 20. Examples of biological receptive fields and random samples from the fitted model are shown in Fig. 21.

### A.11 Initialization of networks with structured weights

We show results of initializing fully trained neural networks across a range of network widths (50, 100, 400, and 1,000) and learning rates (10^{−3}, 10^{−2}, and 10^{−1}) in Figures 25, 26, 27, and 28.

### A.12 Deep network experiments

We experimented with using the V1-inspired weight initialization in the first two convolutional layers of AlexNet [50] and training on the ImageNet Large Scale Visual Recognition Challenge from 2012 [78]. Our implementation was based on the example provided by `pytorch` and `torchvision` [12] and used the same optimization routine, parameters, and schedule as in https://github.com/pytorch/examples/tree/master/imagenet.

All convolutional layers were initialized with weights drawn from a Gaussian distribution with variance (*c*_{in}*d*_{x}*d*_{y})^{−1}, where *c*_{in} was the number of input channels, and *d*_{x} and *d*_{y} are the dimensions of the filter. This is equal to the reciprocal of the fan-in. In the case of classical initialization, this Gaussian distribution has covariance proportional to the identity, whereas in the structured case we use the V1-inspired covariance centered in the center of the filter with independent draws for each input channel. All biases and weights in the other layers are set with their `pytorch` defaults. The structured weights were only used in the first two convolutional layers of dimensions *d*_{x} × *d*_{y} = 11×11 and 5×5. The size parameter was set to *s* = max(*d*_{x}, *d*_{y}) · 3 and frequency bandwidth was *f* = max(*d*_{x}, *d*_{y})/5.

We show training and testing loss over the first 10 epochs for both the classical and structured initializations in Fig. 29. The structured initialization at first shows an advantage over classical, with consistently lower losses for the first 4 epochs, but eventually the classical network catches up. From this point onwards (until the 90 training epochs are complete), the classical network has the same or lower loss. Both networks end up performing well, reaching accuracies close to those reported in [50] and the `torchvision` documentation (https://pytorch.org/vision/stable/models.html), as shown in Table 3. The classical initialization performs slightly better overall.

These null results are perhaps not surprising: the initial layers of AlexNet contain only 64 and 192 output channels (i.e. filters), respectively, making up only a small fraction of the total weights in the network. The deeper convolutional layers contain many more channels and are built with small 3 × 3 filters, where our initialization is unlikely to help. It is also possible that the effects of initialization are less important for overparametrized models or with large amounts of training data.

## Footnotes


1. The V1 weights have all eigenvalues nonzero, but the spectrum decays exponentially, so it acts as a lowpass filter.

2. The trace of the kernel matrix also appears in these bounds, but this is unity due to normalization (20).

## 4 References

- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].
- [58].
- [59].
- [60].
- [61].
- [62].
- [63].
- [64].
- [65].
- [66].
- [67].
- [68].
- [69].
- [70].
- [71].
- [72].
- [73].
- [74].
- [75].
- [76].
- [77].
- [78].
- [79].
- [80].
- [81].
- [82].
- [83].
- [84].
- [85].
- [86].
- [87].
- [88].
- [89].
- [90].
- [91].
- [92].
- [93].
- [94].
- [95].
- [96].
- [97].
- [98].
- [99].
- [100].