The optimal spatial arrangement of ON and OFF receptive fields

Many sensory systems utilize parallel ON and OFF pathways that signal stimulus increments and decrements, respectively. These pathways consist of ensembles or grids of ON and OFF detectors spanning sensory space. Yet encoding by opponent pathways raises a question: How should grids of ON and OFF detectors be arranged to optimally encode natural stimuli? We investigated this question using a model of the retina guided by efficient coding theory. Specifically, we optimized spatial receptive fields and contrast response functions to encode natural images given noise and constrained firing rates. We find that the optimal arrangement of ON and OFF receptive fields exhibits a second-order phase transition between aligned and anti-aligned grids. The preferred phase depends on detector noise and the statistical structure of the natural stimuli. These results reveal that noise and stimulus statistics produce qualitative shifts in neural coding strategies and provide novel theoretical predictions for the configuration of opponent pathways in the nervous system.


INTRODUCTION
Across many sensory systems, neurons encode information about either increments or decrements of stimuli in the environment: so-called 'ON' and 'OFF' signals. This division between ON and OFF signaling has been observed in visual (Joesch et al. 2010, Kuffler 1953), thermosensory (Gallio et al. 2011), auditory (Smith and Lewicki 2006), olfactory (Chalasani et al. 2007), and electrosensory (Bell 1990) systems. This organization has the advantage that neurons can be tasked with signaling changes in the environment rather than the steady-state or mean stimulus condition, thereby resulting in more efficient neural codes (Barlow 1961, Gjorgjieva et al. 2014). Moreover, when the number of potential stimuli is large, neurons often specialize: for example, they respond only to a small region of visual space or a narrow auditory frequency band. The combination of these coding strategies raises two questions: First, how should specialized detectors, the ON and OFF cells, arrange themselves to most efficiently cover stimulus space? Second, what is the optimal relative arrangement of ON and OFF detector grids? For one system, the retina, the answer to the first question is clear from previous work: detectors tile stimulus space and exhibit overlap near the 1-sigma boundary of a Gaussian receptive field (Borghuis et al. 2008, DeVries and Baylor 1997, Doi et al. 2012, Gauthier et al. 2009a). The answer to the second question, what might be called the "sensor alignment problem," has received comparatively little attention and is the focus of this study. Conceptually, there are three general possibilities for how the sensor alignment problem could be solved: One possibility is that the grids of sensors are statistically independent, meaning the locations of receptive fields in one grid provide no information about the receptive field locations in the other grid.
A second possibility is that the two grids are aligned, meaning the receptive field centers in one grid are closer than expected by chance. The third possibility is that the two grids are anti-aligned, meaning the receptive field centers in the two grids are further apart than expected by chance. On general information theory grounds, the optimal solution is likely to depend on noise in the encoding process and the statistics of the encoded stimuli (Atick and Redlich 1990, 1992).
To determine the optimal solution to the sensor alignment problem, we rooted our analysis in efficient coding theory (Barlow 1961).
This theory argues that sensory systems should aim to reduce the redundancy present in sensory input while minimizing metabolic costs, thereby reliably encoding natural stimuli with fewer spikes.
Efficient coding theory has been successful at explaining many aspects of sensory processing and retinal physiology, including center-surround receptive fields, grids of receptive fields (henceforth called 'mosaics'), and a greater proportion of OFF than ON cells (Atick and Redlich 1992, Barlow 1961, Doi et al. 2012, Karklin and Simoncelli 2011, Ratliff et al. 2009). Thus, we asked whether efficient coding theory might predict the optimal spatial arrangement of ON and OFF receptive field mosaics within the retina. Our approach to this question involved optimizing a model that approximates the processing performed by many retinal ganglion cells (Karklin and Simoncelli 2011). By maximizing the mutual information between an (input) library of natural images and (output) spike rates, we could study the effects of image statistics and encoding noise on the optimal arrangement of ON and OFF mosaics.
In this model, we found that the optimal spatial arrangement was a pair of approximately hexagonal mosaics of ON and OFF receptive fields. However, surprisingly, the relative alignment of these mosaics depended on the input noise, output noise, and the statistics of the natural image set.
When output noise was low, the mosaics were aligned, with ON and OFF receptive fields centered at nearby locations more often than expected by chance. When output noise was relatively high, anti-alignment became the favored arrangement.
Surprisingly, the content of the image set also strongly influenced the transition between aligned and anti-aligned mosaics. In particular, when image sets contained more 'outlier' images with particularly large luminance or contrast values, anti-alignment became the favored state for fixed input and output noise. As we demonstrate analytically and confirm computationally, the boundary between these two regimes is characterized by a second-order phase transition: as noise parameters or stimulus statistics vary, mutual information changes smoothly, while the optimal mosaic arrangements undergo a sudden, qualitative shift. Finally, we confirm these predictions by showing that systematic manipulations of the training data set change the phase boundary in a manner predicted by an analytical model. These findings underscore the crucial role played by both noise and the statistics of natural stimuli for understanding specialization and coordination in sensory processing.

Linear-nonlinear efficient coding model
The model used to examine efficient coding is based on previous work (Karklin and Simoncelli 2011), designed to find the optimal encoding of visual scenes. The model takes patches of natural scenes (Doi and Lewicki 2007) as input, each of which has 18x18 (324) pixels. Each input image was multiplied by a circular mask to prevent corner artifacts in the optimization region from dominating the final configuration; this masking effectively left about 255 pixels in each input image patch. The model consisted of a linear-nonlinear (LN) architecture with learned linear filters and nonlinearities. Given an input stimulus x, an input noise nx, and the linear filter wj for each neuron j, the linearly filtered noisy stimulus for neuron j is yj = wj·(x + nx), and the nonlinearity is a softplus function of yj with a fixed sharpness of β = 2.5, a value chosen empirically because it produced a stable optimization trajectory. The output of the model consisted of 100 units that mimic retinal ganglion cells (RGCs), whose firing rates are modeled using neuron-specific gain and threshold parameters γj and θj, plus an additive output noise nr,j: rj = γj · softplus(yj − θj) + nr,j. Note that scaling r, γ, and nr by the same factor leaves the model invariant, so we choose units in which the mean firing rate is E[r] = 1. We used a first-order approximation that treats the conditional distribution of activation as a Gaussian, which allowed us to derive a closed-form expression for the mutual information; a detailed derivation is provided in the supplement. The model was trained to maximize the mutual information between the input images and the firing rates of the output neurons. To model the metabolic cost of spiking, the average firing rate of each neuron across images was constrained to 1, enforced via an augmented Lagrangian method with a quadratic penalty. Input and output noise were modeled as i.i.d. Gaussian noise (zero mean, with standard deviations σin and σout, respectively).
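As a minimal sketch of a single model unit (variable names are ours, and the exact softplus parameterization is an assumption since the text does not give it; here we take f(u) = log(1 + exp(βu)) / β):

```python
import math
import random

BETA = 2.5  # fixed softplus sharpness used in the paper

def softplus(u, beta=BETA):
    # numerically stable softplus: log(1 + exp(beta*u)) / beta
    z = beta * u
    return (max(z, 0.0) + math.log1p(math.exp(-abs(z)))) / beta

def firing_rate(w, x, gain, theta, sigma_in, sigma_out, rng):
    """One LN unit: y_j = w_j . (x + n_x), rectified, scaled, plus output noise."""
    noisy = [xi + rng.gauss(0.0, sigma_in) for xi in x]  # pixel-wise input noise
    y = sum(wi * ni for wi, ni in zip(w, noisy))         # linear filtering
    return gain * softplus(y - theta) + rng.gauss(0.0, sigma_out)

rng = random.Random(0)
r = firing_rate(w=[0.5, -0.25, 0.1], x=[1.0, 0.0, -1.0],
                gain=2.0, theta=0.5, sigma_in=0.1, sigma_out=0.5, rng=rng)
```

Note how the input noise is injected before the nonlinearity while the output noise is added after it, which is the distinction the Results section draws between the two noise sources.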
The parameters of the model were the receptive field filter weights, the gain of the nonlinearity, and the nonlinearity threshold of each output RGC. To preserve positive gain for each output unit, we optimized the logarithms of the corresponding parameters. Filter weights were initialized by drawing samples from a Gaussian distribution (mean zero, standard deviation 1; see Fig 1B, top row). The parameters of the nonlinearities (log gain and threshold) were initialized by sampling from a uniform distribution on [0, 1]. The parameters were then optimized using stochastic gradient descent with a learning rate of 0.001. During training, we calculated gradients using mini-batches of 100 image patches at a time, renormalizing the filters to have unit norm after each gradient step. Models were trained for 1,000,000 iterations, which we found sufficient in practice to ensure convergence of the mutual information. To escape local optima and speed convergence, two modifications were applied to the optimization procedure between 200,000 and 500,000 iterations, which we call jittering and centering. The jittering operation was applied every 5,000 iterations and consisted of raising each element of the kernels to the power 1.25; this makes high-amplitude portions of the filters more pronounced while attenuating low-amplitude portions. The centering operation penalizes the spatial spread of each kernel by adding the mean of the spatial variance of each kernel as an additional loss term; this encourages the kernels to be localized around their centers. These two operations gave kernels that were trying to 'squeeze' into an already formed mosaic the space to do so, effectively speeding the optimization and allowing all kernels to contribute to the encoding. After these operations, the model was optimized without the modifications for another 500,000 iterations.
We confirmed that jittering and centering do not alter the final converged shape of the kernels.
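The two heuristics can be sketched as follows (a paraphrase with hypothetical function names, not the authors' code): jittering sharpens each kernel by exponentiating magnitudes while preserving signs, and centering computes the energy-weighted spatial variance that is added to the loss:

```python
import math

def jitter(kernel, power=1.25):
    # raise each element's magnitude to `power` (sign preserved), then
    # renormalize to unit norm: high-amplitude parts grow relatively larger
    out = [math.copysign(abs(v) ** power, v) for v in kernel]
    norm = math.sqrt(sum(v * v for v in out)) or 1.0
    return [v / norm for v in out]

def spatial_variance(kernel2d):
    # energy-weighted variance of pixel coordinates about the kernel's
    # center of mass; small values mean a spatially localized kernel
    energy = [[v * v for v in row] for row in kernel2d]
    total = sum(map(sum, energy)) or 1.0
    cy = sum(i * e for i, row in enumerate(energy) for e in row) / total
    cx = sum(j * e for row in energy for j, e in enumerate(row)) / total
    return sum(((i - cy) ** 2 + (j - cx) ** 2) * e
               for i, row in enumerate(energy)
               for j, e in enumerate(row)) / total
```

A point-like kernel has zero spatial variance, while a kernel with energy spread far from its center of mass is penalized, matching the stated purpose of the centering term.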

Median nearest neighbor analysis of the spatial relationships between the mosaics
To quantify the relative alignment of ON and OFF mosaics, we used a metric based on the median nearest-neighbor distance between heterotypic receptive field centers. For each RF center, we found the nearest RF center of the opposite polarity to define its heterotypic nearest neighbor distance, and calculated the median of these values (see Equation 3). When ON and OFF mosaics are nearly aligned, this value is close to zero, as in Fig 2A-B.

One-shape model for rapid exploration of mosaic arrangements
The full 18x18 system required many hours to optimize, motivating the development of a smaller, scaled-down system for exploring how mosaic arrangements depended on different noise parameters (Fig 3) or different stimulus statistics (Fig 6). Here, we restricted input images to 7x7, fixed the number of ON and OFF cells at 7 each (14 total), and adopted a fixed parametric form for the receptive field. Thus, the optimization only learned receptive field center locations and a small number of receptive field shape parameters. The receptive field function was parameterized as a difference of two Gaussian curves, with a third parameter determining their relative magnitude.
Optimization in this setup involves fewer parameters and allows us to use more efficient first-order optimization methods such as Adam (Kingma and Ba 2014).
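The shared receptive field shape can be written as a radial difference of Gaussians (a minimal sketch with assumed parameter names: two widths plus a relative-magnitude parameter k, matching the three shape parameters described above):

```python
import math

def dog_profile(r, sigma_c=1.0, sigma_s=2.0, k=0.5):
    # center Gaussian minus a scaled, wider surround Gaussian;
    # k sets the relative magnitude of the surround
    center = math.exp(-r * r / (2 * sigma_c ** 2))
    surround = k * math.exp(-r * r / (2 * sigma_s ** 2))
    return center - surround

# an ON kernel is +dog_profile of the distance to its center; an OFF kernel
# is the same profile with its sign flipped
```

The profile is positive at the center and dips below zero at larger radii, giving the center-surround organization reported for the optimal radial kernel (Sup Fig 2C).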
To examine the dependence of mosaic arrangement on noise, a grid search was performed over combinations of input and output noise values, with input noise (σin) ranging from 0.0 to 0.75 and output noise (σout) from 0.75 to 3.0. As previously argued (Karklin and Simoncelli 2011), these values lie within the physiological range for naturalistic stimuli. The optimization was run three times for every pair of noise values, and the model with the highest mutual information was retained (Fig 3). This model was also used to examine the dependence of mosaic alignment on the statistics of natural scenes (Fig 6). In these simulations, the input noise was held fixed while the output noise and the distribution of natural images were varied to examine how these factors interact to shape the resulting mosaic arrangements.
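In outline, the search driver looks like the following (a sketch assuming a hypothetical `train_model(sigma_in, sigma_out, seed)` interface that returns a dict with a mutual-information estimate; the step sizes are taken from the grid search described in the Results):

```python
def sweep_noise_grid(train_model, n_restarts=3):
    """For each (sigma_in, sigma_out) pair, run several restarts and keep
    the model with the highest mutual information."""
    sigmas_in = [round(0.05 * i, 2) for i in range(16)]           # 0.00 .. 0.75
    sigmas_out = [round(0.75 + 0.10 * i, 2) for i in range(23)]   # 0.75 .. ~3.0
    results = {}
    for s_in in sigmas_in:
        for s_out in sigmas_out:
            runs = [train_model(s_in, s_out, seed=s) for s in range(n_restarts)]
            results[(s_in, s_out)] = max(runs, key=lambda m: m["mi"])
    return results

# toy stand-in for the real optimizer: here 'mi' just equals the seed,
# so the retained model should always be the last restart
demo = sweep_noise_grid(lambda s_in, s_out, seed: {"mi": seed}, n_restarts=3)
```

The real `train_model` would be the one-shape optimization; only the keep-the-best-restart logic is illustrated here.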

Linear-nonlinear efficient coding model of the retina
We asked how mosaics of ON and OFF cells should be arranged to efficiently encode natural scenes. To answer this question, we trained a model of retinal processing to maximize the mutual information between a library of natural images and a set of output firing rates (Fig 1A). The model consists of a single layer of linear-nonlinear units intended to approximate the encoding performed by retinal ganglion cells (Baccus and Meister 2002, Borghuis et al. 2008, Chichilnisky 2001, Doi et al. 2012, Karklin and Simoncelli 2011, Meister and Berry 1999). These model cells linearly filter the input stimulus, weighting it by a spatial mask or 'receptive field' and summing the resulting values. This number, loosely analogous to a membrane potential, is then fed through a nonlinearity to approximate rectified neural firing rates. The model also included two sources of noise: (1) a pixel-wise input noise, which models stochasticity in phototransduction; and (2) an output noise, which models stochasticity in output firing rates.
A key difference between the two is that the input noise passes through the nonlinearity, so its effect becomes stimulus-dependent, while the output noise does not. We trained the model by optimizing mutual information under the constraint that each neuron's mean activation (over all images) equals 1 (see Methods). This constraint is intended to model the metabolic cost of generating action potentials.

Spatial tiling of receptive fields is predicted by efficient coding
Following training, units in the model exhibited either ON-center or OFF-center receptive fields with strongly rectified nonlinearities (Fig 1B-C). Each population of ON-center units and OFF-center units (referred to henceforth as ON cells and OFF cells, respectively) exhibited a mosaic-like organization: their receptive fields tiled space (Fig 1D) (Karklin and Simoncelli 2011). When varying noise levels, we reliably found circular ON and OFF receptive fields with input noise between 0.00 and 0.75 and output noise between 0.75 and 3.00. Input and output noise levels outside these ranges produced irregular receptive field shapes (Sup Fig 1). Therefore, we restricted the analysis to this range of noise regimes.

Figure 1. A. Efficient coding model architecture. Images of natural scenes (plus input noise σin) are multiplied by linear filters wj, passed through a nonlinear function hj, and perturbed by output noise σout, resulting in a firing rate rj for neuron j. B. Examples of initial and optimized filters and nonlinearities, where the filters are initialized from a white noise distribution and converge to ON and OFF kernels with a center-surround pattern. The nonlinearities are initialized as unscaled (but slightly perturbed) softplus functions and converge to nonzero threshold values. C. A plot of 100 kernels from a trained model with σin = 0.4, σout = 3.0. D. A contour plot showing the tiling of the ON and OFF kernels. The contours of the two types of kernels are drawn where the normalized pixel intensity is ±0.21. Orange circles and blue x's indicate RF centers of mass for ON and OFF cells, respectively. The scale bar shows a 1-pixel distance.

Spatial arrangements of receptive field mosaics depend on input and output noise level
We next analyzed how these optimized mosaics of ON and OFF receptive fields were spatially arranged with respect to one another. Each optimized mosaic formed a nearly hexagonal grid (Fig 1D). Analysis of the spatial relationships between the mosaics in Fig 1D-E revealed they were anti-aligned (Fig 2D). However, as we explored the effect of changing the amounts of input and output noise on the optimization of this model, we observed other mosaic combinations that were aligned (e.g. Fig 2B). Generally, small paired (input and output) noise values yielded aligned mosaics, while larger paired noise values produced anti-aligned mosaics (Fig 2A-D). Importantly, for a fixed amount of input and output noise, multiple instances of the optimization produced consistent results, indicating that mosaic alignment and anti-alignment depended on the specific amounts of input and output noise and not on the different initializations of the optimization. These results motivated a more thorough analysis of how the relative spatial organization between the two mosaics depended on input and output noise.

Figure 2. The first two parameter sets result in aligned mosaics, while the latter two, at higher levels of output noise, are anti-aligned. The scale bar shows a 1-pixel distance.

A simplified 'one-shape' model captures the noise dependence of mosaic arrangements
To examine the dependence of the mosaic arrangements on noise over a range of densely sampled values, we utilized a simpler 'one-shape' model that trained much more quickly than the full model (see Methods). This model assumed that all receptive fields shared a common shape (a difference of concentric Gaussians), while optimizing over both the parameters of this shape and the locations of the receptive fields. The learned kernels of this simplified model also exhibited ON and OFF RFs (Sup Fig 2A) and ON and OFF RF mosaics (Sup Fig 2B). In addition, the optimal radial kernel exhibited a center-surround organization (Sup Fig 2C). To further speed training, we also reduced the number of units in the model to 7 ON and 7 OFF cells (their polarities were fixed during the optimization). These simplifications allowed us to rapidly judge whether the optimization preferred alignment or anti-alignment (Fig 3). To ensure these simplifications did not strongly bias the results, we compared the mosaic arrangements for particular input and output noise combinations across the one-shape model and the full model. In all cases examined, the two models produced matching aligned or anti-aligned results (Sup Fig 3). Thus, the one-shape model is a useful proxy for larger-scale and more general optimizations and could be used to reliably determine whether mosaics were aligned or anti-aligned for a given set of noise parameters.

Mosaics exhibit a phase change as a function of input and output noise
To determine the optimal arrangement of ON and OFF mosaics for different amounts of input and output noise, we performed a grid search over input and output noise values using the one-shape model. We ran the model with input noise ranging from 0.00 to 0.75 in steps of 0.05, and output noise from 0.75 to 3.00 in steps of 0.10. Optimizing the locations of seven RF centers in a circular space nearly always resulted in a single kernel in the center surrounded by the remaining six kernels at the vertices of a hexagon (see Fig 3). When the output noise level was low, ON and OFF RF mosaics were aligned, as in the first row of Fig 3B, and produced near-zero median nearest neighbor distances (dark purple area in Fig 3A), whereas under higher output noise levels ON and OFF RF mosaics were anti-aligned, as in the last row of Fig 3B, and produced larger median nearest neighbor distances (lighter colored area in Fig 3A). In particular, this analysis reveals an abrupt transition between mosaic alignment and anti-alignment (Fig 3A). Thus, our model optimization indicates that the solution to the sensor alignment problem depends on the noise present in the encoding process. As argued below, this is a second-order phase transition (Binney et al. 1992). Below, we build an analytical model for understanding this result.

Retinal phase transitions from an analytic model of efficient coding
To understand the factors that give rise to the phase transition between aligned and anti-aligned mosaics, we developed a simplified 1-dimensional analytic model in which ON and OFF grids have identical nonlinearities, fixed kernels, fixed spacing, and are only allowed to shift relative to one another (see supplementary section 2). Moreover, we assume that images are drawn from a Gaussian distribution with a 1/f frequency spectrum that is the 1D equivalent of the two-dimensional scale-free distribution of natural images (Atick and Redlich 1990, Ruderman 1994).
We verified via simulation that, even with all of these restrictions, the phase transition occurs just as in the less constrained models in both one and two dimensions, suggesting that the remaining free parameters - gain, threshold, and mosaic alignment - are sufficient to account for the phase transition (see Sup Fig 5 and Sup Fig 6).

Figure 3. A. As output noise increases, optimal mosaic configurations shift from aligned to anti-aligned, and this holds over a range of input noise values. Color indicates median NN distance, with darker indicating more clearly aligned (see Equation 3 and Methods). The existence of a phase boundary between the two alignments is clear. B. Examples of optimal mosaic arrangements for representative input and output noise combinations. Symbols denote the corresponding locations in 3A. The scale bar shows a 1-pixel distance. Note the existence of optimal configurations between aligned and fully anti-aligned (hearts) for some parameter values.
To gain additional insight into the origins of the transition, we analyzed the behavior of this model in the more analytically tractable limit of near-independent neurons. The starting point for this analysis is the recognition that the mutual information between stimulus and neural response can be conveniently written as a sum of single-neuron terms and higher-order corrections:

I = H(X) + 2N·I₁ + I₂^same + I₂^opp + …   (5)

Here, H(X) is the (constant) entropy of the stimulus, 2N is the total number of ON and OFF cells, I₁ is the contribution of each neuron individually, and I₂^same and I₂^opp are correction terms due to coactivation (and thus redundancy) of same-polarity (ON-ON/OFF-OFF) and opposite-polarity (ON-OFF) cells, respectively (see supplementary section 2B). Importantly, these corrections are always negative and vanish for widely separated cells. For this analysis, we do not consider third-order and higher terms, since the terms above are sufficient to model the transition from alignment to anti-alignment. In the following two sections we show (1) that noise and image statistics set the optimal neuron response thresholds, and (2) that these thresholds control the amount of redundancy in the population responses, thereby dictating whether mosaic alignment or anti-alignment is the preferred state.

Output noise and image outliers drive increases in optimal neuron response thresholds
First, we considered optimizing only the I₁ term above, which captures the contribution of each neuron individually to encoding stimulus information.
In particular, we focused on the influence of output noise and the distribution of natural images on the output nonlinearities of the neurons (we focused on output noise because the transition depended much more on output than input noise; see Fig 3A). As the output noise of neurons increases, maximizing I₁ dictates that neurons should respond by increasing their firing threshold (Fig 4A). This allows the neurons to reserve spikes for stimuli with high signal-to-noise ratios (Laughlin 2001). Thus, units respond less often but do so with higher gain when active, thereby mitigating the impact of noise.

Figure 4. A-B. A single neuron's mutual information contribution as a function of threshold. Stars mark optimal thresholds for multiple parameter values. A. When the input distribution is Gaussian, the optimal threshold increases with higher output noise, even as overall information decreases. B. When the input distribution has heavier tails, modeled with Student's t distributions with varying degrees of freedom, the optimal threshold again increases with heavier tails. C. An intuitive schematic using the distribution of natural images linearly projected onto activation levels of a neuron, to show the effects of increased noise or outliers. With increased noise (upper right), low-magnitude activations become uninformative, driving the threshold to increase. Similarly, a distribution with heavier tails (lower right) would require a higher threshold to maintain the same activation probability, since more mass is contained in outliers. That is, both output noise and heavy tails lead to higher degrees of optimal rectification.
In addition, the optimal firing threshold depends on the nature of the image distribution. For distributions with more outliers (see Methods), thresholds increase as the tail of the distribution becomes heavier (Fig 4B). Intuitively, this is because heavier-tailed distributions require higher thresholds to maintain the same overall response probability. Finally, it can be shown that decreasing the input noise has the effect of increasing the effective threshold (see supplementary section 2B, Equation 18). Thus, the optimal spiking threshold is determined both by the input and output noise and the distribution of natural scenes (schematized in Fig 4C) (Gjorgjieva et al. 2014).
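The heavy-tail intuition can be checked numerically: for a fixed response probability, the required threshold is just a tail quantile of the activation distribution, and heavier tails push that quantile upward. A toy Monte Carlo (our illustration, not the paper's I₁ optimization), comparing a Gaussian with a Student's t distribution (3 degrees of freedom, sampled as normal over the square root of a scaled chi-square):

```python
import math
import random

def t_sample(df, rng):
    # Student-t draw via normal / sqrt(chi-square / df)
    z = rng.gauss(0.0, 1.0)
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
    return z / math.sqrt(chi2 / df)

def threshold_for_rate(samples, p):
    # smallest theta with empirical P(x > theta) approximately p
    s = sorted(samples)
    return s[int((1.0 - p) * len(s))]

rng = random.Random(0)
n, p = 200_000, 0.02  # demand a 2% activation probability
th_gauss = threshold_for_rate([rng.gauss(0, 1) for _ in range(n)], p)
th_heavy = threshold_for_rate([t_sample(3, rng) for _ in range(n)], p)
# heavier tails require a higher threshold for the same activation probability
```

For a 2% activation probability, the Gaussian threshold sits near its 98th percentile (about 2.05), while the t(3) threshold is substantially larger, mirroring Fig 4B.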

Mosaic anti-alignment achieves redundancy reduction at high neuron thresholds
Next, we considered the effect of ON-OFF pair corrections to Equation 5. These terms, represented by I₂^opp above, capture the effects of redundancy in ON-OFF encoding: they are always negative and are the only contributions to mutual information that change as a function of alignment. Thus, while the overall trend is that mutual information decreases with increasing threshold, mosaic configurations may be more or less efficient for information transmission as they limit the impact of these terms. More specifically, the correction for an ON-OFF pair takes the form I₂ ∝ −p₂·R(d_ij), where d_ij is the distance between ON cell i and OFF cell j, p₂ is the probability that the pair is coactive, and 0 < R < 1 is a function that is approximately independent of both output noise and the neuron nonlinearity (see supplementary section 2B). This correction term is small and negative for large separations, large and negative around the intra-mosaic spacing, and almost zero in a region around zero inter-cell distance (Fig 5A). Importantly, the width of this 'almost zero zone' around zero inter-cell distance widens as output noise grows larger. This is explained by the fact that, as output noise increases, so does the threshold (Fig 4A), and at large threshold values, nearby opposite-polarity pairs are almost never coactive (see supplementary section 2B).

Figure 5. A. Normalized pairwise information correction (I₂ / N·p₁²) for pairs of ON and OFF neurons as a function of mosaic shift. One unit is the spacing between nearest-neighbor RF centers of the same polarity. With low output noise, an aligned position (i.e., no shift) is the only placement that avoids a large negative information correction, while with higher output noise, an ON-OFF pair can still have a finite shift within the gray area without incurring any penalty. B. The total normalized pairwise information correction as a function of mosaic shift, assuming equispaced ON and OFF RF centers filling the 1-D space. With low output noise, any mosaic shift lowers mutual information, while with high output noise, the pairwise mutual information becomes less negative when the mosaic is shifted by 0.5, i.e., when anti-aligned. C. The optimal mosaic shift as a function of output noise, showing the phase transition. For intermediate output noise (roughly σout = 1.7 to 2), the optimal mosaic shift lies between 0 (alignment) and 0.5 (anti-alignment), resembling the arrangements we obtained in the middle row of Fig 3B. D. Optimal 2-D mosaics in a low-noise regime. The gray circle indicates the area within which an OFF cell can be located with zero pairwise correction to mutual information. E. Optimal 2-D mosaics in a high-noise regime. Here, the gray area has grown (as in A), allowing three neighbors of the opposite polarity to fit inside without loss in mutual information.
This analysis gives rise to the following account of the phase transition: at low levels of output noise or stimulus outliers, neuronal thresholds are low, and only aligned RFs avoid the large I₂ penalty. Thus, in one dimension, each ON cell is non-redundant (save one bit of sign information; see supplementary section 2C) with its aligned OFF partner, but image correlations make it highly redundant with the two OFF (and two ON) cells on either side. Likewise, in two dimensions, each ON cell is non-redundant with its aligned OFF partner but suffers large I₂ penalties to encoding efficiency from its six other OFF neighbors.
However, as activity thresholds rise, a finite-sized region develops around each cell within which other RFs of opposite polarity are almost never coactive (Fig 5A, gray).
This region of non-redundancy grows with threshold, to the point that, when it is sufficiently large, more than one OFF cell can fit inside when the mosaics are anti-aligned (Fig 5D-E, gray circle). Thus, anti-alignment results in lower redundancy (via reduced I₂ penalties) and increased mutual information. This picture is borne out quantitatively by the model: as output noise (and thus threshold) increases, the optimal shift interpolates smoothly between aligned and anti-aligned configurations (Fig 5B-C and cf. Fig 3B, second row), suggesting a second-order Landau-Ginzburg phase transition (Binney et al. 1992).
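The vanishing coactivation of nearby opposite-polarity pairs at high threshold can be illustrated with a toy simulation (assumed parameters; this is our caricature, not the analytic model): an ON unit and an overlapping OFF unit see the same stimulus with opposite sign, so both can exceed a high threshold only through noise:

```python
import random

def coactivation_prob(theta, overlap=1.0, noise=0.5, n=100_000, seed=1):
    # ON activation ~ +s + noise, OFF activation ~ -overlap*s + noise;
    # estimate P(both exceed threshold theta) by Monte Carlo
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        s = rng.gauss(0.0, 1.0)
        on = s + rng.gauss(0.0, noise)
        off = -overlap * s + rng.gauss(0.0, noise)
        if on > theta and off > theta:
            hits += 1
    return hits / n

p_low = coactivation_prob(0.0)   # low threshold: pair often coactive
p_high = coactivation_prob(1.5)  # high threshold: coactivation nearly vanishes
```

At zero threshold the pair is coactive on roughly a tenth of trials despite their anti-correlated drive, while at a threshold of 1.5 coactivation is essentially absent, which is the regime in which an opposite-polarity neighbor incurs almost no I₂ penalty.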

Outliers in natural image statistics modify the transition from aligned to anti-aligned mosaics
The analysis above identifies output noise and image (stimulus) outliers as driving optimal encoding strategies toward higher neural response thresholds, which we claim is the key factor mediating the phase transition from aligned to anti-aligned mosaics. However, this analysis makes numerous simplifying assumptions, including a restriction to one dimension. Thus, we sought to test whether these predictions generalized to a less-restricted two-dimensional model in which image statistics were manipulated.
First, we identified the image patches that generated the highest and the lowest firing rates in our fitted model (Sup Fig 4). We identified these images using the firing rates under two noise regimes, one that produced aligned mosaics (Fig 2B; σin = 0.1, σout = 1.0) and one that produced anti-aligned mosaics (Fig 2D; σin = 0.4, σout = 3.0). In both noise regimes, the 100 image patches producing the highest firing rates had extreme luminance values (nearly all black or all white) or extreme contrast values (e.g. contained one or more edges between black and white regions). On the other hand, the 100 images producing the lowest firing rates were nearly homogeneous gray patches (Sup Fig 4). A 2-D histogram of the mean and standard deviation across the pixels of all of the available sample image patches confirmed that these images exhibited outlier mean or contrast values (Sup Fig 4E). Thus, under both noise regimes, the largest firing rates (across all neurons) were produced by rare, outlier images.
We next tested what impact these outlier images exert on the phase transition between aligned and anti-aligned mosaics. We re-ran the 'one-shape' model (e.g. Fig 3A), but with altered image sets that either reduced or increased the frequency of outlier images. From the distribution of all 18,587,848 possible image patches in the dataset (Fig 6A), we considered a 2-D space composed of the z-scores of the patch mean and SD values, and within this space we drew circular boundaries at 1 to 3 standard deviations from the mean of the average pixel intensity (per image) and from the mean of the variance over pixel intensities (per image). We then trained the one-shape model (see Fig 3) on only the patches within each boundary (Fig 6B). As predicted, the transition between aligned and anti-aligned mosaics occurred at lower noise levels as more outlier images were included in the training set and at higher noise levels when these outlier images were removed. Furthermore, these changes were mediated by an increase in activation threshold as outliers grew more prevalent (Fig 6C). Thus, we were able to reproduce the effects of outlier images on the phase transition between aligned and anti-aligned mosaics.
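The dataset manipulation can be sketched as follows (hypothetical helper; patches are flat lists of pixel values): z-score each patch's mean and SD against the whole collection, and keep only patches whose (mean, SD) point lies within a given radius of the center of the distribution:

```python
import math
import statistics

def filter_outliers(patches, radius):
    # z-score each patch's mean and SD; keep patches whose z-scored
    # (mean, SD) point lies within `radius` of the origin
    means = [statistics.fmean(p) for p in patches]
    sds = [statistics.stdev(p) for p in patches]
    mu_m, s_m = statistics.fmean(means), statistics.stdev(means)
    mu_s, s_s = statistics.fmean(sds), statistics.stdev(sds)
    kept = []
    for patch, m, s in zip(patches, means, sds):
        z = math.hypot((m - mu_m) / s_m, (s - mu_s) / s_s)
        if z <= radius:
            kept.append(patch)
    return kept

# demo: ten ordinary patches plus one extreme-luminance/contrast outlier
base = [[0.0, 1.0 + 0.01 * i, 2.0 + 0.02 * i] for i in range(10)]
outlier = [0.0, 50.0, 100.0]
kept = filter_outliers(base + [outlier], radius=2.0)
```

Shrinking the radius removes heavy-tail patches (lighter tails), while augmenting the set with copies of patches outside the boundary boosts outliers, as in the "boosted outliers" condition.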

DISCUSSION
Efficient coding theory provides a basis for understanding many features of early sensory coding, including the structure of individual receptive fields, how those receptive fields adapt, and the mosaic-like arrangement of receptive fields that uniformly tiles space (Atick and Redlich 1992, Barlow 1961, Doi et al. 2012, Karklin and Simoncelli 2011, Ratliff et al. 2009). We have extended this framework to investigate how the receptive fields of distinct cell types should be spatially arranged, given sensory noise and the statistics of natural stimuli. We term this the 'sensor alignment problem.' We discovered a second-order phase transition between aligned and anti-aligned mosaics, in which the preferred phase depends on noise and image statistics. We show that this phase transition arises because both sensor noise and heavy-tailed stimulus distributions promote higher thresholds (rectification) in the output nonlinearities of the individual ON and OFF cells. Higher thresholds decorrelate the responses of nearby cells, causing an abrupt transition between aligned and anti-aligned configurations that occurs when the decorrelation region grows large enough to include multiple nearest neighbors.
This relationship between the response threshold and the phase transition can be understood in terms of redundancy reduction, specifically, reduction of the co-activation of opposite-polarity neurons. As our analysis in supplementary section 2B shows, when an adjacent ON-OFF pair of neurons is coactive, the two neurons encode an overlapping portion of the image distribution and are thus redundant, which diminishes coding efficiency. Therefore, the best strategy to minimize redundancy is for each neuron to minimize the number of opposite-polarity adjacent neurons with which it is coactive. At a low threshold, this optimization allows only one cell at a given spatial location (Fig 5D), while at a high threshold, it allows three cells (Fig 5E). In the latter case, a high response threshold enhances coding efficiency even when receptive fields have substantial (but not perfect) overlap (Pitkow and Meister 2012). It is worth noting that one bit of information is shared between an ON and an OFF neuron at the same spatial location (the sign of the stimulus), but this contribution is negligible compared to that of a pair of coactive neurons, which redundantly encode much richer information about the input signal (supplementary section 2C). The above analysis provides an intuitive yet rigorous account of why anti-alignment is favored between the two mosaics when the threshold is high.

Figure 6. A. Ellipses in the histogram represent data within 1-3 standard deviations of the mean, which we use to construct artificial datasets with varying contributions of outliers. B. Optimized ON and OFF receptive field mosaics using the image patches within the boundaries denoted in A, with σ_in = 0.4 and σ_out = 1.4 to 2.0, plotted with the same scheme as in Fig 3. As predicted, when trained on image sets that systematically excluded outliers (i.e., had lighter distributional tails), phase transitions happened only at larger values of output noise. The fourth column, denoted "all patches," used the full dataset as in Fig 3. The last column, denoted "boosted outliers," used an augmented dataset to which we added horizontal and vertical mirror flips of patches outside the 1.5 SD circle, thereby creating a distribution with even more outliers; these patches correspond to the top 20% of z-scores among all patches in the dataset. Here, the phase transition happened at lower output noise values. The scale bar shows a 1-pixel distance. C. Optimal thresholds as a function of output noise, using input distributions with different levels of outliers. As predicted, the optimal thresholds tended to increase as the input distribution contained more outliers, allowing the phase transition to happen at a lower output noise level.
Our approach is closely related to several strands of work on efficient coding in the retina. As noted above, efficient coding as redundancy reduction subject to signaling costs gave theoretical weight to the idea of center-surround receptive fields as whitening filters matched to the statistics of natural scenes (Atick and Redlich 1990, 1992, Barlow 1961, Laughlin 2001, Srinivasan et al. 1982). However, as later work has made clear, nonlinearities, particularly response thresholds, are perhaps even more important in optimizing retinal codes (Gjorgjieva et al. 2019, Pitkow and Meister 2012). In addition, noise in both sensory transduction (inputs) and neural outputs affects coding efficiency (Atick and Redlich 1990, 1992, Röth et al. 2020). This work, by placing all of these factors in an optimization-based framework (Karklin and Simoncelli 2011), allows us to investigate the relative importance of each for the optimal spatial layout of receptive fields. This problem has received comparatively less attention than efficient coding in the temporal domain (Ocko et al. 2018, Palmer et al. 2015, Sederberg et al. 2018). As in previous work, we found that noise and image statistics set optimal neural response thresholds and that these thresholds drive redundancy reduction. However, the crucial link between mosaic structure and response thresholds has not previously been demonstrated. Importantly, this analysis relies only on information maximization arguments: the optimization does not require fitting data from electrophysiological recordings, as required by recent deep learning models of retinal coding (Maheswaranathan et al. 2018, Ocko et al. 2018).
By focusing on the factors that govern the transition between aligned and anti-aligned mosaics from an optimal encoding standpoint, we have necessarily ignored several real biological complexities.
Most importantly, we have not addressed whether mosaics in the retina are in fact aligned or anti-aligned. Most anatomical studies of retinal mosaics indicate that they are statistically independent (Wässle et al. 1981, Rockhill et al. 2000, Eglen 2012; but see Jang and Paik 2017). However, this work uses cell body locations to judge the spatial arrangements between the mosaics, and cell body positions do not strictly determine the locations of receptive field centers. Recently, we have shown in rat and primate retinas that pairs of RF mosaics from ON and OFF RGCs that encode similar visual features are anti-aligned (Roy et al. 2021). This indicates either that retinal noise is relatively high or that retinal processing is optimized for low signal-to-noise conditions, such as detecting dim or low-contrast stimuli (Dhingra et al. 2003, Field and Sampath 2017). We have also found that measured RF mosaics from ON and OFF RGCs that encode distinct visual features appear to be statistically independent (Roy et al. 2021). Our current efficient coding model only generates two types of units, which encode similar visual features but with opposite polarity; thus we are not yet able to determine whether RGC types that encode distinct visual features should be independent or coordinated.
Developing an efficient coding model that is optimized to encode natural movies may yield a greater diversity of cell types (Ocko et al. 2018). Many RGC types that encode particular space-time features, such as direction of motion or 'looming', also form mosaics (Vaney et al. 2012). Whether these detector grids are, or should be, coordinated remains an open question. Finally, our results motivate extending efficient coding theory to other sensory systems, such as touch receptors on the skin: might they be spatially coordinated for the efficient coding of touch (Kuehn et al. 2019)? More generally, these observations demonstrate that efficient coding theory can make predictions about emergent properties of the organization of the nervous system, such as how large populations composed of multiple cell types should be arranged to optimally encode the natural environment.

Figure 1. Circular ON and OFF receptive fields are robustly obtained only within specific input and output noise regimes. A. With input noise in the range 0.00 to 0.75 and output noise in the range 0.75 to 3.00, circular center-surround ON and OFF receptive fields are obtained reliably. B. Outside these ranges, i.e., when input noise is higher than 0.75 or output noise is lower than 0.75, either Gabor-shaped or noisy kernels are obtained.

Figure 3. A. A plot of 100 kernels that maximize information efficiency when all kernels are constrained to take the same shape but are free to move their centers (cf. Fig 1C). B. The radial shape function is parameterized as a difference of Gaussian curves. Learned radial shape parameters are: a = 0.3332, b = 0.1559, c = 0.4121. C. Learned ON kernels sharing the same radial shape fill the visual field (similar to Fig 1D). D. Learned OFF kernels sharing the same radial shape fill the visual field (similar to Fig 1E). The scale bar shows a 1-pixel distance.

Supplementary Figure 4. Most of the image patches with the highest firing rates are outliers of the overall image patch distribution, containing high contrast and edges. This is also evident in the firing rate outputs: using the kernels optimized under the two noise regimes shown in Fig 2B and Fig 2D, we calculated the firing rates of 100,000 randomly sampled image patches and overlaid the top 100 patches with the highest firing rates on the z-score histogram introduced in Fig 6A. A. 100 image patches with the highest firing rates under σ_in = 0.1, σ_out = 1.0. Images are either extremely dark or contain distinct edges, including more complicated edges than in C, and had firing rates of 3.07 ± 0.23. B. 100 image patches with the lowest firing rates under σ_in = 0.1, σ_out = 1.0. These images were visually similar to one another and had firing rates of 0.161 ± 0.003. C. 100 image patches with the highest firing rates under σ_in = 0.4, σ_out = 3.0. Images are either dark or contain distinct edges, with firing rates of 4.39 ± 0.54. D. 100 image patches with the lowest firing rates under σ_in = 0.4, σ_out = 3.0. Patches were almost entirely gray, with firing rates of 0.036 ± 0.001. E. A 2-D histogram showing the mean and standard deviation of all of the available sample image patches (in orange, occurrences in log scale, same as Fig 6A), with the image patches in A (blue x) and the image patches in B (green o). In the low-noise condition, the outliers are a greater portion of the tail (green o).

A. Mutual information in the linear-nonlinear model
We follow the conventions of Karklin and Simoncelli (2011). Let x ∈ R^D be a vectorized representation of an input image. We assume J neurons characterized by linear filters w_j (with ||w_j|| = 1) for j = 1, ..., J, which form the columns of a matrix W ∈ R^{D×J}. We assume pixelwise input noise to each image, n_x ~ N(0, σ_in^2 I), and output noise in firing rates, n_r ~ N(0, σ_out^2 I), so that the random input image and firing rate are X = x + n_x and R = r + n_r, respectively. With these conventions, the firing rate of the jth neuron given input x is

    r_j = γ_j h(w_j^T (x + n_x) − θ_j) + n_{r,j},    (7)

with γ_j the neuron's gain, θ_j its response threshold, and h the softplus nonlinearity described in Methods. Note that, in principle, this firing rate can be negative for sufficiently large and negative n_{r,j}. As in Karklin and Simoncelli (2011), we nonetheless treat Eq 7 as a useful approximation. Under these assumptions, the covariance matrix of firing rates given the input image is

    C_{r|x} = σ_in^2 G(x) W^T W G(x) + σ_out^2 I,    (8)

with G(x) = diag(γ_j h'(w_j^T x − θ_j)) a matrix of derivatives of the nonlinearities. As in Karklin and Simoncelli (2011), we make a local Gaussian approximation to the image distribution with covariance C_x, allowing us to write the posterior p(x|r) as Gaussian with covariance

    C_{x|r} = (C_x^{-1} + W G(x) C_{r|x}^{-1} G(x) W^T)^{-1}    (9)

and the conditional (differential) entropy as

    H(X|R) = (1/2) E_x[log det C_{x|r}] + (D/2) log(2πe).    (10)

Since H(X) is an unknown constant, maximizing the mutual information I(X; R) = H(X) − H(X|R) amounts to minimizing Eq 10 subject to a cost per spike that constrains the mean firing rate of each neuron:

    minimize over W, γ, θ:  E_x[log det C_{x|r}],  subject to  E[r_j] = r̄ for all j,    (11)

where the optimization objective follows from Eqs 9 and 10 via the Matrix Inversion Lemma, and we implement the constraint using an augmented Lagrangian method as detailed in Methods. Finally, the following rewriting of Eq 11 will be useful, particularly in our analysis of the single-neuron case:

    log det C_{x|r} = log det C_x − log det(I + C_{r|x}^{-1} G(x) W^T C_x W G(x)).    (12)
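For readers who want to experiment, the objective of Eqs 7-11 can be evaluated numerically as in the following sketch. This is an illustrative reimplementation, not the paper's code: it substitutes a hard threshold for the softplus derivative, and all names and parameter values are assumptions.

```python
import numpy as np

def logdet_posterior_cov(W, gamma, theta, Cx, sig_in, sig_out, x):
    """Evaluate log det C_{x|r} (the quantity minimized in Eq 11) for a
    single image x, approximating h' by a hard threshold so that G(x)
    is diagonal with entries 0 or gamma."""
    J = W.shape[1]
    g = gamma * (W.T @ x > theta).astype(float)   # diagonal of G(x)
    G = np.diag(g)
    # conditional covariance of the rates given the image (Eq 8)
    Crx = sig_in**2 * G @ W.T @ W @ G + sig_out**2 * np.eye(J)
    # posterior precision from Eq 9; log det C_{x|r} = -log det precision
    prec = np.linalg.inv(Cx) + W @ G @ np.linalg.inv(Crx) @ G @ W.T
    return -np.linalg.slogdet(prec)[1]
```

Averaging this quantity over image samples and adding the firing-rate penalty gives a loss that can be minimized over W, γ, and θ.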

B. Analytic model of alignment in 1-D
To better understand the optimization problem posed in Eq 11, particularly the contributions of kernel shape, image statistics, noise, and response thresholds to the transition from aligned to anti-aligned mosaics, we consider a simplified model that nonetheless exhibits the same behavior. In particular: 1. We restrict ourselves to a one-dimensional space in which both the ON and OFF mosaics are equidistant arrays of RF centers (spacing = 1). The ON cells are located at positions z_j = j, j ∈ Z, and the mosaics are allowed to shift rigidly relative to each other, so that the OFF cells are located at z̄_j = j + φ, φ ∈ [0, 0.5].
2. We assume a fixed kernel shape for all neurons, meaning that receptive fields differ only in their center locations: w j (z) = ±κ(|z − z j |) for a kernel shape function κ(z), where z j is the location of j-th neuron's kernel center.
4. Images x are assumed to follow a multivariate Gaussian distribution N(0, C_x), where the covariance between locations z_i and z_j depends only on |z_i − z_j| and is given by a function C_x(z).
In analogy with the finding that the amplitudes of natural images decrease as ~1/k, with k the spatial frequency (power spectral density ~1/k^2), for our numerical examples we assume a one-dimensional scale-free amplitude distribution ~1/√k (power spectrum ~1/k). As shown in Fig S5B, the phase transition still takes place under this Gaussian image assumption. Now, for a pair of neurons j and k, the scalars w_j^T w_k and w_j^T C_x w_k depend only on |z_j − z_k|, the distance between their centers, which we write in the frequency domain as:

    f(z_j − z_k) := w_j^T w_k = (1/2π) ∫ |κ(ω)|^2 e^{iω(z_j − z_k)} dω,    (13)
    g(z_j − z_k) := w_j^T C_x w_k = (1/2π) ∫ C_x(ω) |κ(ω)|^2 e^{iω(z_j − z_k)} dω,    (14)

where C_x(ω) and κ(ω) are the Fourier transforms of C_x(z) and κ(z), respectively. We can then define the constants f_0 and g_0 for the special case z_j = z_k:

    f_0 := w^T w = 1,    (15)
    g_0 := w^T C_x w,    (16)

where we used our assumption that the kernel w is a unit vector in Eq 15. An example of these functions is shown in Sup Fig 7.

The single-neuron case: I_1. We first consider the mutual information capacity of a single neuron in isolation. Since we have assumed all units to be identical, the same expression holds for all ON and OFF cells, and we let I_1 := −H(X|R) in this single-channel case. From Eq 12, this quantity is

    I_1 = (1/2) p_1 log A_1,  with  A_1 = 1 + γ^2 g_0 / (σ_out^2 + γ^2 σ_in^2),    (17)

where p_1 is the probability that the neuron is active, i.e., that w^T (x + n_x) > θ. Now, since x + n_x is Gaussian with mean 0 and covariance C_x + C_in, we have w^T (x + n_x) ~ N(0, g_0 + σ_in^2) and

    p_1 = 1 − Φ(θ/√(g_0 + σ_in^2)) = S(θ/√(g_0 + σ_in^2)),    (18)

where Φ and S are the cumulative distribution function and the survival function of the standard normal distribution, respectively. Note that if we define ς^2 := g_0 + σ_in^2 to denote the signal variance and θ̃ := θ/ς to denote the effective threshold, we see that the effective threshold increases when the signal variance decreases. We can calculate γ from the mean firing-rate constraint as

    γ = r̄ / ( ς [ ϕ(θ̃) − θ̃ S(θ̃) ] ),    (19)

where ϕ is the probability density function of the standard normal distribution and r̄ is the target mean firing rate. Substituting Eqs 18 and 19 into Eq 17, we get

    I_1(θ̃) = (1/2) S(θ̃) log(1 + γ(θ̃)^2 g_0 / (σ_out^2 + γ(θ̃)^2 σ_in^2)),    (20)

which was used to plot the single-neuron mutual information contributions in Fig 4A.
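The single-neuron quantities of Eqs 18-20 are straightforward to compute numerically. The sketch below is illustrative (parameter values are assumptions, and `mean_rate` stands in for the firing-rate budget r̄):

```python
import numpy as np
from scipy.stats import norm

def single_neuron_info(theta, g0, sig_in, sig_out, mean_rate=1.0):
    """Single-neuron information I_1 as a function of the raw threshold
    theta, following Eqs 18-20 of the text."""
    s = np.sqrt(g0 + sig_in**2)          # signal SD, ς
    th = theta / s                       # effective threshold, θ̃
    p1 = norm.sf(th)                     # activation probability (Eq 18)
    # gain set by the mean firing-rate constraint (Eq 19)
    gamma = mean_rate / (s * (norm.pdf(th) - th * norm.sf(th)))
    snr = gamma**2 * g0 / (sig_out**2 + gamma**2 * sig_in**2)
    return 0.5 * p1 * np.log1p(snr)      # Eq 20
```

Scanning `theta` over a grid and comparing curves for different `sig_out` values reproduces the kind of threshold/noise trade-off plotted in Fig 4A.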
A single-neuron case with heavy tails. To investigate the relationship between threshold and outliers in the image distribution, we performed the analysis in Fig 4B, which replaced the Gaussian image distribution with a Student's t-distribution. This allowed us to adjust the heaviness of the tails of the distribution by changing the degrees of freedom from ν = 4 (heavy tails) to ν = ∞ (Gaussian). In this case, formulas analogous to Eqs 18-20 can be derived.
More explicitly, if x ~ t_ν(0, C_x), then w^T (x + n_x)/ς is likewise t_ν-distributed, and we have

    p_1 = 1 − Φ_ν(θ̃) = S_ν(θ̃),    (21)

where Φ_ν is the cumulative distribution function of the t-distribution. Moreover,

    γ = r̄ / ( ς [ ϕ_ν(θ̃) (ν + θ̃^2)/(ν − 1) − θ̃ S_ν(θ̃) ] ),    (22)

where ϕ_ν and S_ν are respectively the probability density function and the survival function of Student's t-distribution with ν degrees of freedom. We see that (ν + θ̃^2)/(ν − 1) → 1 as ν → ∞, in which case Eq 22 is identical to Eq 19. These results can then be substituted into Eq 17 to derive an expression analogous to Eq 20.
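The limiting behavior of the t-distributed gain can be checked numerically. In this sketch, `theta_eff` is θ̃ and `s` is ς; the specific values are illustrative:

```python
import numpy as np
from scipy.stats import t as student_t, norm

def gain_t(theta_eff, nu, s, mean_rate=1.0):
    """Gain under Student's-t signal statistics (Eq 22)."""
    pdf = student_t.pdf(theta_eff, df=nu)
    sf = student_t.sf(theta_eff, df=nu)
    denom = s * (pdf * (nu + theta_eff**2) / (nu - 1) - theta_eff * sf)
    return mean_rate / denom

def gain_gauss(theta_eff, s, mean_rate=1.0):
    """Gaussian gain (Eq 19), the nu -> infinity limit of gain_t."""
    return mean_rate / (s * (norm.pdf(theta_eff)
                             - theta_eff * norm.sf(theta_eff)))
```

For very large ν the two expressions agree, as the text notes for Eq 22 versus Eq 19.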

Decomposition of the mutual information
In previous sections, we have considered the encoding properties of a single neuron in isolation.
Here, we return to the full optimization objective in Eq 11, showing how we can rewrite this as a sum of independent, single-neuron contributions plus pairwise (and higher-order) correction terms.
We begin by noting that the expectation in Eq 11 over images x involves a sum over patches that only activate subsets of neurons. More specifically, with the hard rectifying nonlinearity assumed in the previous section, we have G(x) = γ I{w^T (x + n_x) > θ} for ON cells, and similarly for OFF cells with w → −w. Here, I is the indicator function, which is 1 when its argument is true and 0 otherwise. Thus, the gain matrix is diagonal, with entries equal to either 0 or γ, and the matrix products involving G(x) in Eq 11 are low-rank for most x. As a result, we can rewrite the expectation as a sum over log determinants of low-rank matrices, each of which corresponds to a specific collection of active neurons.
For example, consider a three-neuron system, whose activity can be coded as a binary triplet (e.g., 010). Let A be the determinant in Eq 12, let p_s denote the probability that activation pattern s occurs over the image set, and let A_s be the corresponding determinant. Grouping terms by marginal activation probabilities, the expectation can be expanded as:

    E_x[log A] = p_000 log A_000
        + P(a_1) log A_100 + P(a_2) log A_010 + P(a_3) log A_001
        + P(a_1 a_2) log[A_110/(A_100 A_010)] + P(a_1 a_3) log[A_101/(A_100 A_001)] + P(a_2 a_3) log[A_011/(A_010 A_001)]
        + P(a_1 a_2 a_3) log[A_111 A_100 A_010 A_001/(A_110 A_101 A_011)].    (24)

The first term corresponds to the case when no neuron is active, for which log A_000 = 0. For the next three terms, each probability is the marginal probability of a single neuron being active, i.e., equal to p_1 as calculated above. Similarly, the next three terms constitute pairwise corrections to the first set of terms: they involve marginal probabilities of pairwise activation, and their log ratios vanish when the individual neurons are independent (e.g., A_110 = A_100 A_010). Thus, as the distance between a pair of receptive field centers increases beyond the image correlation length scale, these correction terms should vanish. In our analysis, we focus on the limit in which only first- and second-order interactions are non-negligible. This is an assumption that does not, in fact, hold for natural images (or even for our Gaussian images with long-range correlations), though, as we shall see, these interactions are sufficient to demonstrate the existence of a phase transition. Moreover, as Figures 5 and 6 show, the same phase transition occurs in simulation when images with long-range correlations are used, suggesting that the principles we identify in the pairwise limit generalize to the full model. Returning to our 1-D model with N ON and N OFF neurons, we note that, as per Eq 5, we can further decompose the pairwise corrections into two terms, one corresponding to ON-ON and OFF-OFF pairs, the other to ON-OFF pairs.
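The regrouping described above is an exact identity for any pattern probabilities and determinants; the following sketch verifies it numerically for arbitrary (randomly chosen, purely illustrative) values:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
patterns = list(product([0, 1], repeat=3))
# arbitrary probabilities over the 8 activation patterns of 3 neurons
prob = dict(zip(patterns, rng.dirichlet(np.ones(8))))
# arbitrary positive determinants for each pattern
A = {s: float(rng.uniform(1.0, 3.0)) for s in patterns}
A[(0, 0, 0)] = 1.0  # no active neurons: the determinant ratio is 1

def marg(idx):
    """Marginal probability that all neurons in idx are active."""
    return sum(prob[s] for s in patterns if all(s[i] for i in idx))

# left-hand side: direct expectation over all patterns
lhs = sum(prob[s] * np.log(A[s]) for s in patterns)

# right-hand side: single-neuron terms + pairwise + triple corrections
e = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]  # single-neuron patterns
rhs = sum(marg([i]) * np.log(A[e[i]]) for i in range(3))
for (i, j), s in [((0, 1), (1, 1, 0)), ((0, 2), (1, 0, 1)),
                  ((1, 2), (0, 1, 1))]:
    rhs += marg([i, j]) * np.log(A[s] / (A[e[i]] * A[e[j]]))
rhs += marg([0, 1, 2]) * np.log(
    A[(1, 1, 1)] * A[e[0]] * A[e[1]] * A[e[2]]
    / (A[(1, 1, 0)] * A[(1, 0, 1)] * A[(0, 1, 1)]))

assert np.isclose(lhs, rhs)
```

Dropping the triple (and, in larger systems, higher-order) correction terms gives the pairwise truncation used in the text.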
More explicitly, if we denote by a_j and ā_j the binary variables indicating whether the ON cell at location j and the OFF cell at location j + φ are active, respectively, we can write the mutual information up to pairwise corrections as

    I ≈ 2N I_1 + 2N Σ_{j≥1} I_2(j) + N Σ_{j∈Z} Ī_2(j; φ),    (25)

where the first term contains I_1 from Eq 20, and the next two terms correspond to pairwise interactions between neurons of the same and of different polarities, respectively. Note here that we have used translational invariance to write the pairwise corrections as the interaction between a single ON cell at 0 and all other cells. Likewise, since the ON and OFF cells are identical up to a polarity flip, the I_2 term includes both mosaics' self-interactions.
Clearly, only the last term in Eq 25 is a function of φ, so to determine whether alignment (φ = 0) or anti-alignment (φ = 0.5) is preferred, we only need to find the value of φ that maximizes

    Σ_{j∈Z} Ī_2(j; φ) = Σ_{j∈Z} E_x[a_0 ā_j] log( Ā_{2j} / A_1^2 ).    (26)

We know A_1 from Eq 17, and we can calculate the two-neuron determinant ratio Ā_{2j} as follows. First define the matrices P = W^T W and Q = W^T C_x W from Eq 11, so that, for a pair of active neurons with gain matrix G = γI,

    Ā_{2j} = det( I + (σ_out^2 I + σ_in^2 γ^2 P)^{-1} γ^2 Q ),

then substitute the kernels w_0 and −w_j (negative because of the opposite polarity): P has unit diagonal and off-diagonal entries −f(d_j), and Q has diagonal entries g_0 and off-diagonal entries −g(d_j), where d_j is the distance between the pair's centers and f and g are defined in Eqs 13 and 14.

Supplementary Figure 8. Plots of the correlation-like quantities ρ, ρ′, and ρ″ as defined in Eqs 37 and 33, and of the log ratio log(Ā_{2j}/A_1^2) expressed in terms of these quantities, with respect to color-coded σ_out values. These quantities depend only weakly on σ_out, suggesting that the E_x[a_0 ā_j] term in Eq 26 plays the more crucial role in the phase transition.
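A small numerical sketch of the two-neuron determinant ratio Ā_2j, with the matrices P and Q built as described in the text (parameter values are illustrative assumptions):

```python
import numpy as np

def two_neuron_logdet(f, g, g0, gamma, sig_in, sig_out):
    """log Ā_2j for an opposite-polarity pair whose kernels overlap with
    f = w_0 . w_j (filter overlap) and g = w_0^T C_x w_j (signal overlap),
    both of which shrink as the pair's separation grows."""
    P = np.array([[1.0, -f], [-f, 1.0]])   # W^T W for kernels (w_0, -w_j)
    Q = np.array([[g0, -g], [-g, g0]])     # W^T C_x W
    M = np.eye(2) + np.linalg.inv(
        sig_out**2 * np.eye(2) + sig_in**2 * gamma**2 * P) @ (gamma**2 * Q)
    return np.linalg.slogdet(M)[1]

# sanity check: for far-separated cells (f = g = 0) the pair contributes
# exactly twice the single-neuron log determinant, so log(A_2j/A_1^2) = 0
g0, gamma, si, so = 3.8, 1.5, 0.4, 1.0
logA1 = np.log(1 + gamma**2 * g0 / (so**2 + gamma**2 * si**2))
assert np.isclose(two_neuron_logdet(0.0, 0.0, g0, gamma, si, so), 2 * logA1)
```

Evaluating this ratio at the overlaps f(d_j) and g(d_j) for d_j = |j + φ| then traces out the φ-dependence of the pairwise correction.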
for θ = 0, σ_in = 0.4, σ_out = 1, g_0 ≈ 3.8. However, this is a differential entropy. To relate it to the answer in the 1-bit case, we need to consider a quantization of the signal. If we assume that the signal can be quantized using 1 bit for its sign and n bits for its magnitude, we have, for n sufficiently large (Cover 1999),

    H(X_Δ | R) ≈ H̃(X | R) + (n + 1) log 2,

where X_Δ is the quantized variable and H̃ denotes differential entropy. From this result, it is clear that, while two far-separated RFs encode precisely 2 H(X_Δ | R), two perfectly aligned RFs encode only 2 H(X_Δ | R) − log 2, where the correction is due to the shared sign bit. Thus, while it is correct that aligned RFs exhibit some redundancy (1 bit for θ = 0, less as θ increases), this redundancy is typically much smaller than the information about the stimulus magnitude. More importantly, when considering how to align ON and OFF cells, informational losses due to shared sign bits are dwarfed by the size of I_2 for neighbors located outside the gray zone in Fig 4A.
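The quantization relation above rests on the standard result that, for small bin width Δ, the entropy of the quantized variable satisfies H(X_Δ) ≈ H̃(X) − log Δ (Cover 1999); with Δ = 2^{-(n+1)}, this gives the (n + 1) log 2 term. A quick numerical check for a Gaussian signal (σ and Δ are illustrative choices):

```python
import numpy as np
from scipy.stats import norm

sigma = 1.0
delta = 2.0**-8   # bin width, corresponding to 8 magnitude bits per unit
# bins covering +/- 8 sigma; the probability mass beyond is negligible
edges = np.arange(-8.0, 8.0 + delta, delta) * sigma
p = np.diff(norm.cdf(edges, scale=sigma))
p = p[p > 0]
H_quant = -np.sum(p * np.log(p))                    # entropy of X_delta
h_diff = 0.5 * np.log(2 * np.pi * np.e * sigma**2)  # differential entropy
# Cover's approximation: H(X_delta) ~ h(X) - log(delta)
assert abs(H_quant - (h_diff - np.log(delta))) < 1e-3
```

The same bookkeeping applied to the conditional entropy H̃(X | R) yields the expression used above.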