Efficient coding, channel capacity and the emergence of retinal mosaics

Among the most striking features of retinal organization is the grouping of its output neurons, the retinal ganglion cells (RGCs), into a diversity of functional types. Each of these types exhibits a mosaic-like organization of receptive fields (RFs) that tiles the retina and visual space. Previous work has shown that many features of RGC organization, including the existence of ON and OFF cell types, the structure of spatial RFs, and their relative arrangement, can be predicted on the basis of efficient coding theory. This theory posits that the nervous system is organized to maximize information in its encoding of stimuli while minimizing metabolic costs. Here, we use efficient coding theory to present a comprehensive account of mosaic organization in the case of natural videos as the retinal channel capacity—the number of simulated RGCs available for encoding—is varied. We show that mosaic density increases with channel capacity up to a series of critical points at which, surprisingly, new cell types emerge. Each successive cell type focuses on increasingly high temporal frequencies and integrates signals over large spatial areas. In addition, we show theoretically and in simulation that a transition from mosaic alignment to anti-alignment across pairs of cell types is observed with increasing output noise and decreasing input noise. Together, these results offer a unified perspective on the relationship between retinal mosaics, efficient coding, and channel capacity that can help to explain the stunning functional diversity of retinal cell types.


Introduction
The retina is one of the most intensely studied neural circuits, yet we still lack a computational understanding of its organization in relation to its function. At a structural level, the retina forms a three-layer circuit, with its primary feedforward pathway consisting of photoreceptors to bipolar cells to retinal ganglion cells (RGCs), the axons of which form the optic nerve [1]. RGCs can be divided into 30-50 functionally distinct cell types (depending on species) with each cell responsive to a localized area of visual space (its receptive field (RF)), and the collection of RFs for each type tiling space to form a "mosaic" [2,3,4,5]. Each mosaic represents the extraction of a specific type Preprint. Under review. of information across the visual scene by a particular cell type, with different mosaics responding to light increments or decrements (ON and OFF cells), high or low spatial and temporal frequencies, color, motion, and a host of other features. While much is known about the response properties of each RGC type, the computational principles that drive RGC diversity remain unclear.
Efficient coding theory has proven one of the most powerful ideas for understanding retinal organization and sensory processing. Efficient coding posits that the nervous system attempts to encode sensory input by minimizing redundancy subject to biological costs and constraints [6,7]. As more commonly formulated, it seeks to maximize the mutual information between sensory data and neural representations, with the most common cost in the retinal case being the energetic cost of action potentials transmitted by the RGCs. Despite its simplicity, this principle has proven useful, predicting the center-surround structure of RFs [8], the frequency response profile of contrast sensitivity [9], the structure of retinal mosaics [10,11], the role of nonlinear rectification [12], different spatiotemporal kernels [13], and inter-mosaic arrangements [14,15].
While previous studies have largely focused on either spatial or temporal aspects of efficient coding, we optimize an efficient coding model of retinal processing in both space and time to natural videos [16]. We systematically varied the number of cells available to the system and found that larger numbers of available cells led to more cell types. Each of these functionally distinct types formed its own mosaic of RFs that tiled space. We show that when and how new cell types emerge and form mosaics is the result of tradeoffs between power constraints and the benefits of specialized encoding that shift as more cells are available to the system. We show that cell types begin by capturing low-frequency temporal information and capture increasingly higher-frequency temporal information over larger spatial RFs as new cell types form. Finally, we investigated the relative arrangement of these mosaics and their dependence on noise. We show that mosaic pairs can be aligned or anti-aligned depending on input and output noise in the system [14]. Together, these results demonstrate for the first time how efficient coding principles can explain, even predict, the formation of cell types, and which types are most informative when channel capacity is limited.

Model
The model we develop is an extension of [14], a retinal model for efficient coding of natural images, which is based on a mutual information maximization objective proposed in [10]. The retinal model takes D-pixel patches of natural images x ∈ R D corrupted by input noise n x ∼ N (0, C nin ), filters these with unit-norm linear kernels {w j ∈ R D | ∥w j ∥ = 1} j=1,··· ,J representing J RGCs, and then feeds the resulting signals y j = w ⊤ j (x+n in ) through softplus nonlinearities η(y) = log 1 + e βy /β (we used β = 0.25) with gain γ j and threshold θ j . Finally, these signals are further corrupted by additive output noise n out ∼ N (0, C nout ), to produce firing rates r j : r j = γ j · η(y j − θ j ) + n out,j . (1) The model learns parameters w j , γ j , and θ j to maximize the mutual information between the inputs x and the outputs r, under a mean firing rate constraint: subject to E[r j ] = 1.
Here C x is the covariance matrix of the input distribution, W ∈ R D×J contains the filters w j as its columns, the gain matrix G = diag γ j dη dy | yj −θj , and the noise covariances are C nin = σ 2 in 1 D×D and C nout = σ 2 out 1 J×J . This objective is equivalent to the formulation in [10], which assumes normally distributed inputs and locally linear responses in order to approximate the mutual information in a closed form.
Here, we extend this model to time-varying inputs x(t) ∈ R D representing natural videos ( Figure  1A-B), which are convolved with linear spatiotemporal kernels {w j (t)} j=1,··· ,J : We additionally assume that the convolution kernels are separable in time and space: and the temporal kernels are unit-norm impulse responses taking the following parametric form: where α j , α ′ j , τ j > 0, τ ′ j > 0 are learnable parameters, and n ∈ N is fixed. Previous work assumed an unconstrained form for these filters, adding zero-padding before and after the model's image inputs to produce the characteristic shape of the temporal filters in primate midget and parasol cells [13], but this zero-padding represents a biologically implausible constraint, and the results fail to correctly reproduce the observed delay in retinal responses [17,18,19]. Rather, optimizing (2) with unconstrained temporal filters produces a filter bank uniformly tiling time (Supplementary Figure 5).
By contrast, (6) is motivated by the arguments of [20], which showed that the optimal minimum-phase temporal filters of retinal bipolar cells, the inputs to the RGCs, take the form when ωτ ≪ 1. Thus, we model RGC temporal filters as a linear combination of these forms. In practice, we take only two filters and use n = 6 rather than n = 3, since these have been shown to perform well in capturing observed retinal responses [19]. The results produced by more filters or different exponents are qualitatively unchanged (Supplementary Figure 7). For training on video data, we use discrete temporal filters and convolutions with Finally, while unconstrained spatial kernels w j converge to characteristic center-surround shapes under optimization of (2) ( Figure 1C), for computational efficiency and stability, we parameterized these filters using a radially-symmetric difference of Gaussians where r measures the spatial distance to the center of the RF. The parameters a j , b j , c j determining the spatial kernel shape, as well as the center location, are different for each RGC j, allowing individual RGCs to have different RF sizes and center-surround strengths. The result of optimizing (2) using these forms is a set of spatial and temporal kernels ( Figure 1D-E) that replicate experimentallyobserved shapes and spatial RF tiling.
3 Efficient coding as a function of channel capacity: linear theory Before presenting results from our numerical experiments optimizing the model (2, 3), we begin by deriving intuitions about its behavior by studing the case of linear filters analytically. That is, We assume a single gain γ for all cells, no bias (θ = 0), and a linear transfer function η(y) = y. As we will see, this linear analysis correctly predicts the same types of mosaic formation and filling observed in the full nonlinear model. Here, we sketch the main results, deferring full details to Appendix A.

Linear model in the infinite retina limit
For analytical simplicity, we begin by assuming an infinite retina on which RFs form mosaics described by a regular lattice. Under these conditions, we can write the log determinants in (2) as integrals and optimize over the unnormalized filter v ≡ γw subject to a power constraint: where C x (k) is the Fourier transform of the stationary image covariance C x (z − z ′ ), the integral is over all frequencies k ∈ G 0 unique up to aliasing caused by the spatial regularity of the mosaic, and the sums over g account for aliased frequencies (Appendix A.1). In [8], the range [−π, π] is used for the integral, corresponding to a one-dimensional lattice and units of mosaic spacing ∆z = 1.
Now, solving the optimization in (9) results in a spatial kernel with the spectral form (Appendix A.2) where k = ∥k∥ and ν is chosen to enforce the constraint on total power. This is exactly the solution found in [8], linking it (in the linear case) to the model of [10,11]. Note, however, that (10) is only nonzero within G 0 , since RF spacing sets an upper limit on the passband of the resulting filters.
The generalization of this formulation to the spacetime case is straightforward. Given a spacetime stationary image spectrum C x (z − z ′ , t − t ′ ) and radially-symmetric, causal filter w(z, t), the same infinite retina limit as above requires calculating determinants across both neurons i, j and time points t, t ′ of matrices with entries of the form Again, such matrices can be diagonalized in the Fourier basis, with the result that the optimal spacetime filter once again takes the form (10) Figure 2A depicts the frequency response of this filter in d = 1 spatial dimensions, with corresponding spatial and temporal sections plotted in Figures 2B-C.

Multiple cell types and the effects of channel capacity
Up to this point, we have only considered a single type of filter v(k, ω), corresponding to a single cell type. However, multiple cell types might increase the coding efficiency of the entire retina if they specialize, devoting their limited energy budget to non-overlapping regions of frequency space. Indeed, optimal encoding in the multi-cell-type case selects filters v and v ′ that satisfy v * (k, ω)v ′ (k, ω) = 0, corresponding encoding independent visual information (Appendix A.4).  This result naturally raises two questions: First, how many filter types are optimal? And second, how should a given budget of J RGCs be allocated across multiple filter types? As detailed in Appendix A.5, we can proceed by analyzing the case of a finite retina in the Fourier domain, approximating the information encoded by a mosaic of J RGCs with spatial filters given by (10) and nonoverlapping bandpass temporal filters that divide the available spectrum (e.g., Figure 2B, C). Following [21], we approximate the correlation spectrum of images by the factorized power law C x (k, ω) ≃ A k α ω 2 with α ≈ 1.3 and find that in this case, the optimal filter response exhibits two regimes as a function of spatial frequency (Supplementary Figure 1A): First, below k f = A/σ 2 in ω 2 1/α , the optimal filter is separable and log-linear, and the filtered image spectrum is white: where ν, the Lagrange multiplier in (9) that enforces the power constraint, scales as 1/P for small values of maximal power P and 1/P 2 for larger values (Supplementary Figure 1D). Second, for k ≳ k f , the filter response decreases as k −α/2 until reaching its upper cutoff at k c = k f /(νσ 2 out ) 1/α , with the filtered image spectrum falling off at the same rate (Supplementary Figure 1B).
But what do these regimes have to do with mosaic formation? The link between the two is given by the fact that, for a finite retina with regularly spaced RFs, adding RGCs decreases the distance between RF centers and so increases the resolving power of the mosaic. That is, the maximal value of k grows roughly as k ∼ √ J in d = 2, such that larger numbers of RGCs capture more information at increasingly higher spatial frequencies (Supplementary Figure 1A). However, while information gain is roughly uniform in the whitening regime, it falls off sharply for k ≳ k f (Supplementary Figure 1C), suggesting the interpretation that the k ≲ k f regime is a "mosaic filling" phase in which information accumulates almost linearly as RFs capture new locations in visual space, while the k ≳ k f regime constitutes a "compression phase" in which information gains are slower as RFs  shrink to accommodate higher numbers ( Figure 2D). Indeed, one can derive the scaling of total information as a function of J: where P 0 is the power budget per RGC and J f is the RGC number corresponding to k = k f . Thus, mosaic filling exhibits diminishing marginal returns ( Figure 2E), such that new cell types are favored when the marginal gain for growing mosaics with lower temporal frequency drops below the gain from initiating a new cell type specialized for higher temporal frequencies. Moreover, the difference between these gain curves implies that new RFs are not added to all mosaics at equal rates, but in proportion to their marginal information ( Figure 2F). As we demonstrate in the next section, these features of cell type and mosaic formation continue to hold in the full nonlinear model in simulation.

Experiments
We analyzed the characteristics of the optimal spatiotemporal RFs obtained from the model (2, 3) trained on videos from the Chicago Motion Database [22]. Model parameters for spatial kernels, temporal kernels, and the nonlinearities were jointly optimized using Adam [23] to maximize (2) subject to the mean firing rate constraint (3) using the augmented Lagrangian method with the quadratic penalty ρ = 1 [24]. Further technical details of model training are in Appendix E.
As previously noted, the power spectral density of natural videos can be well approximated by a product of spatial and temporal power-law densities, implying an anticorrelation between high spatial and temporal frequency content [21]. Supplementary Figure 4 shows the data spectrum of the videos in our experiments is also well-approximated by separable power-law fits. To examine the effect of these statistics on the learned RFs, we divided the dataset into four progressively smaller subsets by the proportion of their temporal spectral content below 3 Hz, their spectral attenuation. Using values of 70%, 80%, and 90% then yielded a progression of datasets ranging from most videos to only the slowest videos (Supplementary Figure 3A, B). Indeed, when the model was trained on these progressively slower data subsets, it produced only temporal smoothing filters, whereas the same model trained on all videos produced a variety of "fast" temporal filter types (Supplementary Figure  3C). We also note that these experiments used unconstrained spatial kernels in place of (8), yet still converged on spatial RFs with typical center-surround structure as in [10,15,14]. Thus, these preliminary experiments suggest that the optimal encoding strategy-in particular, the number of distinct cell types found-depends critically on the statistics of the video distribution to be encoded.

Mosaics fill in order of temporal frequency
As the number of RGCs available to the model increased, we observed the formation of new cell types with new spectral properties ( Figure 4). We characterized the learned filters for each RGC in terms of their spectral centroid, defined as the center of mass of the Fourier (spatial) or Discrete Cosine (temporal) transform. Despite the fact that each model RGC was given its own spatial and temporal filter parameters (8,6), the learned filter shapes strongly clustered, forming mosaics with nearly uniform response properties ( Figure 4A-C). Critically, the emergence of new cell types shifted the spectral responses of previously established ones, with new cell types compressing the spectral windows of one another as they further specialized. Moreover, mosaic density increased with increasing RGC number, shifting the centroids of early mosaics toward increasingly higher spatial frequencies. This is also apparent in the forms of the typical learned filters and their power spectra: new filters selected for increasingly high-frequency content in the temporal domain ( Figure 4D).
We likewise analyzed the coverage factors of both individual mosaics and the entire collection, defined as the proportion of visual space covered by the learned RFs. More specifically, we defined the spatial radius of an RF as the distance from its center at which intensity dropped to 20% of its peak and used this area to compute a coverage factor, the ratio of total RF area to total visual space (π/4 of the square's area due to circular masking). Since coverage factors depend not simply on RGC number but on RF density, they provide an alternative measure of the effective number of distinct cell types learned by the model. As Figure 4E shows, coverage increases nearly linearly with RGC number, while coverage for newly formed mosaics increases linearly before leveling off. In other words, new cell types initially increase coverage of visual space by adding new RFs, but marginal gains in coverage diminish as density increases. In all cases, the model dynamically adjusts the number of learned cell types and the proportion of RGCs assigned to them as channel capacity increases.

Phase changes in mosaic arrangement
In addition to retinal organization at the level of mosaics, a pair of recent papers reported both experimental [15] and theoretical [14] evidence for an additional degree of freedom in optimizing information encoding: the relative arrangement of ON and OFF mosaics. Jun et al. studied this for the case of natural images in [14], demonstrating that the optimal configuration of ON and OFF mosaics is alignment (RFs co-located) at low output noise levels and anti-alignment (OFF RFs between ON RFs and vice-versa) under higher levels of retinal output noise. Moreover, this transition is abrupt, constituting phase change in optimal mosaic arrangement.
We thus asked whether learned mosaics exhibited a similar phase transition for natural video encoding.
To do so, following [14], we repeatedly optimized a small model (J = 14, 7 ON, 7 OFF) for multiple learned filter types while systematically varying levels of input and output noise. In each case, one ON-OFF pair was fixed at the center of the space, while the locations of the others were allowed to vary. We used RF size D = 8 2 pixels for Slow and D = 12 2 for FastA and FastB to allow the size of spatial kernels to be similar to those of the previous experiments, and we imposed the additional constraint that the shape parameters a j , b j , and c j in (8) are shared across RGCs.
Under these conditions, the six free pairs of RFs converged to either aligned (overlapping) or anti-aligned (alternating) positions along the edges of the circular visual space, allowing for a straightforward examination of the effect of input and output noises on mosaic arrangement. Figure  5A-C shows that the phase transition boundaries closely follow the pattern observed in [14]: increasing output noise shifts the optimal configuration from alignment to anti-alignment. Moreover, for each of the tested filters, increasing input noise discourages this transition. This effect also follows from the analysis presented in [14], since higher input noise increases coactivation of nearby pairs of RFs, requiring larger thresholds to render ON-OFF pairs approximately indpendent (Appendix B).

Discussion
Related work: As reviewed in the introduction, this study builds on a long line of work using efficient coding principles to understand retinal processing. In addition, it is related to work examining encoding of natural videos [25,22,16] and prediction in space-time. The most closely related work to this one is that of [13], which also considered efficient coding of natural videos and considered the tradeoffs involved in multiple cell types. Our treatment here differs from that work in several    In all three cases, the optimal configuration changes from aligned to anti-aligned when output noise increases or input noise decreases. Blue bars denote alignment as measured by median distance to the nearest RF center of the opposite type. key ways: First, while [13] was concerned with demonstrating that multiple cell types could prove beneficial for encoding (in a framework focused on reconstruction error), that study predetermined the number of cell types and mosaic structure and only optimized their relative spacing. By contrast, this work is focused on how the number of cell types is dynamically determined, and how the resulting mosaics arrange themselves, as a function of the number of units available for encoding (i.e., the channel capacity). Specifically, we follow previous efficient coding models [8,9,10,11] in maximizing mutual information and do not assume an a priori mosaic arrangement, a particular cell spacing, or a particular number of cell types-all of these emerge via optimization in our formulation. Second, while the computational model of [13] optimized strides for a pair of rectangular arrays of RGCs, we individually optimize RF locations and shapes, allowing us to study changes in optimal RF size and density as new, partial mosaics begin to form. Third, while [13] used zero-padding of natural videos to bias learned temporal filters toward those of observed RGCs, we link the form of temporal RFs to biophysical limits on the filtering properties of bipolar cells, producing temporal filters with the delay properties observed in real data. Finally, while [13] only considered a single noise source in their model, we consider noise in both photoreceptor responses (input noise) and RGC responses (output noise), allowing us to investigate transitions in the optimal relative arrangement of mosaics [14,15].
We have shown that efficient coding of natural videos produces multiple cell types with complementary RF properties. In addition, we have shown for the first time that the number and characteristics of these cell types depend crucially on the channel capacity: the number of available RGCs. As new simulated RGCs become available, they are initially concentrated into mosaics with more densely packed RFs, improving the spatial frequency bandwidth over which information is encoded. However, as this strategy produces diminishing returns, new cell types encoding higher-frequency temporal features emerge in the optimization process. These new cell types capture information over distinct spatiotemporal frequency bands, and their formation leads to upward shifts in the spatial frequency responses of previously formed cell types. Moreover, pairs of ON and OFF mosaics continue to exhibit the phase transition between alignment and anti-alignment revealed in a purely spatial optimization of efficient coding [14], suggesting that mosaic coordination is a general strategy for increasing coding efficiency. Furthermore, despite the assumptions of this model-linear filtering, separable filters, firing rates instead of spikes-our results are consistent with observed retinal data. For example, RGCs with small spatial RFs exhibit more prolonged temporal integration: they are also more low-pass in their temporal frequency tuning. Second, there is greater variability in the size and shape of spatial RFs at a given retinal location, but temporal RFs exhibit remarkably little variability in our simulations and in data [19]. Thus, these results further testify to the power of efficient coding principles in providing a conceptual framework for understanding the nervous system.

A Analysis details for the linear model A.1 Derivation of the objective in the infinite retina limit
Here, we provide details for the derivation of (9). To begin, we assume that spatial RFs centered at locations z i are circularly symmetric: w(z i , z) = w(|z i − z|). Moreover, we assume that the RF centers form a Bravais lattice R with basis vectors (a 1 , a 2 ) in d = 2 such that z i = n 1i a 1 + n 2i a 2 for some integers (n 1i , n 2i ) and the mosaic is translationally invariant under these shifts. Likewise, there exist basis vectors (b 1 , b 2 ) for the dual lattice G such that a i · b j = 2πδ ij , with the interpretation that while the RF mosaic is invariant under shifts by integer multiples of a i , its representation in Fourier space is invariant under shifts by the b i . Note that this is distinct from the assumption that the encoded images themselves are invariant under such translations. Here, for simplicity, we will transition to an infinite retina, though we will relax this assumption later. Now, consider the determinants in (2). In the linear, discrete purely spatial case, the numerator contains a determinant over J × J matrices with elements which can be written as in the continuous case, where we have made use of the translational and rotational symmetry of both w and C x to write their Fourier transforms in terms of k = ∥k∥. Of course, in the continuum case, J → ∞, and we have for a symmetric, positive-definite matrix A with λ the eigenspectrum of A in the continuum limit.
Fortunately, the continuum expression (13) can be diagonalized in the Fourier basis. Let ψ j (k ′ ) = e ik ′ ·zj . Then where again G = {pb 1 +qb 2 | p, q ∈ Z} is the dual lattice, vol(G 0 ) = det is the volume of the unit cell of the dual lattice, and z i · g is an integer multiple of 2π by definition of the lattices R and G. Likewise, we note that, due to aliasing from the mosaic spacing, ψ(k + g) = ψ(k), so the only unique eigenvalues are those with k ∈ G 0 , the unit cell of the dual lattice. Thus, by a similar calculation for the denominator, the terms in (2) become g∈G |v(k + g)| 2 σ 2 in + σ 2 out in the continuum limit.
As for the constraint term, the restriction (3) fails to generalize to the linear case, where Er j = 0. Instead, we note that for (1), we have by Hölder's inequality. That is, we can enforce a looser restriction on firing rates by bounding the power used by the filter. Of course, in the linear case considered above, θ = 0, η(y) = y, and E[y] = 0, so that the inequality becomes In practice, we bound the square of this expression, which yields the continuous objective (9).
A.2 Derivation of the optimal linear filter As noted in Section 3.1 the optimal solution for (9) takes the form (10). However, this expression differs in two key aspects from the form originally presented in [8]: First, (10) involves a rectification operation [·] + on the coefficients of the filter. Second, the integral over k is over the unit cell of the dual lattice G 0 , not over the interval [−π, π] (in dimension d = 1). Here, we provide details pertaining to both of these points.
Our starting point is (9). Here, to simplify notation, we work in dual lattice units such that vol(G 0 ) = (2π) 2 , since we can restore this later by the transformation Moreover, as in [8], we recognize that the optimization objective is a function only of the power spectrum |v(k + g)| 2 . Varying with respect to this quantity, however, requires that we enforce a positivity constraint, which implies a modified objective where α(k + g) is a Lagrange multiplier (one per frequency) enforcing the positive-definiteness of the filter power. Taking derivatives with respect to |v(k + g)| 2 and rearranging then gives: supplemented by the complementary slackness conditions α(k + g)|v(k + g)| 2 = 0.
Two things are important to note about this equation: First, if the sums over g are reduced to single terms (i.e., there is no power in either the correlation or filter spectra outside the unit cell of the reciprocal lattice), the term in brackets in the numerator of the right-hand side vanishes. If, in addition, α = 0, we are back with the same expression as found in [8]. Second, the left-hand side of this equation is the same for all g ∈ G, which implies that the right-hand side must be as well. Thus, Recall here that, with respect to the optimization, C x , σ 2 in , and σ 2 out are constants and β(k) is a function of |v(k + g)| 2 and these constants. Only |v(k + g)| 2 and α(k + g) are variables (and these are tied by complementary slackness). This suggests that the term in brackets cannot vanish for general C x unless only a single v(k + g ′ ) in the sum is nonzero.
2. Assume v(k + g) = 0 but v(k + g ′ ) ̸ = 0 for a single g ′ ̸ = g. Then complementary slackness allows α(k + g) ̸ = 0 and we have Of course, if we evaluate the right-hand side of (16) at g = g ′ , we get (assuming ν > 0) and we can rewrite the optimality condition as If we divide both sides by σ 2 out C x (k + g ′ ), we see that a solution with α(k + g) ≥ 0 always exists provided Now when Cx(k+g) Cx(k+g ′ ) > 1, the middle term in the inequality is positive but the first term on the left is larger than the right-hand side, so the inequality can never hold. By contrast, when Cx(k+g) Cx(k+g ′ ) < 1, the middle term is negative and the first term on the left is already smaller than the right-hand side, and the inequality always holds. Thus, the condition we need is that That is, the filter should respond only at the (aliased) frequency with highest power.
3. Assume v(k + g) = 0 for all g ∈ G. This happens, for instance, when the RF filter has a maximum frequency response at k max , which gives v(k + g) = 0 for all ∥k∥ > k max . Then, from (16) first-order stationarity requires Together, these calculations suggest the following pattern in the filter response v(k + g): Assuming that C x (k) is a monotonically decreasing function of k = ∥k∥, the filter can only respond for at most one aliased value of the frequency k. The optimal frequency at which to respond is, from point 2 above, the one with the highest power, and given our assumption about C x (k), this is the one with the lowest frequency. As a result, v(k) only has support within the unit cell G 0 , the sums over g ∈ G vanish, and at these frequencies, the solution is given by the term inside the brackets in (10). Moreover, at values of k for which the term inside the brackets in (10) would be 0 or negative, the correct solution is given by v(k + g) = 0 with α(k + g) given by (18). Thus, returning to (16), we have and collecting terms gives in . This has the solution When the term in brackets is positive, complementary slackness requires α(k) = 0 and thus ν ′ = ν, while when the bracketed term is less than or equal to zero, α(k) need not be zero, but complementary slackness requires |v(k)| 2 = 0, which holds when α(k) satisfies (18). Thus, the full solution is given by (10).

A.3 Extenstion to the spatiotemporal case
Consider the same setup as in the previous section but with a spacetime filter γw(z, t) = v(z, t). Now, the relevant correlations inside the determinant, analogous to (13), are between firing rates at different filters at different times: and once again we can diagonalize this by taking eigenvectors Now, just as in the spatial domain, this leads to an objective in the Fourier domain Clearly, by the same arguments as above, the solution to this is once again (10) with v(k) → v(k, ω), subject to a normalization condition over both spatial and temporal frequencies.
However, the solution given in (10) only determines the power spectrum of the optimal filter, not its phase. That is, if v(k, ω) = |v(k, ω)|e iϕ(k,ω) , ϕ(k, ω) is undetermined. However, for the minimumphase system, which is both causal and imposes the minimum temporal delay between the incoming signal and its filtered response [20], there is a relation between the filter's amplitude |v(k, ω)| (treating k as constant) and its phase ϕ(k, ω): where H[·] is the Hilbert transform (over ω) [26]. More explicitly, the minimum phase filter has log |v(k, ω)| + iϕ(k, ω) = log |v(k, ω)| − iH[log |v(k, ω)|], which is the complex conjugate of the analytic signal and so is analytic in the lower half plane (and thus causal). For a given, fixed k, this can be used to plot the causal temporal filters as depicted in Figure 2C. Note also that the spatiotemporal filter found by generalizing (10) is not exactly separable, though it may be well approximated by the product of a spatial and temporal filter in certain regimes.
Finally, we note that the solution (19) in both its spatial and spatiotemporal forms exhibits a kink in its power spectral density stemming from the positive rectification. This in turn implies that the optimal spatial filters exhibit ringing in both the space and time domains, unlike the actual observed retinal filters. Of course, these solutions, like many ideal filters, are all but impossible to implement in real systems, necessitating tradeoffs among ripple, rolloff, and other attributes [26]. Indeed, as noted above, allowable temporal filters are constrained by bipolar cell response properties [20]. As a result, in Figure 2B-D, we have smoothed the optimal power spectral densities with a double exponential kernel e −a|k| (e −a|ω| ) prior to transforming back to the space (time) domain.

A.4 Independence of multiple filters
Here, we generalize the approach given in Appendix A.1 to multiple filter types, arguing that requiring these filters not overlap in frequency space is sufficient to optimize the objective 2 in the infinite retina limit. The only relevant difference between this case and that of Appendix A.1 is that, in addition to terms like (13), the determinants also contain cross-terms of the form The presence of such cross-terms, which arises from correlations between activity in the two mosaics, reduce the magnitude of the determinants and thus overall information. However, we argue that such cross-terms should vanish for the optimal solution, yielding an information objective that is equivalent to a sum of individual mosaic terms like (9). Intuitively, this is because our power constraint in (9) remains unchanged under an orthogonal transformation of the filters, while information is subadditive when cross-terms between filters are nonzero. Thus, we can always increase information by choosing a filter basis in which both C x and C in are block diagonal and the mosaics decouple. This logic is similar to the argument of [13], where it was reconstruction performance that was equivalent under transformations of the filters but costs were superadditive.
However, this reasoning does not fully fix the choice of independent filters, since there are multiple ways to make off-diagonal blocks in the determinants vanish. For example, in the discrete (pixel, frame) formulation of the problem, the generalized eigenvectors of (C x , C in ) represent a special solution (the one that fully diagonalizes both matrices). Of course, these filters need not optimize the information objective. But there is an alternative solution that removes only the off-diagonal blocks in the determinants, reducing the problem to one of again optimizing individual filters with the objective (9): choosing spectrally disjoint filters with v * (k, ω)v ′ (k, ω) = 0 for different mosaics defined by v and v ′ . Note that this condition is merely sufficient, not necessary, to decouple mosaics, but it is independent of any assumptions on the image correlation structure C x or the filters v(k, ω). More specifically, it does not depend on any assumptions of spacetime separability for either.
In [13], the authors considered a form of this same bandwidth partitioning in the spatial case (v * (k)v ′ (k) = 0), but as we show Appendix A.5, this approach encounters a limit beyond the first two mosaics, when the spatial passband spectra of new filters substantially overlap. Rather, the more general solution is that mosaic filters are band-limited in both space and time, arranging themselves to tile minimally overlapping regions with highest power in the (k, ω) plane, as we find in the full nonlinear model (Figure 4).

A.5 Effects of channel capacity and new mosaic formation
Now we reconsider the analysis of Appendices A.1 and A.3 in the case that the number of RGCs J in the system is varied. To do so, we take a finite retina of size (L 1 , L 2 ) = (M 1 ∥a 1 ∥, M 2 ∥a 2 ∥) along each basis vector of the lattice R and assume a periodic extension of the signal outside this domain. This approximation allows us to continue working in the Fourier domain but sets a lower bound on the spatial frequencies a mosaic can hope to resolve (in addition to the upper bound resulting from the mosaic lattice spacing). Note that for a fixed retina size, increasing M i is equivalent to reducing ∥a i ∥, i.e., decreasing the spacing between RF centers.
More explicitly, let z j = n 1j a 1 +n 2j a 2 as before, with RGC number J ≈ vol(R)/vol(R 0 ) = M 1 M 2 . That is, the number of RGCs is the size of the retina divided by the unit cell volume of the Bravais lattice. In addition, the assumed periodicity of the signals implies that Fourier transforms involve only frequencies of the form k m = m1 M1 b 1 + m2 M2 b 2 for m i ∈ Z. From this, following the derivation preceding (14), we have that ψ j (k l ) = e izj ·k l are eigenvectors of the J × J matrix F ij : Clearly, this is the analogue of (14), and the form of the optimal spatial filter 10 once again holds in spacetime, albeit only at a finite set of frequencies k m ∈ G 0 . In what follows, we will be interested in the effect of changing J on the information encoded by the filters.
Turning to the images themselves, we here consider an idealized form of the power spectrum of natural videos [21], where α ≈ 1.3. Of course, this is not the same as the spectrum of images presented to the retina, since the latter is low-pass filtered via the modulation transfer function of the eye [9], but here we analyze this simpler form, since the qualitative results are similar in both cases. For this spectrum, the optimal filter (10) takes the form Now, we will define two useful frequencies in terms of our parameters Often, νσ 2 out ≪ 1, so that k c ≫ k f . In terms of these, the filter can be rewritten |v(k, ω)| 2 ∝ 1 2 1 1 +k(ω) 1 + 4 νσ 2 outk (ω) + 1 − 1 + withk ≡ (k/k f (ω)) α . That is, up to an overall rescaling, the filter depends only on the ratiok and the dimensionless parameter νσ 2 out . Naively, changes in k f simply rescale the filter in space, while ν is related to the power consumed by the filter. However, given a fixed power budget, these two parameters are tied, leading to a single family of spatial filters that depend on the temporal frequency ω (and vice-versa). The optimal filter response comprises a log-linear region for k < k f and a negative log-linear region for k f < k < k c . As RGC number J ∼ k 2 increases through the former, the mosaic fills, pushing the bandwidth limit (red line) upward. When the mosaic is filled, the upper bandwidth limit continues to increase as new RFs pack into the fixed retinal space and compress the mosaic. (B) Optimal filter power spectral density for different values of characteristic temporal response frequency ω 0 . Brighter colors indicate increasing frequency (see legend in C). In each case, the filtered spectrum is flat and proportional to ν −1 for k < k f . For k > k f , the power is noise-dominated, with a k −α/2 tail. (C) Information as a function of frequency for the same setup as in B. Note that information content falls precipitously for k > k f (the y axis is not logged). The dotted line indicates − log νσ 2 out . (D) As the power budget increases, ν decreases, with ν ∼ P −1 at low power and ν ∼ P −2 as P → ∞. For all plots, A = 100, σ in = 0.4, σ out = 1.25. In A-C, ν = 10 −3 . In A, log 10 ω 0 = −2.3; in D, log 10 ω 0 = 1.2.
We can gain additional insight by examining two important regimes in this distribution: consistent with the definition of k c as an ω-dependent cutoff frequency . Conversely, when k ≲ k f (ω), independent of J, the filter is separable, and the image spectrum is white [9]: Supplementary Figure 1A shows that indeed, the log-scaled filter magnitude is characterized by a linear regime when k ≲ k f , followed by a 1 k α/2 decay when k ≳ k f . Thus, despite the fact that power density is highest in the first regime, the second is light-tailed and stretches exponentially longer, with the result that total power is dominated by the tail (Supplementary Figure 1B). And indeed, numerical integration shows that power initially scales as ν −1 for large values, followed by P ∼ ν −1/2 as ν → 0, consistent with this claim (Supplementary Figure 1D). By contrast, as Supplementary Figure 1C shows, information density (log terms in (20)) falls off much more rapidly with frequency, such that it is the k ≲ k f regime that dominates, and information I ≈ − log ν σ 2 out . However, the above analysis is all in the continuous (infinite retina case). For the finite retina case, we note two important changes: First, as the number of RGCs increases, so does the power budgeted to the system. That is, the allowed power is assumed to be P (J) = P 0 J. This need not be the case -some other scaling could just as well be chosen -but it is consistent with the idea that metabolic costs are driven by factors specific to each cell. Second, the information in the system is no longer given by the the first term in the integral (20) but by a sum over discrete frequencies For example, when J = 1, M 1 = M 2 = 1 and m 1 = m 2 = 0 is the only option -the periodicity of the signals is the periodicity of the lattice -whereas J = 2 allows either M 1 = 2 or M 2 = 2 but not both, and J = 4 has M 1 = M 2 = 2. In general, for large enough J, we expect k m = ∥k m ∥ ∼ √ J.
With these conventions, we then have modified expressions for both the information in the system and its power consumption: where we have chosen to focus on filters narrowly concentrated around a single frequency ω 0 and again usedk ≡ (k/k f (ω 0 )) α .
Several things are important to note about the scaling relationships in (29) and (30). First, as discussed above, fork ≲ 1 (k ≲ k f ), P (k) + σ 2 out ≈ ν −1 ≈ P 0 + σ 2 out , independent of k. That is, contributions to (30) are all roughly the same for each frequency. Similarly, the denominator in (29) isk j kj +1 P j + σ 2 out ≈ σ 2 out +k j P j to lowest order ink, giving where we have used the volume equivalence k 2 J /k 2 f = J/J f along with the integral approximation We can understand this k ≲ k f regime as a mosaic filling phase: For small energy budgets P 0 ≪ σ 2 out , new RFs are added at their preferred size (k −1 peak ≈ k −1 f ) until the spacing between RFs reaches a minimum of roughly 2π/k f (at which point C x (k) ≈ σ 2 in ), with information increasing almost linearly as the mosaic covers new spatial locations. By contrast, for larger energy budgets P 0 ≳ σ 2 out , information gain is sublinear in J with a correction term ∼ J α/2 . By contrast, for k ≳ k f , and the system enters a mosaic compression phase. In this regime, noise dominates signal, P j + σ 2 out ∼ σ 2 out ( kc(ν) kj ) α/2 , and for the power above k f , we have which is again sublinear growth in J, but not so strong as the decay in (31) when P 0 ≳ σ 2 out . Finally, we consider the effects of temporal frequency ω on the relationships we have described above. In the case we have assumed, that of a filter with a narrow temporal passband centered around ω 0 , we note that (29) depends only onk j , and we havẽ Thus, if we increase ω 0 → θω 0 with θ > 1, we havek → θ 2k , which moves the numerator and denominator in (29) closer to each other and information to 0. The effect of this change in temporal frequency can be seen in Figure 2E. As a result, mosaics with the lowest temporal frequency are filled first, with new mosaics only beginning to fill once the marginal information gain for adding the first RF to a new mosaic exceeds that of adding an additional RF to an existing mosaic. In this way, the rate of each mosaic's filling decreases as new mosaics are added Figure 2F. The pairwise mutual information correction term, which captures the reduction in mutual information due to pairwise coactivations of RFs (plotted for σ out = 3.0) stays near zero for a shorter range of mosaic shifts at higher input noise levels, indicating that higher input noise favors aligned mosaics. Figure 5 in the main text shows that, for each filter type, increasing input noise levels encourage alignment. That is, the transition happens at a higher output noise level. Here, we explain this seemingly paradoxical effect using the mathematical analysis presented in [14], whose notation we follow. In that work, it was argued that the transition from aligned to anti-aligned mosaics could be understood as a process of increasing output noise leading to higher response thresholds, which in turn decorrelated nearby ON and OFF cells [12,14]. As these pairs become more independent (as measured by p 2 , their pairwise coactivation probability), they remain so even for small relative shifts in position, implying that anti-alignment does not increase coding redundancy. Add to this the fact that anti-aligned mosaics sample more unique spatial positions, and the result is that high output noise levels lead to anti-alignment under efficient coding.

B Analysis of mosaic phase transitions as a function of input noise
Conversely, though not analyzed in [14], the observation that increasing input noise levels favor alignment can be explained through the same formalism. Whereas [14] found that, in the independent neuron case, increasing output noise drove increases in response threshold (to better encode stimuli above the noise floor), the same analysis shows that increases in input noise decrease response thresholds (Supplementary Figure 3). Moreover, [14] showed that the coactivation probability between an ON-OFF RF pair at a distance of j lattice positions away in d = 1 with Gaussian inputs is given by where, for g j = w ⊤ 0 C x w j and f j = w ⊤ 0 w j , are the signal-to-noise-normalized threshold and unthresholded ON-OFF correlation, respectively. From this, it is clear (and Supplementary Figure 2A shows) that increasing σ in reducesθ not only through reductions in the optimal θ, but through increases in the normalization factor. This, in turn, reduces p 2 through an increase in the integration limits in (34). Thus, increasing σ in results in a narrower "zone of independence" (Supplementary Figure 2B) around each RF, with the result that mosaics cannot anti-align without losing information.
C Sinusoidal temporal kernels in the absence of parameterization Supplementary Figure 5 shows the distribution of temporal filters learned without using the parameterization in Equation 6, which have symmetric, sinusoidal shapes resembling Fourier bases. They also exhibit distinct clusters of temporal kernel shapes with varying spectral characteristics,  Figure 3: Optimal thresholds decrease with input noise as they increase with output noise. Calculations based on the independent neuron model of [14]. and each group corresponds to a spatial RF mosaic that tiles the space (Supplementary Figure 5E).
Supplementary Figure 5A-D shows that the temporal filters are almost perfectly orthogonal across the groups, which allows independent information to be conveyed in each group. The shapes of the temporal filters obtained in the full model (Supplementary Figure 3 and 4) can be interpreted as being as independent as possible while retaining biologically plausible filter shapes. In the case of auditory filters, it has been shown that efficient coding results in Fourier-like filters [27].

D Information gain upon adding individual RF mosaics
We examine how each of the RF mosaics contribute to the overall mutual information, by manually constructing four pairs of mosaics of spatial kernels placed in hexagonal grids, corresponding to four types of temporal filters: Slow, FastA, FastB, and FastC (6A). Given the 8 groups of spatiotemporal kernels, Figure 6B shows the order of adding mosaics such that each step brings the largest increase in the estimated mutual information. The optimal ordering indicates that it is always beneficial to add the slow mosaics first, in either order between ON and OFF (steps 1 and 2). Afterwards, the highest increase in MI is when adding one mosaic, either ON or OFF, from each shape in the order of increasing speed (steps 3-5), and then filling in the remaining mosaics again in the order of increasing speed (steps 6-8).

E Additional Technical Details
Determining the polarity of spatiotemporal kernels We used the parameterization of temporal kernels in Equation 6, and we additionally experimented with temporal kernels that can learn any unit-norm function in Supplementary Figure 5. Unlike that of spatial kernels in Equation 8 which has a positive peak intensity at the center, both parameterizations for temporal kernels are symmetric as to the positive and negative values the kernels can take, and there is no built-in indicator to distinguish ON and OFF kernels. Since ON and OFF RFs are the ones responding, respectively, to light increment and decrement [28], we determine the polarity of given temporal kernel w[t] based on the sign of its cumulative response to the unit step function u[t] up to the half duration of the kernel: In the visualizations of the spatial and temporal kernels, we flip the signs of the OFF temporal kernels as well as the corresponding spatial kernels, which keeps the separable spatiotemporal kernels unchanged (Equation 5). This makes the temporal kernels always visualized as ON filters, and the spatial kernels are displayed with a bright center for ON RFs and a dark center for OFF RFs. We also note that we plot the temporal kernels so that t = 0 is at the right and T = t − 1 is at the left of the graphs, making it comparable to the usual direction of visualizing temporal RFs, e.g. in spike-triggered averages. Data preprocessing and training The Chicago Motion Database [22] contains 257 natural video clips ranging from bees flying around the hive to waves approaching the shore. All video clips in the dataset are square-shaped and have varying sample rates. We resize and resample the video clips to 512×512 pixels and 30 fps using ffmpeg, in addition to converting it to grayscale by taking the luminosity channel of each frame using the Python Imaging Library. Following [10] and [14], we normalize each video to have zero mean and unit variance over all video dimensions, while the sliced video patches may have different means and variances. We used square-shaped input patches x(t) of size D = 18 2 = 324 pixels unless otherwise specified, and we applied a circular mask on the input patches following [14], which effectively constrained the spatial kernels within a circle bounded by the input square. Input patches consisted of either T = 10 or 20 frames sliced from 30-fps videos, and for computational efficiency, we used "valid"-mode convolutions with temporal kernels of the same size T , so each RGC yielded a single firing rate value r j for each input.   Supplementary Figure 8: The distribution of the spatial and temporal centroids of kernels (left), the RF mosaics (middle), and the temporal kernel shapes (right), as the number of RGCs under optimization increases from 40 to 200, by an increment of 20.
As the number of RGCs increases, faster mosaics emerge to cover the higher-frequency region of the spatiotemporal spectrum, while the individual spatial RFs become more precise, using more number of RGCs to tile the same visual space.