## Abstract

Normative theories and statistical inference provide complementary approaches for the study of biological systems. A normative theory postulates that organisms have adapted to efficiently solve essential tasks, and proceeds to mathematically work out testable consequences of such optimality; parameters that maximize the hypothesized organismal function can be derived *ab initio*, without reference to experimental data. In contrast, statistical inference focuses on efficient utilization of data to learn model parameters, without reference to any *a priori* notion of biological function, utility, or fitness. Traditionally, these two approaches were developed independently and applied separately. Here we unify them in a coherent Bayesian framework that embeds a normative theory into a family of maximum-entropy “optimization priors.” This family defines a smooth interpolation between a data-rich inference regime (characteristic of “bottom-up” statistical models), and a data-limited *ab initio* prediction regime (characteristic of “top-down” normative theory). We argue that the flexibility afforded by our framework is essential to address a number of fundamental challenges relating to inference and prediction in complex, high-dimensional biological problems.

Ideas about optimization are at the core of how we approach biological complexity (1–3). Quantitative predictions about biological systems have been successfully derived from first principles in the context of efficient coding (4, 5), metabolic (6, 7), reaction (8, 9), and transport (10) networks, evolution (11), reinforcement learning (12), and decision making (13, 14), by postulating that a system has evolved to optimize some utility function under biophysical constraints. Normative theories generate such predictions about living systems *ab initio*, with no (or minimal) appeal to experimental data. Yet as such theories become increasingly high-dimensional and optimal solutions stop being unique, it becomes progressively harder to judge whether theoretical predictions are consistent with data (15, 16), or to define rigorously what that even means (17–19). Alternatively, data may be “close to” but not “at” optimality (20, 21), but we lack a formal framework to deal with such scenarios. Lastly, normative theories typically make non-trivial predictions only under quantitative constraints which, ultimately, must have an empirical origin, blurring the idealized distinction between a data-free normative prediction and a data-driven statistical inference.

In contrast to normative theories, which derive system parameters *ab initio*, the fundamental task of statistical inference is to reliably estimate model parameters from experimental observations. Here, too, biology has presented us with new challenges. While data is becoming increasingly high-dimensional, it is not correspondingly more plentiful; the resulting curse of dimensionality that statistical models face is controlled neither by intrinsic symmetries nor by the simplicity of disorder, as in statistical physics. To combat these issues and simultaneously deal with the noise and variability inherent to the experimental process, modern statistical methods often rely on prior assumptions about system parameters. These priors either act as statistical regularizers to prevent overfitting or capture low-level regularities such as smoothness, sparseness, or locality (22). Typically, however, their statistical structure is simple and does not reflect prior knowledge about system function.

Normative theories and inference share a fundamental similarity: they both make statements about parameters of biological systems. While these statements have traditionally been made in opposing “data regimes” (Fig. 1), we observe that the two approaches are not exclusive and could in fact be combined with mutual benefit. To this end, we develop a Bayesian statistical framework that combines data likelihood with an “optimization prior” derived from a normative theory; contrary to simple, typically applied priors, optimization priors can induce a complex statistical structure on the space of parameters. This construction allows us to rigorously formulate and answer the following key questions: (1) Can one derive a statistical hypothesis test for the consistency of data with a proposed normative theory? (2) Can one define how close data is to the proposed optimal solution? (3) How can data be used to set the constraints in, and resolve the degeneracies of, a normative theory? (4) To what extent do optimization priors aid inference in high-dimensional statistical models? We illustrate the application of these questions and the related concepts to a simple model system, and demonstrate their relevance to real-world data analysis in a more realistic scenario of neural receptive field estimation.

## Results

### Bayesian inference and optimization priors

Given a probabilistic model for a system of interest, *P*(*x*|*θ*), with parameters *θ*, and a set of *T* observations (or data) $\mathcal{D} = \{x_t\}_{t=1}^{T}$, Bayesian inference consists of formulating a (log) posterior over parameters given the data:

$$\log P(\theta \mid \mathcal{D}) = \sum_{t=1}^{T} \log P(x_t \mid \theta) + \log P(\theta) + \mathrm{const}, \tag{1}$$

where the constant term is independent of the parameters, $\sum_{t} \log P(x_t \mid \theta)$ is the log likelihood assuming independent and identically distributed observations, and *P*(*θ*) is the prior, or the postulated distribution over the parameters in the absence of any observation. Much work has focused on how the prior should be chosen to permit optimal inference, ranging from uninformative priors (23), to priors that regularize the inference and thus help models generalize to unseen data (24, 25), to priors that can coarse-grain the model depending on the number of data samples, *T* (26).
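As an illustrative numerical sketch of Eq (1) (using a hypothetical one-parameter Gaussian model in place of *P*(*x*|*θ*), not the neuron model discussed later), the log posterior is simply the sum of per-observation log likelihoods and the log prior, evaluated here on a parameter grid:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-parameter model: x ~ Normal(theta, 1), T i.i.d. observations.
T = 50
theta_true = 1.3
data = rng.normal(theta_true, 1.0, size=T)

theta_grid = np.linspace(-5, 5, 1001)

# Log likelihood: sum over observations of log P(x_t | theta).
log_lik = np.array([np.sum(-0.5 * (data - th) ** 2 - 0.5 * np.log(2 * np.pi))
                    for th in theta_grid])

# Flat prior on the grid, so log P(theta) is a constant here.
log_prior = np.full_like(theta_grid, -np.log(theta_grid[-1] - theta_grid[0]))

# Eq (1): log posterior = log likelihood + log prior + const.
log_post = log_lik + log_prior
log_post -= log_post.max()          # the additive constant is irrelevant

theta_map = theta_grid[np.argmax(log_post)]
print(theta_map)
```

With a flat prior the posterior mode coincides with the maximum-likelihood estimate; the optimization priors introduced below change exactly this ingredient.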

Our key intuition will lead us to a new class of priors fundamentally different from those considered previously. A normative theory for a system of interest with parameters *θ* can typically be formalized through a notion of an (upper-bounded) utility function, *U*(*θ*); optimality then amounts to the assumption that the real system operates at a point in parameter space, *θ**, that maximizes utility, *θ** = argmax_{θ}*U*(*θ*). Viewed in the Bayesian framework, the assertion that the system is optimal thus represents an infinitely strong prior that the parameters are concentrated at *θ**, i.e., *P*(*θ*) = *δ*(*θ* − *θ**). In this extreme case, no data is needed: the prior fixes the values of the parameters, and typically no finite amount of data will suffice for the likelihood in Eq (1) to move the posterior away from *θ**. This concentrated prior can, however, be interpreted as a limiting case of a softer prior that “prefers” solutions close to the optimum.

Consistent with the maximum entropy principle put forward by Jaynes (27), we therefore consider for our priors distributions that are as random and unstructured as possible while attaining a prescribed average utility:

$$P(\theta \mid \beta) = \frac{1}{Z(\beta)}\, e^{\beta U(\theta)}, \qquad Z(\beta) = \int d\theta\, e^{\beta U(\theta)}. \tag{2}$$

This is in fact a family of priors, parametrized by *β* ∈ [0,∞): when *β* = 0, parameters are distributed uniformly over their domain, without any structure and in the absence of any optimization; as *β* → ∞, the parameter probability localizes at the point *θ** that maximizes the utility to *U*_{max} (if such a point is unique), irrespective of whether data supports this or not. At finite *β*, however, the prior is “smeared” around *θ**, so that the average utility, $\langle U \rangle_{\beta}$, increases monotonically with *β*. For this reason, we refer to *β* as the “optimization parameter,” and to the family of priors in Eq (2) as “optimization priors.”
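To make the construction concrete, a minimal sketch (with a hypothetical one-dimensional utility standing in for *U*(*θ*)) builds the optimization prior of Eq (2) on a grid and verifies that the average utility grows monotonically with *β*:

```python
import numpy as np

# Hypothetical 1D utility standing in for U(theta); U_max = 1 at theta* = 1.
theta = np.linspace(-3, 3, 601)
U = np.exp(-(theta - 1.0) ** 2)

def optimization_prior(beta):
    """Maximum-entropy prior at fixed average utility: P(theta | beta) ∝ exp(beta * U)."""
    w = np.exp(beta * (U - U.max()))  # subtract U.max() for numerical stability
    return w / w.sum()                # discrete probabilities on the grid

# The average utility <U>_beta increases monotonically with beta ...
avg_U = [float(np.sum(optimization_prior(b) * U)) for b in (0.0, 2.0, 10.0, 50.0)]
assert all(a < b for a, b in zip(avg_U, avg_U[1:]))

# ... from the flat-prior (beta = 0) average toward U_max as beta grows.
print(avg_U)
```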

The intermediate regime, 0 < *β* < ∞, in the prior entering Eq (1) is interesting from an inference standpoint. It represents the belief that the system may be “close to” optimal with respect to the utility *U*(*θ*), but this belief is not absolute and can be outweighed by the data: the log likelihood, $\log P(\mathcal{D} \mid \theta)$, grows linearly with the number of observations, *T*, matching the roughly linear growth of the log prior with *β*. Varying *β* thus literally corresponds to the interpolation between an infinitely strong optimization prior and pure theoretical prediction in the “no data regime,” and the uniform prior and pure statistical inference in the “data-rich regime,” as schematized in Fig. 1.

In the following, we apply this framework to a toy model system, a single linear-nonlinear neuron, which is closely related to a linear classifier. This example is simple, well-understood across multiple fields, and low-dimensional so that all mathematical quantities can be constructed explicitly; the framework itself is, however, completely general. The same example is used throughout the paper to demonstrate how the ability to encode the entire shape of the utility measure into the optimization prior opens up a more refined and richer set of optimality-related statistical analyses.

### Example: Efficient coding in a simple model neuron

Let us consider a simple probabilistic model of a spiking neuron (Fig. 2A), a broadly applied paradigm in sensory neuroscience (28–32). The neuron responds to one-dimensional continuous stimuli *x*_{t} either by eliciting a spike (*r*_{t} = 1), or by remaining silent (*r*_{t} = 0). The probability of eliciting a spike in response to a particular stimulus value is determined by the nonlinear saturating stimulus-response function. The shape of this function is determined by two parameters: position *x*_{0} and slope *k* (see Methods).
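A minimal simulation of this model neuron, assuming (as is common; the exact parametrization is given in Methods) a logistic form for the saturating stimulus-response function:

```python
import numpy as np

def spike_prob(x, x0, k):
    """P(r = 1 | x): saturating stimulus-response function, position x0, slope k.
    The logistic form is an assumption; Methods define the exact parametrization."""
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

rng = np.random.default_rng(1)

# Simulate T stimulus-response pairs from one neuron.
x0_true, k_true = 0.5, 4.0
T = 200
x = rng.normal(0.0, 1.0, size=T)                                   # 1D stimuli
r = (rng.random(T) < spike_prob(x, x0_true, k_true)).astype(int)   # 0/1 spikes

# The response function crosses 1/2 at x0 and saturates away from it.
print(spike_prob(x0_true, x0_true, k_true))  # 0.5
```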

Parameters *θ* = {*x*_{0},*k*} fully determine the function of the neuron, yet remain unknown to the external observer. Statistical inference extracts parameter estimates using experimental data consisting of stimulus-response pairs (Fig. 2B, left panel), by first summarizing the data with the likelihood, $P(\mathcal{D} \mid \theta)$ (Fig. 2B, right panel), followed either by maximization of the likelihood, in the maximum-likelihood (ML) paradigm, or by deriving estimates from the posterior, Eq (1), in the Bayesian paradigm.

To apply our reasoning, we must propose a normative theory for neural function, form the optimization prior, and combine it with the likelihood in Fig. 2B, as prescribed by the Bayes’ rule in Eq (1). An influential theory in neuroscience called “efficient coding” postulates that sensory neurons maximize the amount of information about natural stimuli they encode into spikes given biophysical constraints (5, 31, 33–36). This information-theoretic optimization principle (37) has correctly predicted neural parameters such as receptive field shapes (4, 34) and the distribution of tuning curves (17, 38), as well as other quantitative properties of sensory systems (39–43), *ab initio*, from the distribution of ecologically relevant stimuli (2, 34). As such, efficient coding is also a suitable normative theory for our model neuron.

To apply efficient coding, we need to specify a distribution from which the stimuli *x*_{t} are drawn. In reality, neurons would respond to complex and high-dimensional features of sensory inputs, such as a particular combination of odorants, the timbre of a sound, or a visual texture, in order to help the animal discriminate between environmental states of very different behavioral relevance, e.g., the presence of a predator, a prey, or a mate. To capture this intuition in our simplified setup, we imagine that the stimuli *x*_{t} are drawn from a multi-modal distribution, which is a mixture over three different environmental states, labeled by *c*_{t} (Fig. 2C). Efficient coding then postulates that the neuron maximizes the mutual information, *I*(*r*_{t};*c*_{t}), between the environmental states, *c*_{t}, that gave rise to the corresponding stimuli, *x*_{t}, and the neural responses, *r*_{t}. Mutual information, which can be evaluated for any choice of parameters *k*, *x*_{0}, provides the utility function, *U*(*k*,*x*_{0}) = *I*(*r*_{t};*c*_{t}), relevant to our case. Figure 2D shows that *U* is bounded between 0 and 1 bit (since the neuron is binary), but does not have a unique maximum. Instead, there are four combinations of parameters that define four degenerate maxima, corresponding to the neuron’s nonlinearity being as steep as possible (high positive or negative *k*) and located in either of the two “valleys” in the stimulus distribution (red peaks in Fig. 2D). Moreover, the utility function forms broad ridges on the parameter surface, and small deviations from optimal points result only in weak decreases of utility. Consequently, formulating clear and unambiguous theoretical predictions is difficult, an issue that has been recurring in the analysis of real biological systems (18, 44, 45).
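The utility surface can be computed numerically. The sketch below assumes a hypothetical three-component Gaussian mixture for the environmental states and a logistic nonlinearity (both stand-ins for the exact choices in Methods); it evaluates *U*(*k*,*x*_{0}) = *I*(*r*;*c*) and confirms that a steep nonlinearity placed in a “valley” between mixture components transmits nearly 1 bit:

```python
import numpy as np

def spike_prob(x, x0, k):
    z = np.clip(k * (x - x0), -500.0, 500.0)  # clip to avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical three-state environment: stimuli are drawn from a Gaussian
# mixture whose components are indexed by the environmental state c.
p_c   = np.array([0.4, 0.35, 0.25])   # P(c)
mu_c  = np.array([-2.0, 0.0, 2.0])    # component means
sig_c = np.array([0.4, 0.4, 0.4])     # component widths

xg = np.linspace(-6, 6, 4001)         # stimulus integration grid

def utility(k, x0):
    """U(k, x0) = I(r; c) in bits for the binary neuron."""
    dens = np.exp(-0.5 * ((xg[None, :] - mu_c[:, None]) / sig_c[:, None]) ** 2)
    dens /= dens.sum(axis=1, keepdims=True)                     # P(x | c) on grid
    p1_c = (dens * spike_prob(xg, x0, k)[None, :]).sum(axis=1)  # P(r=1 | c)
    p1 = np.dot(p_c, p1_c)                                      # P(r=1)

    def h(p):                                                   # binary entropy, bits
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    return h(p1) - np.dot(p_c, h(p1_c))                         # I = H(r) - H(r|c)

# A steep threshold in a "valley" between components transmits ~1 bit;
# a nearly flat nonlinearity transmits almost nothing.
print(utility(k=50.0, x0=-1.0), utility(k=0.1, x0=0.0))
```

Since *I*(*r*;*c*) is unchanged when spikes and silences are relabeled, flipping the sign of *k* leaves the utility invariant, reproducing the ±*k* degeneracy of the maxima.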

Given the utility function, the construction of the maximum-entropy optimization prior according to Eq (2) is straightforward. Explicit examples for different values of *β* are shown in Fig. 2E (left panel); more generally, the average utility of the prior monotonically decreases as the prior becomes less localized around the optimal solutions, as measured by the entropy of the prior (Fig. 2E, right panel). This completes our setup and allows us to address the four questions posed in the Introduction.

### Question 1: Statistical test for the optimality hypothesis

Given a candidate normative theory and experimental data for a system of interest, a natural question arises: does the data support the postulated optimality? This question is nontrivial for two reasons. First, optimality theories typically do not specify a sharp boundary between optimal and nonoptimal parameters, but rather a smooth utility function *U*(*θ*) (Fig. 3A): how should the test for optimality be defined in this case? Second, a finite dataset might be insufficient to infer a precise estimate of the parameters *θ*, and will instead yield a (possibly broad) likelihood surface (Fig. 3B): how should the test for optimality be formulated in the presence of such uncertainty?

Here we devise an approach that addresses both issues. The basis of our test is the null hypothesis that the system is not optimized, i.e., that its parameters have been generated from a uniform random distribution on the biophysically accessible parameter domain. This distribution is exactly the optimization prior *P*(*θ*|*β* = 0). The alternative hypothesis states that the parameters are drawn from a distribution *P*(*θ*|*β*) with *β* > 0. To discriminate between the two hypotheses, we use a likelihood ratio test with the statistic λ, which probes the overlap of high-likelihood and high-utility parameter regions. Specifically, we define the marginal likelihood of *β* given data, $P(\mathcal{D} \mid \beta) = \int d\theta\, P(\mathcal{D} \mid \theta)\, P(\theta \mid \beta)$ (Fig. 3C), and then define λ as the log ratio between the maximal marginal likelihood, $\max_{\beta} P(\mathcal{D} \mid \beta)$, and the marginal likelihood under the null hypothesis, $P(\mathcal{D} \mid \beta = 0)$ (see Methods).
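In a low-dimensional setting the test statistic can be computed by direct numerical integration. A sketch with a hypothetical one-dimensional utility and a likelihood localized near (but not at) the optimum:

```python
import numpy as np

# 1D parameter grid, toy utility, and optimization prior (Eq 2) as stand-ins.
theta = np.linspace(-3, 3, 601)
dth = theta[1] - theta[0]
U = np.exp(-(theta - 1.0) ** 2)          # utility peaks at theta* = 1

def prior(beta):
    w = np.exp(beta * (U - U.max()))
    return w / (w.sum() * dth)           # normalized density on the grid

# Likelihood sharply localized near theta = 0.9: "close to" the optimum.
lik = np.exp(-0.5 * ((theta - 0.9) / 0.2) ** 2)

# Marginal likelihood P(D | beta) = integral of P(D | theta) P(theta | beta).
betas = np.linspace(0.0, 30.0, 301)
marg = np.array([np.sum(lik * prior(b)) * dth for b in betas])

# Test statistic: log ratio of best marginal likelihood to the null (beta = 0).
lam = np.log(marg.max()) - np.log(marg[0])
print(lam)
```

Here λ comes out positive, reflecting the overlap of the high-likelihood and high-utility regions; the null distribution of λ would then be built by repeatedly redrawing system parameters from the *β* = 0 prior and regenerating data.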

The test statistic λ has a null distribution that can be estimated by sampling (Fig. 3D), with large λ implying evidence against the null hypothesis; thus, given a significance threshold, we can declare the system to show a significant degree of optimization, or to be consistent with no optimization. This is different from asking if the system is “at” an optimum: such a narrow view seems too restrictive for complex biological systems. Evolution, for example, might not have pushed the system all the way to the biophysical optimum (e.g., due to mutational load or because the adaptation is still ongoing), or the system may be optimal under a slightly different utility function or resource constraints than those postulated by our theory (21). Instead, the proposed test asks if the system has relatively high utility, compared to the utility distribution in the full parameter space.

While principled, this hypothesis test is computationally expensive, since it entails an integration over the whole parameter space to compute the marginal likelihoods, $P(\mathcal{D} \mid \beta)$, as well as Monte Carlo sampling to generate the null distribution. The first difficulty can be resolved when the number of observations *T* is sufficient for the likelihood of the data, $P(\mathcal{D} \mid \theta)$, to be sharply localized in the parameter space; in this case the value of the utility function at the peak of the likelihood itself becomes the test statistic and the costly integration can be avoided (see Methods). The second difficulty can be resolved when we can observe many systems and collectively test them for optimality; in this case the distribution of the test statistic approaches the standard χ^{2} distribution (see Methods).

### Question 2: Inferring the degree of optimality

Hypothesis testing provides a way to resolve the question of whether the data provide evidence for system optimization or not (or to quantify this evidence with a p-value). However, statistical significance does not necessarily imply biological significance: with sufficient data, rigorous hypothesis testing can support the optimality hypothesis even if the associated utility increase is too small to be biologically relevant. Therefore, we formulate a more refined question: How strongly is the system optimized with respect to a given utility, *U*(*θ*)? Methodologically, we are asking what value of the optimization parameter, *β*, of the prior is supported by the data $\mathcal{D}$. In the standard Bayesian approach, all parameters of the prior are considered fixed before doing the inference; the prior is then combined with the likelihood to generate the posterior (Fig. 4A). Our case corresponds to a hierarchical Bayesian scenario, where *β* is itself unknown and of interest. In the previous section we chose it by maximizing the marginal likelihood, to devise a yes/no hypothesis test. Here, we consider a fully Bayesian treatment, which is particularly applicable when we observe many instances of the same system. In this case, we interpret different instances (e.g., multiple recorded neurons) as samples from a distribution determined by a single population optimality parameter *β* (Fig. 4B) that is to be estimated. Stimulus-response data from multiple neurons are then used directly to estimate a posterior over *β* via hierarchical Bayesian inference.

To explore this possibility, we generated parameters *θ*_{n} of model neurons from three different distributions: strongly optimized (*β* = 12; Fig. 4C, left panel), weakly optimized (*β* = 4; Fig. 4C, middle panel) and non-optimal (Gaussian distribution of parameters; Fig. 4C, right panel). From each neuron we obtained an experimental dataset of 100 stimulus-response pairs. Using standard hierarchical Bayesian inference we then computed the posterior distributions over the population optimality parameter, *β* (purple lines in Fig. 4D; see Methods). In each of the three cases, the posterior averages, $\langle \beta \rangle$ (Fig. 4D, dashed purple lines), closely approximated the ground-truth values.
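A simplified sketch of this hierarchical estimate (assuming, hypothetically, that each neuron's parameters are themselves sharply inferred, so the per-neuron likelihood can be replaced by a point estimate): with a flat hyperprior, the posterior over *β* is proportional to the product of optimization-prior densities evaluated at the observed parameters:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 1D utility; "neurons" are parameter draws from P(theta | beta_true).
theta = np.linspace(-3, 3, 1201)
dth = theta[1] - theta[0]
U = np.exp(-(theta - 1.0) ** 2)

def prior_density(beta):
    w = np.exp(beta * (U - U.max()))
    return w / (w.sum() * dth)

beta_true = 8.0
p = prior_density(beta_true) * dth
idx = rng.choice(len(theta), size=200, p=p / p.sum())
samples = theta[idx]                 # stand-in for sharply inferred parameters

# Flat hyperprior on beta: log posterior is the summed log prior densities.
betas = np.linspace(0.0, 30.0, 601)
db = betas[1] - betas[0]
log_post = np.array([np.sum(np.log(prior_density(b)[idx])) for b in betas])
post = np.exp(log_post - log_post.max())
post /= post.sum() * db

beta_hat = float(np.sum(betas * post) * db)   # posterior mean of beta
print(beta_hat)
```

The posterior mean recovers a value close to the generating `beta_true`, mirroring the ground-truth recovery in Fig. 4D.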

To aid the interpretation after having analyzed such putative data, the estimated population optimality parameter, $\hat{\beta}$, can be mapped into a normalized average utility using the $\langle U \rangle(\beta)$ curve (Fig. 2E), and compared to the average utility assuming no optimization, $\langle U \rangle_{\beta=0}$. As shown in Fig. 4E (purple bars), this enables us to report the optimality on a [0, 1] scale. Furthermore, it is straightforward to use posterior distributions to compute uncertainty estimates on *β* or on the average utility.

### Question 3: Data resolves ambiguous theoretical predictions

When the predictions of a normative theory are degenerate, with multiple maxima of the utility function, the biological context typically forces us to choose between two interpretations. On the one hand, we may observe multiple instances of the biological system, and each instance could be an independent realization sampled from any of the maxima: statistical analyses of optimality thus need to consider and integrate over the whole parameter space, as in the approaches described above. On the other hand, we may observe a single (e.g., evolutionary) realization of the biological system, which we hypothesize corresponds to a single optimum of the utility function. Our task is then first to identify that relevant maximum; if it exists, subsequent analyses can follow up on how well data agrees with that prediction and how surprising such an agreement might be in the face of multiple alternative maxima.

In our example, multiple values of slope and offset yield optimal or close-to-optimal neural performance, resulting in ambiguous theoretical predictions. As a simple illustration of how data can break such ambiguities, we consider three example neurons with varying degrees of optimality (Fig. 5A) and observe what their posteriors look like after seeing as few as *T* = 12 stimulus-response pairs from each neuron (Fig. 5B). All three simulated datasets reduced the uncertainty (entropy) about the neuron’s parameters by a similar amount, as reflected by the entropy and utility of the posterior versus the entropy and utility of the prior (Fig. 5C). Despite similar reductions in entropy, the resulting inferences were very different in terms of agreement with the theory. Only the posterior of the first neuron concentrated in a high-utility region of the parameter domain, thus clearly identifying one of the four peaks of the utility function as consistent with the operating regime of the simulated neuron. The two remaining posteriors are concentrated in regions of the parameter space which overlap only weakly with the prior, or where the prior probability is close to 0. To capture these qualitative differences mathematically, we define and compute the *mode entropy*, $S_{\mathrm{mode}} = -\sum_m p_m \log_2 p_m$, where $p_m$ is the probability mass assigned to mode *m* and each mode corresponds to the attraction basin of a local utility maximum. Optimality theories with degenerate maxima will allocate the prior probability relatively evenly among the modes, resulting in high mode entropy (here, 2 bits, i.e., 4 possible local maxima). A few observations of neuron 1 consistent with an optimal solution drastically collapsed this mode uncertainty and identified the single relevant utility maximum; this decrease was smaller for the slightly suboptimal neuron 2 and vanished for neuron 3 (Fig. 5D).
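Computing the mode entropy requires only a partition of the parameter grid into attraction basins. A one-dimensional sketch with two degenerate utility maxima (a hypothetical stand-in for the four-peaked utility of Fig. 2D):

```python
import numpy as np

# Toy 1D utility with two degenerate maxima (at theta = -1 and theta = +1).
theta = np.linspace(-3, 3, 1201)
U = np.exp(-(theta + 1.0) ** 2) + np.exp(-(theta - 1.0) ** 2)

def mode_entropy(density):
    """Entropy (bits) over attraction basins of local utility maxima."""
    # In 1D, basins are split at local minima of U (here, at theta = 0).
    basins = [theta < 0.0, theta >= 0.0]
    p = np.array([density[b].sum() for b in basins])
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

beta = 10.0
prior = np.exp(beta * (U - U.max()))
prior /= prior.sum()

# The degenerate prior spreads mass evenly over both modes: ~1 bit.
print(mode_entropy(prior))

# A posterior concentrated in one basin collapses mode entropy toward 0.
post = prior * np.exp(-0.5 * ((theta - 1.0) / 0.3) ** 2)
post /= post.sum()
print(mode_entropy(post))
```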

This is a very non-standard application of the Bayesian framework at small sample sizes, *T*: here, the structure of the prior (i.e., the normative theory) dominates the posterior, in what we refer to as the “data-regularized prediction” regime. We recall that our goal is to derive *ab initio* theoretical predictions, not fit parameters to reproduce the data, and the data is only used to disambiguate the prediction – to identify which utility maximum, if any, is realized. If we track the evolution of the average utility, full posterior entropy, and the mode entropy with the number of data points *T*, we clearly see the transition from such “data-regularized prediction” regime dominated by the prior normative theory, to the “theory-regularized inference” regime in the large sample limit (Fig. 5E). In the first regime, data removes the theoretical ambiguity and collapses the mode entropy with *T* < 10 samples; in the second regime, the actual parameter values (*k*,*x*_{0}) are inferred with increasing precision, as evidenced by posterior entropy that continues to decrease linearly in the log sample size (corresponding to the standard asymptotic inverse scaling of the variance in parameter estimates with the sample size).

In the “data-regularized prediction” regime, *β* also serves a novel role: when the normative theory has multiple optima with a broader spectrum of utility values, *β* determines which of the peaks are considered as nearly degenerate candidate predictions. A peak with utility *U*′ < *U*_{max} will be suppressed in the prior by ~ exp(−*β*(*U*_{max} −*U*′)), and, for sufficiently high *β*, the alternative theoretical prediction corresponding to *U*′ will be disregarded irrespective of the data. Here we showed that the ambiguities of normative theories can often be resolved in our Bayesian framework in the “data-regularized prediction” regime by a very small amount of data, which breaks the degeneracy of the theoretical predictions. This power may appear trivial at first glance, because the parameter space of our example is two-dimensional, so priors and posteriors can be evaluated explicitly and plotted across their whole domain. In more realistic cases involving tens of parameters, however, finding all (nearly) degenerate maxima of the utility function and deciding whether data is “close to” any one of them becomes a daunting task due to the curse of dimensionality. In the past, this has severely limited the application of optimality principles to complex systems with more than a few parameters (9, 31, 41), except in those rare cases where strict guarantees existed (21). In contrast, even in spaces of high dimensionality, posteriors resulting from our framework can be sampled with Monte-Carlo methods or optimized by well-developed methodology (25), with the search concentrated around the unique peak of the normative theory that is simultaneously permitted by the chosen value of *β* and consistent with the data, if such a peak exists. Intuitively, theory “proposes” possible optimal solutions *ab initio* while data “disposes” of those degenerate solutions for which there is no likelihood support.

### Question 4: Optimization priors improve inference for high-dimensional problems

In the last section we extend our toy model neuron with 2 parameters to a more realistic model with 16×16 parameters. The purpose of this exercise is two-fold: technically, we will show that an application of our framework to a realistically high-dimensional problem is feasible; formally, we will show that optimization priors can play a powerful role in regularizing such high-dimensional inference problems.

We simulated the responses of a Linear-Nonlinear-Poisson (LNP) neuron (30) to natural image stimuli (Fig. 6A). Natural image patches *x*_{t} (16 × 16 pixels each) are projected onto a linear filter *ϕ*, and the output of the filter, *s*_{t}, is transformed with a logistic nonlinearity into an average neural firing rate λ_{t}. The number of spikes elicited by the neuron, *r*_{t}, is then drawn from a Poisson distribution with mean λ_{t}. The goal of data analysis is to estimate the linear filter *ϕ* ∈ ℝ^{16×16}, which determines the sensory feature encoded by the neuron, from data consisting of stimulus-response pairs, $\{x_t, r_t\}_{t=1}^{T}$. Experimentally observed filters *ϕ* have been suggested to maximize the sparsity of responses *s*_{t} to natural stimuli. A random variable is sparse when most of its mass is concentrated around 0 at fixed variance. These experimental observations have been reflected in the normative model of sparse coding, in which maximization of sparsity has been hypothesized to be beneficial for energy efficiency, flexibility of neural representations, and noise robustness (46, 47). Filters optimized for the sparse utility *U*_{s}(*ϕ*) (see Methods) are oriented and localized in space and frequency (Fig. 6B, left panel). In contrast, filters optimized for minimal sparsity are spatially global (Fig. 6B, right panel).
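The exact form of the sparse utility *U*_{s}(*ϕ*) is given in Methods; as a stand-in, the sketch below uses a standard sparse-coding sparseness measure (negative mean log(1 + *s*²) at fixed variance, an assumption on our part) and confirms that heavy-tailed (Laplacian) responses score as sparser than Gaussian ones:

```python
import numpy as np

rng = np.random.default_rng(4)

def sparsity(s):
    """A sparseness measure at fixed variance (an assumed stand-in for U_s;
    the exact utility is defined in Methods): negative mean log(1 + s^2)."""
    s = (s - s.mean()) / s.std()   # fix variance so only the shape matters
    return float(-np.mean(np.log1p(s ** 2)))

# Filter outputs on natural images are heavy-tailed (sparse); compare a
# Laplacian stand-in against a Gaussian one at the same variance.
s_sparse = rng.laplace(0.0, 1.0, size=100_000)
s_dense  = rng.normal(0.0, 1.0, size=100_000)

print(sparsity(s_sparse), sparsity(s_dense))   # sparse scores higher
```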

To simulate a realistic experiment, we constructed a model neural population of 64 neurons with 40 filters optimized under sparse utility, and 24 generated according to a related, but different criterion (Fig. 6C; see Methods for details). We generated neural responses by exposing each model neuron to a sequence of 2000 natural image patches. Using these simulated data we then inferred the filter estimates, , using Spike Triggered Average (STA) (18, 48), which under our assumptions is equivalent to the maximum likelihood (ML) estimate (30) (see Methods). STAs computed from limited data recover noisy estimates of neural filters (Fig. 6C, right panel).
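A sketch of the STA computation for a simulated LNP neuron (using Gaussian white-noise "images" and a hypothetical 8 × 8 ground-truth filter for brevity, rather than the 16 × 16 filters and natural patches of the actual simulation):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical ground-truth filter: a small Gabor-like 8x8 patch.
yy, xx = np.mgrid[0:8, 0:8]
phi = np.exp(-((xx - 3.5) ** 2 + (yy - 3.5) ** 2) / 6.0) * np.cos(1.5 * xx)
phi = (phi / np.linalg.norm(phi)).ravel()
D = phi.size

# Simulate the LNP neuron on Gaussian white-noise stimuli.
T = 2000
X = rng.normal(0.0, 1.0, size=(T, D))
s = X @ phi
rate = 5.0 / (1.0 + np.exp(-3.0 * s))   # logistic nonlinearity, max rate 5
r = rng.poisson(rate)                   # Poisson spike counts

# Spike-triggered average: spike-count-weighted mean stimulus.
sta = (r[:, None] * X).sum(axis=0) / r.sum()
sta /= np.linalg.norm(sta)

print(float(np.dot(sta, phi)))          # alignment with ground truth
```

For white-noise stimuli the STA recovers the filter direction up to sampling noise, which is why it serves here as the ML-equivalent estimate.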

We first asked whether our inferences provide individual, neuron-by-neuron support for the optimality hypothesis, under the sparse utility, *U*_{s}(*ϕ*), as in Question 1. Given that the number of data points, *T*, is sufficient for localized filter estimates, we use the sparse utility of the inferred filter itself as a test statistic, avoiding costly marginalization of the high-dimensional likelihood. To construct the null distribution for the test, we sampled 5 · 10^{4} random filters consistent with the optimization prior *P*(*ϕ*|*β* = 0), and declared the 95th percentile of this distribution to be the optimality threshold (Fig. 6D, dashed red line). Orange and green dots in Fig. 6D denote utility values of filters marked with the corresponding colors in Fig. 6C. All STAs of filters optimized for sparsity pass our optimality hypothesis test.

We next asked whether all filters together can be used to quantify the degree of population optimality, as in Question 2. We estimated an approximate posterior over the parameter *β* via rejection sampling (see Methods), using all inferred filters in our model population (Fig. 6E, left panel, purple line). For comparison, we also computed posteriors using 64 sparsity-maximizing filters (Fig. 6E, left panel, red line), and 64 sparsity-minimizing filters (Fig. 6E, left panel, gray line). Figure 6E, right panel, shows the inferred average utility of the analyzed population, normalized relative to that of the theoretically optimal population, illustrating how the “degree of population optimality” can easily be obtained from experimental data.

Lastly, we asked whether normative theories can provide powerful priors to aid inference in high-dimensional problems. Using our sparse utility, *U*_{s}(*ϕ*), we formulated optimization priors for various values of *β* and computed maximum-a-posteriori (MAP) filter estimates from simulated data (Fig. 6F; see Methods for details). Increasing values of *β* interpolate between pure data-driven ML estimation (Fig. 6F, second column from the left) that ignores the utility, and pure utility maximization (Fig. 6F, right column) at very high *β* = 10^{3} where the predicted filters become almost completely decoupled from data; these two regimes seem to be separated by a sharp transition.
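The interpolation between the ML estimate and the pure theoretical prediction can be sketched in one dimension, with a hypothetical utility peaked at *θ** = 1 and a likelihood peaked at a noisy ML estimate:

```python
import numpy as np

# Hypothetical 1D illustration: utility peaked at theta* = 1, likelihood
# peaked at a noisy ML estimate theta_ML = 0.4.
theta = np.linspace(-3, 3, 6001)
U = np.exp(-(theta - 1.0) ** 2)
log_lik = -0.5 * ((theta - 0.4) / 0.3) ** 2

def theta_map(beta):
    """MAP estimate: argmax of log likelihood + beta * U."""
    return theta[np.argmax(log_lik + beta * U)]

# beta = 0 recovers the ML estimate; large beta recovers the theoretical
# optimum; intermediate beta interpolates between the two.
for b in (0.0, 1.0, 5.0, 1000.0):
    print(b, theta_map(b))
```

In the full filter-estimation problem the same objective is maximized over the 256-dimensional filter space, and *β* is chosen by cross-validation as described in the text.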

For intermediate *β* = 10,100, MAP filter estimates show a large improvement in estimation performance relative to the ML estimate (as quantified by Pearson correlation with the ground truth). Optimization priors achieve this boost in performance because they implicitly encode many notions about what neural filters look like (localization in space and bandwidth, orientation), which typical regularizing priors (e.g., L2 or L1 regularization of *ϕ* components) fail to capture. While specialized priors designed for receptive field estimation can capture some of these characteristics explicitly (18, 49), optimization priors grounded in the relevant normative theory represent the most succinct and complete way of summarizing our prior beliefs about receptive fields. Importantly, using an optimization prior does not imply that the neural data *must* have been generated by an optimal neuron: even if the real neuron is not optimal, the inference will benefit from the implied smoothness, localization, and orientation properties suggested by the prior (Fig. 6F, bottom row). How strongly the prior shapes the resulting inference is determined by *β*, which can be set via the standard method of cross-validation to maximize the performance of the inferred model on withheld data.

Taken together, we suggest that whenever a normative theory for a high-dimensional system exists and the task at hand is to infer the system parameters from data, a task which is typically under-determined for complex biological systems and networks, optimization priors could lead to a crucial boost in the performance of our inferences. Beyond capturing the low-level statistical expectations for the parameters (such as their smoothness or sparsity), optimization priors impose a complex structure on the parameter space and, for example, *a priori* exclude swaths of parameter space that lead to non-functioning biological systems. In this way, the statistical power of the data can be used with maximum effect in the parameter regime that is of actual biological relevance.

## Discussion

In this paper, we presented a statistical framework that unifies normative, top-down models of complex systems, which derive a system’s parameters *ab initio* from an optimization principle, with bottom-up probabilistic models, which fit a system’s parameters to data. This union of the two approaches, often applied separately, becomes straightforward in the Bayesian framework, where the normative theory enters as the prior and data enters as the likelihood. The two traditional approaches are recovered as limiting cases; more importantly, interpolation between these two limits spans a mixed regime of optimization and inference that is highly relevant for understanding complex biological systems. We illustrated its relevance by describing how (i) measurements can be used to test a given system for consistency with an optimization theory; (ii) “closeness to optimality” can be defined and inferred; (iii) degeneracies of theoretical predictions can be broken by a small amount of data; and (iv) optimization theories can provide powerful priors to aid inference in high-dimensional problems.

Our framework dovetails with other approaches which address the issues of ambiguity of theoretical predictions and model identifiability given limited data in biology. The framework of “sloppy-modelling” (50, 51), grounded in dynamical systems theory, characterizes the dimensions of the parameter space which yield qualitatively similar behavior of the system. In our framework these dimensions correspond to regions of the parameter space of equal or similar utility. Another important conceptual advance grounded in statistical inference has been to use limited data to coarse-grain probabilistic models (26, 52, 53). Here, we demonstrate that breaking degeneracies of theoretical predictions with small data samples can be seen as a related coarse-graining approach.

### Applications and extensions

In theoretical biology one is frequently confronted with a scenario where a biological system is hypothesized to be optimal (e.g., a neuron maximizes information transmission) under some quantitative constraint (e.g., a limit to the maximal firing rate, or intrinsic noise; (17, 28, 32)). When the value of the constraint is known, the prediction naturally emerges from the theory – but what if the constraint value is not known? One way to address the problem in our framework would be to consider the system parametrized by *all* parameters (including the constraints). In a pure optimization setup, the utility function reaches a nontrivial maximum in the interior of the allowed interval for some parameters, while optimization drives the others to infinity—even when that is physically impossible. In our classifier example, optimality sets the position of the nonlinearity, *x*_{0}, to a finite value, whereas it attempts to increase the slope, *k*, without bound—physically, this would imply reducing the noise in the classifier to zero. In contrast, in our framework the data will localize the otherwise-unbounded *k* value, reflecting the existence of a physical constraint in the real system. The optimization prediction will then correspond to finding the optimal *x*_{0} given the value of *k* that is supported by data. In other words, our framework can jointly infer from data the parameters that correspond to constraints while deriving the remaining parameters from the normative theory. In more realistic settings, this ability could be greatly potentiated. For example, a standard neuron model could be parametrized by hundreds of parameters (corresponding to the receptive field) plus several parameters for the nonlinearity, with essentially all parameters determined by optimality except for the nonlinearity steepness (noise constraint) and/or maximum value (maximum firing rate constraint).
Traditionally, these two values would be set manually, and optimization over the receptive field parameters would then be carried out for each value of the constraint(s) to test for a match with data. Such manual “fine-tuning” of constraints, or manual adjustment of the bounding intervals for parameters that are unconstrained by the optimality theory, to bring optimality predictions into consistency with data is clearly problematic from the statistical viewpoint: it amounts to *de facto* (over-)fitting that is not controlled for. Our framework solves such problems automatically in a single step, by reinterpreting constraints as the remaining model parameters to be *rigorously* inferred from data, formally reducing the dimensionality of the fitting problem from the number of all parameters to the number of parameters left unconstrained by optimizing the utility function. Systematically assessing the interaction between fitting and optimization within our framework is an interesting topic for future research.

Our framework provides a new approach to handle scenarios where the optimization theory formulates degenerate, nonunique predictions. A frequent solution is to postulate further constraints within the theory itself, which disambiguate the predictions (15). Our proposed mechanism for breaking the degeneracy of normative theories is different, yet complementary: using a small amount of data to localize the theoretical predictions to the relevant optimum, against which further statistical tests can be carried out. A possible extension would be to formally incorporate into the prior the knowledge that the data is, for example, drawn from at most one local optimum (whose identity is, however, unknown) of the normative theory.

We foresee additional applications of the proposed framework which are beyond the scope of this paper. For example, it is often difficult to determine which optimality criterion is plausibly implemented by the biological system of interest (17, 54, 55). Because we leverage the well-understood machinery of Bayesian inference, our framework could be used to perform model selection for the utility function that best explains the data. As an intriguing hypothesis, we also mention that our framework could be used to infer the degree of “anti-optimization” with respect to particular utility functions (if we permit *β* < 0), leading perhaps to novel (anti)optimization theories of system function.

### Outlook

Theories of biological function are currently less structured than physical theories of nonliving matter. This is partially due to inherent properties of biological systems, such as their intrinsic complexity and lack of clear symmetries. It is also partially due to the lack of theoretical approaches to systematically coarse-grain across scales and identify relevant parameters. We hope that new approaches emerging from the synthesis of established methodologies, such as statistical physics, inference, and optimality theories, can provide novel ways of tackling these fundamental issues.

## Materials and Methods

### Model neuron and mutual information utility function

A model neuron elicits a spike at time *t* (*r*_{t} = 1) with a probability given by a logistic function of the stimulus,

*P* (*r*_{t} = 1|*x*_{t}) = 1/(1 + exp(−*k*(*x*_{t} − *x*_{0}))),

where *k* is the slope and *x*_{0} the position of the nonlinearity. The stimuli *x*_{t} were distributed according to a Gaussian mixture model, *P* (*x*) = Σ_{i} *w*_{i} 𝒩(*x*; *μ*_{i}, *σ*_{i}^{2}), where *w*_{i} = 1/3 are the weights of the mixture components, *μ*_{1,2,3} = −2, 0, 2 are the means, and *σ*_{i} = 0.2 are the standard deviations.

To estimate mutual information between class labels and neural responses, we generated 5 · 10^{4} stimulus samples *x*_{t} from the stimulus distribution. Each sample was associated with a class label *c*_{t} ∈ {1, 2, 3}, corresponding to a mixture component. We created a discrete grid of logistic-nonlinearity parameters by uniformly discretizing ranges of slope *k* ∈ [−10, 10] and position *x*_{0} ∈ [−3, 3] into 128 values each. For each pair of parameters on the grid, we simulated responses of the model neuron to the stimulus dataset and estimated the mutual information directly from a joint histogram of responses *r*_{t} and class labels *c*_{t}.
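The estimation pipeline can be sketched as follows, assuming the logistic spiking nonlinearity with slope *k* and position *x*_{0}; the plug-in histogram MI estimator below is a simplified stand-in for the full 128 × 128 grid computation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Stimuli from the three-component Gaussian mixture described in the text.
c = rng.integers(0, 3, size=n)                       # class labels, w_i = 1/3
mu = np.array([-2.0, 0.0, 2.0])
x = mu[c] + 0.2 * rng.normal(size=n)

def mutual_info_bits(k, x0):
    """Plug-in MI between class label c and binary response r, in bits."""
    p_spike = 1.0 / (1.0 + np.exp(-k * (x - x0)))    # logistic nonlinearity
    r = (rng.random(n) < p_spike).astype(int)
    joint = np.array([[np.mean((c == ci) & (r == ri)) for ri in (0, 1)]
                      for ci in (0, 1, 2)])
    joint = np.clip(joint, 1e-12, None)
    pc = joint.sum(axis=1, keepdims=True)
    pr = joint.sum(axis=0, keepdims=True)
    return float(np.sum(joint * np.log2(joint / (pc * pr))))

# A steep nonlinearity placed between two mixture components is informative;
# a flat one (k = 0) is not.
mi_steep = mutual_info_bits(10.0, -1.0)
mi_flat = mutual_info_bits(0.0, 0.0)
print(f"MI(k=10, x0=-1) = {mi_steep:.2f} bits, MI(k=0) = {mi_flat:.3f} bits")
```

With a binary response, the MI is bounded by 1 bit, so a nonlinearity separating one class from the other two approaches *H*(2/3) ≈ 0.92 bits.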

### Likelihood ratio test of optimality

The proposed test uses the likelihood ratio statistic,

λ = *P* (*D*|*β*) / *P* (*D*|*β* = 0).   (4)

The null hypothesis is rejected for high values of λ. The marginal likelihood of the data depends on the overlap of the parameter likelihood and the normative prior, *P* (*D*|*β*) = ∫_{Θ} *P* (*D*|*θ*)*P* (*θ*|*β*) d*θ*, where Θ is the region of biophysically feasible parameter combinations.

The null distribution of λ is obtained by sampling in three steps: (i) sample a parameter combination *θ* from a uniform distribution on Θ, i.e., *P* (*θ*|*β* = 0); (ii) sample a data set *D* according to the likelihood *P* (*D*|*θ*); (iii) compute the test statistic λ according to Eq. (4). This computationally expensive process simplifies in the two situations described below.
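The three-step sampler can be sketched on a hypothetical 1-D toy model (Gaussian likelihood, quadratic utility, and grid-based integration over Θ are all illustrative assumptions; the statistic is computed here as a log marginal-likelihood ratio):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 1-D toy model: Theta = [0, 1] on a grid, quadratic utility,
# Gaussian likelihood with sd sigma; all forms are illustrative assumptions.
grid = np.linspace(0.0, 1.0, 201)
beta, sigma = 20.0, 0.3
U = -(grid - 0.5) ** 2
prior_opt = np.exp(beta * U); prior_opt /= prior_opt.sum()
prior_null = np.full(grid.size, 1.0 / grid.size)      # P(theta | beta = 0)

def log_marginal(data, prior):
    # log P(D|beta) = log sum_theta P(D|theta) P(theta|beta) on the grid
    ll = np.array([-0.5 * np.sum((data - t) ** 2) / sigma**2 for t in grid])
    m = ll.max()
    return m + np.log(np.sum(np.exp(ll - m) * prior))

null_lambdas = []
for _ in range(500):
    theta = rng.choice(grid, p=prior_null)            # (i) theta ~ P(theta|beta=0)
    data = theta + sigma * rng.normal(size=10)        # (ii) D ~ P(D|theta)
    lam = log_marginal(data, prior_opt) - log_marginal(data, prior_null)
    null_lambdas.append(lam)                          # (iii) the test statistic

threshold = float(np.quantile(null_lambdas, 0.95))    # reject H0 when lambda > this
print(f"5% critical value: {threshold:.2f}")
```

An observed λ above the empirical 95th percentile of `null_lambdas` would reject the null hypothesis of random (non-optimized) parameters at the 5% level.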

### Data-rich-regime simplification

In the data-rich regime, when the parameter likelihood is concentrated at a sharp peak positioned at the maximum-likelihood estimate *θ̂*, the likelihood ratio depends only on the value of the utility at *θ̂*:

λ ≈ *P* (*θ̂*|*β*) / *P* (*θ̂*|*β* = 0) ∝ exp(*β* *U* (*θ̂*)),

which is a non-decreasing function of the utility *U* (*θ̂*). Thus, this test is equivalent to a test that uses the utility estimate itself, *U* (*θ̂*), as the test statistic, making it possible to avoid the costly integration over Θ. The null distribution can then be obtained by computing *U* (*θ*) at uniformly sampled *θ*.
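A sketch of this shortcut on a hypothetical 2-D parameter space with an assumed quadratic utility: the null distribution of *U* is built from uniformly sampled *θ*, and the observed *U*(*θ̂*) is compared against it.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed quadratic utility on a hypothetical 2-D feasible region [0, 1]^2.
def U(theta):
    return -np.sum((theta - 0.5) ** 2, axis=-1)

null_U = U(rng.uniform(0.0, 1.0, size=(100_000, 2)))   # U at uniform theta
theta_ml = np.array([0.52, 0.47])                      # hypothetical sharp ML estimate
p_value = float(np.mean(null_U >= U(theta_ml)))        # one-sided test on U itself
print(f"p-value: {p_value:.3f}")
```

Because `theta_ml` sits near the utility optimum, very few random parameter settings achieve a higher utility, and the p-value is small.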

### Multiple system instances simplification

If multiple instances of the system are available and we can assume that their parameters *θ*_{1}, *θ*_{2}, …, *θ*_{N} are i.i.d. samples from the same distribution *P* (*θ*|*β*), then the datasets are also i.i.d., *P* (*D*_{1}, …, *D*_{N}|*β*) = ∏_{n} *P* (*D*_{n}|*β*). We test the hypotheses *β* = 0 vs. *β* > 0 with the likelihood ratio statistic

λ = 2 log [sup_{*β*≥0} ∏_{n} *P* (*D*_{n}|*β*) / ∏_{n} *P* (*D*_{n}|*β* = 0)].

By Wilks’ theorem, for large *N* the null distribution of λ approaches the mixture ½δ_{0} + ½χ²_{1} (a χ²_{1} distribution with a point mass of weight 1/2 at λ = 0, because we only consider *β* ≥ 0, so the null value lies on the boundary of the parameter space). This removes the need for sampling in order to obtain the null distribution.
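The resulting p-value needs no sampling. A minimal sketch, using the identity P(χ²_{1} > *x*) = erfc(√(*x*/2)) to evaluate the boundary-corrected tail probability:

```python
import math

# p-value under the asymptotic null 0.5*delta_0 + 0.5*chi2(1), which arises
# because beta = 0 sits on the boundary of the allowed range beta >= 0.
def boundary_p_value(lam):
    if lam <= 0.0:
        return 1.0
    # P(chi2_1 > lam) = erfc(sqrt(lam / 2)); only the chi2 half contributes
    return 0.5 * math.erfc(math.sqrt(lam / 2.0))

print(f"lambda = 3.84 -> p = {boundary_p_value(3.84):.4f}")
```

Note that λ = 3.84, the usual 5% critical value of χ²_{1}, here yields p ≈ 0.025 because of the halved tail weight.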

### Hierarchical inference of population optimality

Assuming that experimental datasets are i.i.d., the posterior over the population optimality parameter *β* takes the form:

*P* (*β*|*D*_{1}, …, *D*_{N}) ∝ *P* (*β*) ∏_{n=1}^{N} ∫ *P* (*D*_{n}|*θ*_{n})*P* (*θ*_{n}|*β*) d*θ*_{n},

where *θ*_{n} = (*k*_{n}, *x*_{0,n}) is the vector of neural parameters (slope and position), and *P* (*β*) is a prior over *β*. We approximated the integrals numerically via Riemann sums on a discrete parameter grid. Neural parameter values were sampled from ground-truth distributions via rejection sampling.
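The hierarchical computation can be sketched on a 1-D toy population; the quadratic utility, Gaussian observation noise, and grid-based sampling in place of rejection sampling are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# 1-D toy population: quadratic utility, Gaussian observation noise, and
# grid-based sampling in place of rejection sampling (all assumptions).
theta_grid = np.linspace(-2.0, 2.0, 401)
U = -theta_grid ** 2
beta_grid = np.linspace(0.0, 5.0, 51)
noise_var = 0.25

beta_true = 3.0
p_true = np.exp(beta_true * U); p_true /= p_true.sum()
thetas = rng.choice(theta_grid, size=20, p=p_true)        # ground-truth "neurons"
datasets = [t + 0.5 * rng.normal(size=30) for t in thetas]

log_post = np.zeros_like(beta_grid)                       # uniform prior over beta
for i, b in enumerate(beta_grid):
    prior = np.exp(b * U); prior /= prior.sum()
    for d in datasets:
        # marginal likelihood P(D_n|beta) as a sum over the theta grid
        lik = np.exp(-0.5 * np.sum((d[:, None] - theta_grid) ** 2, axis=0) / noise_var)
        log_post[i] += np.log(np.sum(lik * prior) + 1e-300)

beta_hat = float(beta_grid[np.argmax(log_post)])
print(f"MAP beta: {beta_hat:.1f} (ground truth {beta_true})")
```

With 20 simulated neurons and 30 observations each, the posterior over *β* concentrates in the vicinity of the ground-truth value.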

### Inference of receptive fields with optimality priors

We randomly sampled 16 × 16 pixel image patches from the van Hateren natural image database (56) and standardized them to zero mean and unit standard deviation. Neural responses were simulated using a Linear-Nonlinear-Poisson (LNP) model, *r*_{t} ∼ Poisson(λ_{t}), where λ_{t} is the rate parameter equal to λ_{t} = *L* σ(*ϕ*^{T}*x*_{t}), with σ a logistic (sigmoid) nonlinearity and *L* = 20 the maximal firing rate.

Given a linear filter *ϕ*, we quantified the sparsity of its responses to natural images using the sparse-utility function *U*_{s}(*ϕ*) in Eq (12).

Filter sparsity was averaged across the natural image dataset consisting of 5 · 10^{4} standardized image patches randomly drawn from the van Hateren image database. The mean and standard deviation of the filters *ϕ* were set to 0 and 1, respectively.

We generated a model population of 64 neurons. We learned 64 linear filters using logistic Independent Component Analysis (46) and sorted them according to their sparse utility. We then retained the 24 least sparse filters. The remaining 40 filters in the population were obtained by maximizing the sparse utility *U*_{s}, starting from different random initial conditions. Prior to learning, we reduced the dimensionality of the stimuli to 64 dimensions using Principal Component Analysis.

To test individual filters for optimality, we generated the null distribution of utility values by bootstrapping 10^{6} random filters as follows: (i) draw a random integer *K* between 1 and 128; (ii) superimpose *K* randomly selected principal components of natural image patches, each component multiplied by a random coefficient; (iii) generate a 2D Gaussian spatial mask centered at a random position on the image patch; the lengths of the horizontal and vertical axes of the Gaussian ellipse were drawn independently; (iv) multiply the random filter by the Gaussian mask. This procedure ensures that a range of filters of different sparsity will be randomly generated. Filters were standardized to zero mean and unit standard deviation.
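One draw of this bootstrap generator might look as follows; the random orthonormal directions standing in for the principal components of natural image patches, and the specific coefficient and axis-length distributions, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

side, dim = 16, 256
# Random orthonormal directions stand in for the 128 leading principal
# components of natural image patches (an assumption of this sketch).
pcs = np.linalg.qr(rng.normal(size=(dim, 128)))[0].T

def random_filter():
    K = rng.integers(1, 129)                          # (i) number of components
    coeffs = rng.normal(size=K)                       # (ii) random coefficients
    idx = rng.choice(128, size=K, replace=False)
    filt = (coeffs[:, None] * pcs[idx]).sum(axis=0).reshape(side, side)
    cx, cy = rng.uniform(0, side, size=2)             # (iii) random mask center
    sx, sy = rng.uniform(1.0, side / 2, size=2)       #       independent axis lengths
    yy, xx = np.mgrid[0:side, 0:side]
    mask = np.exp(-0.5 * (((xx - cx) / sx) ** 2 + ((yy - cy) / sy) ** 2))
    filt = filt * mask                                # (iv) apply the Gaussian mask
    filt -= filt.mean()                               # standardize
    return filt / filt.std()

f = random_filter()
print(f"filter shape {f.shape}, mean {f.mean():.1e}, std {f.std():.2f}")
```

Varying *K* and the mask geometry is what spreads the generated filters across a wide range of sparsity values.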

To establish a measure of optimality at a population level, we needed to simplify the integration over all receptive field parameters, which was intractable due to their high dimensionality. We therefore approximated the posteriors over *β* in Eq (9) by replacing the integral over receptive fields with an integral over their one-dimensional utility values. We approximated *P* (*U*_{s}(*θ*)|*β*) via rejection sampling, noting that *P* (*θ*|*β*) ∝ exp(*β* *U*_{s}(*θ*)), i.e., the probability of a high-dimensional receptive field is determined solely by a one-dimensional utility function.

For each *β* we randomly sampled 10^{6} filters from the proposal distribution described above and retained only those consistent with *P* (*U*_{s}(*θ*)|*β*) via rejection sampling. The obtained utility values were fitted with a Gaussian distribution, which was then used to evaluate posteriors over *β*; point estimates were posterior means, and the prior over *β* was uniform over the range displayed in the figures.
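The rejection-sampling and Gaussian-fit step can be sketched as follows; the uniform 1-D proposal standing in for the utilities of bootstrapped random filters is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(7)

beta = 5.0
# Stand-in proposal: utilities of "random filters" drawn uniformly on [-1, 0]
# (an assumption; in the paper these come from the bootstrapped filters).
proposal_U = rng.uniform(-1.0, 0.0, size=1_000_000)

# Keep each sample with probability proportional to exp(beta * U) ...
accept = rng.random(proposal_U.size) < np.exp(beta * (proposal_U - proposal_U.max()))
kept = proposal_U[accept]

# ... and summarize the accepted utilities with a Gaussian fit to P(U_s|beta).
mu, sd = float(kept.mean()), float(kept.std())
print(f"fitted P(U_s|beta): mean {mu:.3f}, sd {sd:.3f}")
```

Larger *β* concentrates the accepted utilities near the top of the proposal's range, which is exactly the shift the Gaussian fit is meant to summarize.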

To infer the receptive fields from simulated neural responses using our framework, we assumed the following optimization prior over receptive fields, derived from the sparsity utility in Eq (12):

*P* (*ϕ*_{n}|*β*) ∝ exp(*β* *U*_{s}(*z*(*ϕ*_{n}))),

where *z*(*ϕ*_{n}) denotes normalization of the receptive field to zero mean and unit variance. The sparse utility was evaluated over 10^{4} randomly sampled image patches. The resulting log-posterior combined the LNP log-likelihood with the optimization prior:

log *P* (*ϕ*|*D*, *β*) = Σ_{t} [*r*_{t} log λ_{t} − λ_{t}] + *β* *U*_{s}(*z*(*ϕ*)) + const.

MAP inference was performed via gradient ascent on the log-posterior. Receptive fields were inferred with different priors corresponding to the following values of the *β* parameter: 0, 1, 10, 100, 1000. Receptive fields were estimated after reducing the dimensionality of the stimuli with Principal Component Analysis to 64 dimensions. Estimation via gradient ascent on the log-posterior was performed in the PCA domain.
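A minimal sketch of this MAP procedure on simulated data, with an assumed logistic LNP nonlinearity and a generic quadratic penalty standing in for the sparse-utility gradient (the true *U*_{s} and its gradient are outside this sketch):

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated LNP neuron: lambda_t = L * sigmoid(phi . x_t) (assumed nonlinearity),
# with a generic quadratic penalty standing in for the sparse-utility prior.
dim, T, L, beta, lr = 64, 2000, 20.0, 1.0, 1e-4
X = rng.normal(size=(T, dim))                      # stimuli in the PCA domain
phi_true = 0.1 * rng.normal(size=dim)
r = rng.poisson(L / (1.0 + np.exp(-X @ phi_true))) # simulated spike counts

phi = np.zeros(dim)
for _ in range(200):
    lam = L / (1.0 + np.exp(-X @ phi))
    # gradient of the Poisson log-likelihood sum_t [r_t log lam_t - lam_t]
    grad_ll = X.T @ ((r / np.clip(lam, 1e-9, None) - 1.0) * lam * (1.0 - lam / L))
    grad_prior = -2.0 * phi                        # stand-in for d U_s / d phi
    phi += lr * (grad_ll + beta * grad_prior)      # gradient ascent on log-posterior

corr = float(np.corrcoef(phi, phi_true)[0, 1])
print(f"correlation with ground-truth filter: {corr:.2f}")
```

The gradient of the rate through the sigmoid, dλ/d*u* = λ(1 − λ/*L*), is what couples the spike residuals back to the filter; with an informative prior, the same update simply adds *β* times the utility gradient.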