## 1 Abstract

A cornerstone of theoretical neuroscience is the circuit model: a system of equations that captures a hypothesized neural mechanism. Such models are valuable when they give rise to an experimentally observed phenomenon – whether behavioral or in terms of neural activity – and thus can offer insights into neural computation. The operation of these circuits, like all models, critically depends on the choices of model parameters. Historically, the gold standard has been to analytically derive the relationship between model parameters and computational properties. However, this enterprise quickly becomes infeasible as biologically realistic constraints are included into the model increasing its complexity, often resulting in *ad hoc* approaches to understanding the relationship between model and computation. We bring recent machine learning techniques – the use of deep generative models for probabilistic inference – to bear on this problem, learning distributions of parameters that produce the specified properties of computation. Importantly, the techniques we introduce offer a principled means to understand the implications of model parameter choices on computational properties of interest. We motivate this methodology with a worked example analyzing sensitivity in the stomatogastric ganglion. We then use it to go beyond linear theory of neuron-type input-responsivity in a model of primary visual cortex, gain a mechanistic understanding of rapid task switching in superior colliculus models, and attribute error to connectivity properties in recurrent neural networks solving a simple mathematical task. More generally, this work suggests a departure from realism vs tractability considerations, towards the use of modern machine learning for sophisticated interrogation of biologically relevant models.

## 2 Introduction

The fundamental practice of theoretical neuroscience is to use a mathematical model to understand neural computation, whether that computation enables perception, action, or some intermediate processing [1]. A neural computation is systematized with a set of equations – the model – and these equations are motivated by biophysics, neurophysiology, and other conceptual considerations. The function of this system is governed by the choice of model parameters, which when configured in a particular way, give rise to a measurable signature of a computation. The work of analyzing a model then requires solving the inverse problem: given a computation of interest, how can we reason about these particular parameter configurations? The inverse problem is crucial for reasoning about likely parameter values, uniquenesses and degeneracies, attractor states and phase transitions, and predictions made by the model.

Consider the idealized practice: one carefully designs a model and analytically derives how model parameters govern the computation. Seminal examples of this gold standard (which often adopt approaches from statistical physics) include our field’s understanding of memory capacity in associative neural networks [2], chaos and autocorrelation timescales in random neural networks [3], the paradoxical effect [4], and decision making [5]. Unfortunately, as circuit models include more biological realism, theory via analytical derivation becomes intractable. This creates an unfavorable tradeoff. On the one hand, one may tractably analyze systems of equations with unrealistic assumptions (for example symmetry or gaussianity), mathematically formalizing how parameters affect computation in a too-simple model. On the other hand, one may choose a more biologically accurate, scientifically relevant model at the cost of *ad hoc* approaches to analysis (such as simply examining simulated activity), potentially resulting in bad inference of parameters and thus erroneous scientific predictions or conclusions.

Of course, this same tradeoff has been confronted in many scientific fields characterized by the need to do inference in complex models. In response, the machine learning community has made remarkable progress in recent years, via the use of deep neural networks as a powerful inference engine: a flexible function family that can map observed phenomena (in this case the measurable signal of some computation) back to probability distributions quantifying the likely parameter configurations. One celebrated example of this approach from machine learning, from which we draw key inspiration for this work, is the variational autoencoder [6, 7], which uses a deep neural network to induce an (approximate) posterior distribution on hidden variables in a latent variable model, given data. Indeed, these tools have been used to great success in neuroscience as well, in particular for interrogating parameters (sometimes treated as hidden states) in models of both cortical population activity [8, 9, 10, 11] and animal behavior [12, 13, 14]. These works have used deep neural networks to expand the expressivity and accuracy of statistical models of neural data [15].

However, these inference tools have not significantly influenced the study of theoretical neuroscience models, for at least three reasons. First, at a practical level, the nonlinearities and dynamics of many theoretical models are such that conventional inference tools typically produce a narrow set of insights into these models. Indeed, only in the last few years has deep learning research advanced to a point of relevance to this class of problem. Second, the object of interest from a theoretical model is not typically data itself, but rather a qualitative phenomenon – inspection of model behavior, or better, a measurable signature of some computation – an *emergent property* of the model. Third, because theoreticians work carefully to construct a model that has biological relevance, such a model as a result often does not fit cleanly into the framing of a statistical model. Technically, because many such models stipulate a noisy system of differential equations that can only be sampled or realized through forward simulation, they lack the explicit likelihood and priors central to the probabilistic modeling toolkit.

To address these three challenges, we developed an inference methodology – ‘emergent property inference’ – which learns a distribution over parameter configurations in a theoretical model. This distribution has two critical properties: *(i)* it is chosen such that draws from the distribution (parameter configurations) correspond to systems of equations that give rise to a specified emergent property (a set of constraints); and *(ii)* it is chosen to have maximum entropy given those constraints, such that we identify all likely parameters and can use the distribution to reason about parametric sensitivity and degeneracies [16]. First, we stipulate a bijective deep neural network that induces a flexible family of probability distributions over model parameterizations with a probability density we can calculate [17, 18, 19]. Second, we quantify the notion of emergent properties as a set of moment constraints on datasets generated by the model. Thus, an emergent property is not a single data realization, but a phenomenon or a feature of the model, which is ultimately the object of interest in theoretical neuroscience. Conditioning on an emergent property requires a variant of deep probabilistic inference methods, which we have previously introduced [20]. Third, because we cannot assume the theoretical model has explicit likelihood on data or the emergent property of interest, we use stochastic gradient techniques in the spirit of likelihood free variational inference [21]. Taken together, emergent property inference (EPI) provides a methodology for inferring parameter configurations consistent with a particular emergent phenomena in theoretical models. We use a classic example of parametric degeneracy in a biological system, the stomatogastric ganglion [22], to motivate and clarify the technical details of EPI.

Equipped with this methodology, we then investigate three models of current importance in theoretical neuroscience. These models were chosen to demonstrate generality through ranges of biological realism (from conductance-based biophysics to recurrent neural networks), neural system function (from pattern generation to abstract cognitive function), and network scale (from four to infinite neurons). First, we use EPI to produce a set of verifiable hypotheses of input-responsivity in a four neuron-type dynamical model of primary visual cortex; we then validate these hypotheses in the model. Second, we demonstrate how the systematic application of EPI to levels of task performance can generate experimentally testable hypotheses regarding connectivity in superior colliculus. Third, we use EPI to uncover the sources of error in a low-rank recurrent neural network executing a simple mathematical task. The novel scientific insights offered by EPI contextualize and clarify the previous studies exploring these models [23, 24, 25, 26], and more generally, these results point to the value of deep inference for the interrogation of biologically relevant models.

We note that, during our preparation and early presentation of this work [27, 28], another work has arisen with broadly similar goals: bringing statistical inference to mechanistic models of neural circuits ([29, 30, 31], preprint posted simultaneously with this preprint). We are encouraged by this general problem being recognized by others in the community, and we emphasize that these works offer complementary neuroscientific contributions (different theoretical models of focus) and use different technical methodologies (ours is built on our prior work [20], theirs similarly [32]). These distinct methodologies and scientific investigations emphasize the increased importance and timeliness of both works.

## 3 Results

### 3.1 Motivating emergent property inference of theoretical models

Consideration of the typical workflow of theoretical modeling clarifies the need for emergent property inference. First, one designs a model, or chooses an existing one, that is hypothesized to capture the computation of interest. To ground this process in a well-known example, consider the stomatogastric ganglion (STG) of crustaceans, a small neural circuit which generates multiple rhythmic muscle activation patterns for digestion [33]. Despite full knowledge of STG connectivity and a precise characterization of its rhythmic pattern generation, biophysical models of the STG have complicated relationships between circuit parameters and neural activity [22, 34]. A model of the STG [23] is shown schematically in Figure 1A; the behavior of this model depends critically on its parameterization – the choices of conductance parameters *z* = [*g*_{el}, *g*_{synA}]. Specifically, the two fast neurons (*f*1 and *f*2) mutually inhibit one another, and oscillate at a faster frequency than the mutually inhibiting slow neurons (*s*1 and *s*2). The hub neuron (hub) couples with either the fast or slow population or both.

Second, once the model is selected, one defines the emergent property, the measurable signal of scientific interest. To continue our running STG example, one such emergent property is the phenomenon of *network syncing* – in certain parameter regimes, the frequency of the hub neuron matches that of the fast and slow populations at an intermediate frequency. This emergent property is shown in Figure 1A at a frequency of 0.53Hz.

Third, qualitative parameter analysis ensues: since precise mathematical analysis is intractable in this model, a brute force sweep of parameters is done [23]. Subsequently, a qualitative description is formulated to describe the different parameter configurations that lead to the emergent property. In this last step lies the opportunity for a precise quantification of the emergent property as a statistical feature of the model. Once we have such a methodology, we can infer a probability distribution over parameter configurations that produce this emergent property.

Before presenting technical details (in the following section), let us understand emergent property inference schematically: EPI (Fig. 1A gray box) takes, as input, the model and the specified emergent property, and as its output, produces the parameter distribution shown in Figure 1B. This distribution – represented for clarity as samples from the distribution – is then a scientifically meaningful and mathematically tractable object. In the STG model, this distribution can be specifically queried to reveal the prototypical parameter configuration for network syncing (the mode; Figure 1B yellow star), and how network syncing decays based on changes away from the mode. The eigenvectors (of the Hessian of the distribution at the mode) quantitatively formalize the robustness of network syncing (Fig. 1B solid (*v*_{1}) and dashed (*v*_{2}) black arrows). Indeed, samples equidistant from the mode along these EPI-identified dimensions of sensitivity (*v*_{1}) and degeneracy (*v*_{2}) agree with error contours (Fig. 1B, contours) and have diminished or preserved network syncing, respectively (Figure 1B inset and activity traces) (see Section 5.2.1).
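As a minimal sketch of this kind of query, the following Python snippet uses a hypothetical two-dimensional Gaussian log density as a stand-in for the learned EPI distribution. The precision matrix, the mode, and all function names are illustrative assumptions, not values from the STG analysis. It estimates the Hessian at the mode by finite differences and reads off a direction of sensitivity and a direction of degeneracy from its eigenvectors.

```python
import numpy as np

# Hypothetical stand-in for a learned distribution: a 2-D Gaussian over
# z = [g_el, g_synA]. These numbers are assumptions for illustration only.
PRECISION = np.array([[4.0, 1.5],
                      [1.5, 1.0]])
MODE = np.array([4.5, 3.0])


def log_q(z):
    """Unnormalized log density of the stand-in distribution."""
    d = z - MODE
    return -0.5 * d @ PRECISION @ d


def numerical_hessian(f, z, eps=1e-3):
    """Central finite-difference Hessian of a scalar function f at z."""
    k = len(z)
    hess = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            e_i = np.zeros(k)
            e_j = np.zeros(k)
            e_i[i] = eps
            e_j[j] = eps
            hess[i, j] = (f(z + e_i + e_j) - f(z + e_i - e_j)
                          - f(z - e_i + e_j) + f(z - e_i - e_j)) / (4 * eps**2)
    return hess


hessian = numerical_hessian(log_q, MODE)
# eigh returns eigenvalues in ascending order; at a mode they are all negative.
eigvals, eigvecs = np.linalg.eigh(hessian)
v_sensitive = eigvecs[:, 0]    # most negative curvature: density decays fastest
v_degenerate = eigvecs[:, -1]  # curvature closest to zero: density decays slowest
```

Perturbing the mode along `v_sensitive` lowers the log density faster than an equally sized perturbation along `v_degenerate`, which is the sense in which the two directions formalize sensitivity and degeneracy.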

### 3.2 A deep generative modeling approach to emergent property inference

Emergent property inference (EPI) systematizes the three-step procedure of the previous section. First, we consider the model as a coupled set of differential (and potentially stochastic) equations [23]. In the running STG example, the model activity *x* = [*x*_{f1}, *x*_{f2}, *x*_{hub}, *x*_{s1}, *x*_{s2}] is the membrane potential for each neuron, which evolves according to the biophysical conductance-based equation:

$$C_m \frac{dx}{dt} = -\left[h_{leak}(x; z) + h_{Ca}(x; z) + h_{K}(x; z) + h_{hyp}(x; z) + h_{elec}(x; z) + h_{syn}(x; z)\right] \tag{1}$$

where *C*_{m} = 1 nF, and *h*_{leak}, *h*_{Ca}, *h*_{K}, *h*_{hyp}, *h*_{elec}, and *h*_{syn} are the leak, calcium, potassium, hyperpolarization, electrical, and synaptic currents, all of which have their own complicated dependence on *x* and *z* = [*g*_{el}, *g*_{synA}] (see Section 5.2.1).

Second, we define the emergent property, which as above is network syncing: oscillation of the entire population at an intermediate frequency of our choosing (Figure 1A bottom). Quantifying this phenomenon is straightforward: we define network syncing to be that each neuron’s spiking frequency – denoted *ω*_{f1}(*x*), *ω*_{f2}(*x*), etc. – is close to an intermediate frequency of 0.53Hz. Mathematically, we achieve this via constraints on the mean and variance of *ω*_{α}(*x*) for each neuron *α* ∈ {f1, f2, hub, s1, s2}:
which completes the quantification of the emergent property.
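To make the emergent property statistics concrete, here is a minimal Python sketch that estimates a dominant oscillation frequency for each neuron and then summarizes the frequencies by their mean and variance. The FFT-based estimator, the synthetic sinusoidal traces, and the names `dominant_frequency` and `traces` are assumptions made for illustration; they are not the frequency measure or the simulated STG activity used in this work.

```python
import numpy as np


def dominant_frequency(x, dt):
    """Frequency (Hz) of the largest peak in the power spectrum of x."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                       # remove the DC component
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=dt)
    return freqs[np.argmax(power)]


dt = 0.01                                  # seconds per sample (assumed)
t = np.arange(10000) * dt                  # 100 s of synthetic activity
neurons = ["f1", "f2", "hub", "s1", "s2"]

# Synthetic stand-in traces: every neuron oscillates at 0.53 Hz with its own phase.
traces = {name: np.sin(2 * np.pi * 0.53 * t + 0.3 * i)
          for i, name in enumerate(neurons)}

omegas = np.array([dominant_frequency(traces[name], dt) for name in neurons])
omega_mean = omegas.mean()                 # compared against the target frequency
omega_var = omegas.var()                   # compared against the allowed variance
```

In EPI the expectations are taken over parameters drawn from the inferred distribution and over model simulations, so in practice such statistics would be averaged across many simulated parameter configurations rather than across one set of traces.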

Third, we perform emergent property inference: we find a distribution over parameter configurations *z*, and insist that samples from this distribution produce the emergent property; in other words, they obey the constraints introduced in Equation 2. This distribution will be chosen from a family of probability distributions $\mathcal{Q}$, defined by a deep generative distribution of the normalizing flow class [17, 18, 19] – neural networks which transform a simple distribution into a suitably complicated distribution (as is needed here). This deep distribution is represented in Figure 1C (see Section 5.1). Then, mathematically, we must solve the following optimization program:

$$\underset{q_\theta \in \mathcal{Q}}{\mathrm{argmax}} \; H(q_\theta(z)) \quad \text{s.t.} \quad \mathbb{E}_{z \sim q_\theta}\left[\mathbb{E}_{x \sim p(x \mid z)}\left[T(x)\right]\right] = \mu \tag{3}$$

where *T*(*x*) and *μ* are defined as in Equation 2, and *p*(*x*|*z*) is the intractable distribution of data from the model, *x*, given that model’s parameters *z* (we access samples from this distribution by running the model forward). The purpose of each element in this program is detailed in Figure 1D. Finally, we recognize that many distributions in $\mathcal{Q}$ will respect the emergent property constraints, so we require a normative principle to select amongst them. This principle is captured in Equation 3 by the primal objective *H*. Here we chose Shannon entropy as a means to find parameter distributions with minimal assumptions beyond some chosen structure [35, 36, 20, 37], but we emphasize that the EPI methodology is unaffected by this choice (although the results of course depend on the primal objective chosen).
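The two ingredients a normalizing flow provides for this program, samples and a calculable density, can be illustrated with the simplest possible bijection. The sketch below uses a single affine map in place of a deep network, so it is a deliberately reduced assumption rather than the architecture used here; the names `affine_flow`, `shift`, and `L` are hypothetical. It shows the change-of-variables density and a Monte Carlo estimate of the entropy *H*.

```python
import numpy as np


def affine_flow(eps, shift, L):
    """Map base samples eps ~ N(0, I) through z = shift + L @ eps.

    L is lower triangular with a positive diagonal, so the map is bijective
    and log|det(dz/deps)| is the sum of the log diagonal entries.
    Returns the transformed samples z and their log density log q(z).
    """
    z = shift + eps @ L.T
    dim = eps.shape[1]
    log_base = -0.5 * np.sum(eps**2, axis=1) - 0.5 * dim * np.log(2 * np.pi)
    log_det = np.sum(np.log(np.diag(L)))
    return z, log_base - log_det           # change of variables


rng = np.random.default_rng(0)
shift = np.array([1.0, -2.0])              # illustrative flow parameters
L = np.array([[1.0, 0.0],
              [0.5, 2.0]])

eps = rng.standard_normal((5000, 2))
z, log_q = affine_flow(eps, shift, L)

entropy_mc = -np.mean(log_q)               # Monte Carlo estimate of H(q)
# Closed-form entropy of the Gaussian N(shift, L L^T), for comparison.
entropy_exact = 0.5 * 2 * (1 + np.log(2 * np.pi)) + np.sum(np.log(np.diag(L)))
```

A deep flow composes many such invertible layers with learned parameters *θ*; the entropy estimate is computed from samples in the same way, which is what makes the primal objective in Equation 3 amenable to stochastic gradient methods.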

EPI optimizes the weights and biases *θ* of the deep neural network (which induces the probability distribution) by iteratively solving Equation 3. The optimization is complete when the sampled models with parameters *z* ~ *q*_{θ} produce activity consistent with the specified emergent property (Fig. S4). Such convergence is evaluated with a hypothesis test that the mean of each emergent property statistic is not different than its emergent property value (see Section 5.1.2). Further validation of EPI is available in the supplementary materials, where we analyze a simpler model for which ground-truth statements can be made (Section 5.1.1). In relation to broader methodology, inspection of the EPI objective reveals a natural relationship to posterior inference. Specifically, EPI executes variational inference in an exponential family model, the sufficient statistics and mean parameter of which are defined by the emergent property statistics and values, respectively (see Section 5.1.4). Equipped with this method, we now prove out the value of EPI by using it to investigate and produce novel insights about three prominent models in neuroscience.
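The convergence criterion can be pictured with a simple stand-in. The sketch below applies a two-sided one-sample z-test to the sampled values of one emergent property statistic; this particular test, the significance level, and the names `mean_matches_target` and `alpha` are assumptions for illustration and may differ from the procedure described in Section 5.1.2.

```python
import math

import numpy as np


def mean_matches_target(samples, target, alpha=0.05):
    """Two-sided z-test of whether the sample mean differs from target.

    Returns the p-value and True when the null hypothesis (the mean equals
    the target) is not rejected at level alpha. Assumes the samples are
    not all identical, so the standard error is nonzero.
    """
    samples = np.asarray(samples, dtype=float)
    sem = samples.std(ddof=1) / math.sqrt(len(samples))
    z_score = (samples.mean() - target) / sem
    p_value = math.erfc(abs(z_score) / math.sqrt(2.0))
    return p_value, p_value >= alpha


# Deterministic illustrative "statistics": frequencies spread around 0.53 Hz.
on_target = 0.53 + 0.025 * np.linspace(-1.0, 1.0, 201)
off_target = on_target + 0.05

p_on, ok_on = mean_matches_target(on_target, target=0.53)
p_off, ok_off = mean_matches_target(off_target, target=0.53)
```

Under this stand-in, optimization would be declared complete only when every emergent property statistic passes such a check.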

### 3.3 Comprehensive input-responsivity in a nonlinear sensory system

Dynamical models of excitatory (E) and inhibitory (I) populations with supralinear input-output function have succeeded in explaining a host of experimentally documented phenomena. In a regime characterized by inhibitory stabilization of strong recurrent excitation, these models give rise to paradoxical responses [4], selective amplification [38], surround suppression [39] and normalization [40]. Despite their strong predictive power, E-I circuit models rely on the assumption that inhibition can be studied as an indivisible unit. However, experimental evidence shows that inhibition is composed of distinct elements – parvalbumin (P), somatostatin (S), VIP (V) – composing 80% of GABAergic interneurons in V1 [41, 42, 43], and that these inhibitory cell types follow specific connectivity patterns (Fig. 2A) [44]. Recent theoretical advances [24, 45, 46] have only started to address the consequences of this multiplicity in the dynamics of V1, relying strongly on linear theoretical tools. Here, we go beyond linear theory by systematically generating and evaluating hypotheses of circuit model function using EPI distributions of neuron-type inputs producing various neuron-type population responses.

Specifically, we consider a four-dimensional circuit model with dynamical state given by the firing rate *x* of each neuron-type population *x* = [*x*_{E}, *x*_{P}, *x*_{S}, *x*_{V}]^{⊤}. Given a time constant of *τ* = 20 ms and a power *n* = 2, the dynamics are driven by the rectified and exponentiated sum of recurrent (*Wx*) and external (*h*) inputs:

$$\tau \frac{dx}{dt} = -x + \left[Wx + h\right]_+^n$$

The effective connectivity weights *W* were obtained from experimental recordings of publicly available datasets of mouse V1 [47, 48] (see Section 5.2.2). The input *h* = *b* + *dh* is comprised of a baseline input *b* = [*b*_{E}, *b*_{P}, *b*_{S}, *b*_{V}]^{⊤} and a differential input *dh* = [*dh*_{E}, *dh*_{P}, *dh*_{S}, *dh*_{V}]^{⊤} to each neuron-type population. Throughout subsequent analyses, the baseline input is *b* = [1, 1, 1, 1]^{⊤}.

With this model, we are interested in the differential responses of each neuron-type population to changes in input *dh*. Initially, we studied the linearized response of the system to input at the steady state response *x*_{ss}, i.e. a fixed point. All analyses of this model consider the steady state response, so we drop the notation *ss* from here on. While this linearization accurately predicts differential responses *dx* = [*dx*_{E}, *dx*_{P}, *dx*_{S}, *dx*_{V}]^{⊤} for small differential inputs to each population *dh* = [0.1, 0.1, 0.1, 0.1]^{⊤} (Fig. 2B left), the linearization is a poor predictor in this nonlinear model more generally (Fig. 2B right). Currently available approaches to deriving the steady state response of the system are limited.
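The comparison between linearized and full nonlinear responses can be sketched in a few lines of Python. The connectivity matrix below is a small hypothetical placeholder chosen only to give a stable fixed point; it is not the effective connectivity estimated from mouse V1 data, and the function names are likewise illustrative. The dynamics follow the rectified power-law rate equation described above, integrated with forward Euler, and the linear prediction comes from differentiating the fixed-point condition *x* = [*Wx* + *h*]_{+}^{n}.

```python
import numpy as np

TAU = 0.020   # time constant of 20 ms (from the text)
N_POW = 2     # power n = 2 (from the text)

# Hypothetical placeholder weights (rows: target E, P, S, V; columns: source).
W = np.array([[0.10, -0.10, -0.10,  0.00],
              [0.10, -0.10, -0.05,  0.00],
              [0.10,  0.00,  0.00, -0.10],
              [0.10, -0.05, -0.10,  0.00]])


def steady_state(h, W=W, dt=1e-3, n_steps=2000):
    """Forward-Euler integration of tau dx/dt = -x + [W x + h]_+^n from x = 0."""
    x = np.zeros(len(h))
    for _ in range(n_steps):
        drive = np.maximum(W @ x + h, 0.0) ** N_POW
        x = x + (dt / TAU) * (-x + drive)
    return x


def linear_response(h, dh, W=W):
    """First-order prediction of the change in steady state for input change dh.

    Differentiating x = [W x + h]_+^n gives dx = D (W dx + dh), with
    D = diag(n * u^(n-1)) and u = W x + h evaluated at the fixed point.
    """
    x = steady_state(h, W)
    u = np.maximum(W @ x + h, 0.0)
    D = np.diag(N_POW * u ** (N_POW - 1))
    return np.linalg.solve(np.eye(len(h)) - D @ W, D @ dh)


b = np.ones(4)                       # baseline input (from the text)
x_b = steady_state(b)

dh_small = 0.01 * np.ones(4)
dx_nonlinear = steady_state(b + dh_small) - x_b
dx_linear = linear_response(b, dh_small)

dh_large = 1.0 * np.ones(4)
gap_large = np.abs(steady_state(b + dh_large) - x_b - linear_response(b, dh_large))
```

With this placeholder network the two predictions agree closely for the small input, while `gap_large` shows the discrepancy growing for the larger one, mirroring the qualitative point made about Figure 2B.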

To get a more comprehensive picture of the input-responsivity of each neuron-type beyond linear theory, we used EPI to learn a distribution of the differential inputs to each population *dh* that produce an increase of *y* in the rate of each neuron-type population *α* ∈ {*E, P, S, V*}. We want to know the differential inputs *dh* that result in a differential steady state *dx*_{α} (the change in *x*_{α} when receiving input *h* = *b* + *dh* with respect to the baseline *h* = *b*) of value *y* with some small, arbitrarily chosen amount of variance 0.01^{2}. These statements amount to the emergent property

$$\mathcal{B}(\alpha, y) \triangleq \left\{ \mathbb{E}\left[dx_\alpha\right] = y, \;\; \mathrm{Var}\left[dx_\alpha\right] = 0.01^2 \right\}$$

We maintain the notation $\mathcal{B}$ throughout the rest of the study as shorthand for emergent property, which represents a different signature of computation in each application.

Using EPI, we inferred the distribution of *dh* shown in Figure 2C producing $\mathcal{B}(\alpha, y)$. Columns correspond to inferred distributions of excitatory (*α* = *E*, red), parvalbumin (*α* = *P*, blue), somatostatin (*α* = *S*, red) and VIP (*α* = *V*, green) neuron-type response increases, while each row corresponds to increase amounts of *y* ∈ {0.1, 0.5}. For each pair of parameters, we show the two-dimensional marginal distribution of samples colored by $\log q_\theta(dh \mid \mathcal{B}(\alpha, y))$. The inferred distributions immediately suggest four hypotheses:

H1: as is intuitive, each neuron-type’s firing rate should be sensitive to that neuron-type’s direct input (e.g. Fig. 2C H1 gray boxes indicate low variance in *dh*_{E} when *α* = *E*; the same observation holds in all inferred distributions);

H2: the E- and P-populations should be largely unaffected by input to the V-population (Fig. 2C H2 gray boxes indicate high variance in *dh*_{V} when *α* ∈ {*E, P*});

H3: the S-population should be largely unaffected by input to the P-population (Fig. 2C H3 gray boxes indicate high variance in *dh*_{P} when *α* = *S*);

H4: there should be a nonmonotonic response of the V-population with input to the E-population (Fig. 2C H4 gray boxes indicate that negative *dh*_{E} should result in small *dx*_{V}, but positive *dh*_{E} should elicit a larger *dx*_{V}).

We evaluate these hypotheses by taking perturbations in individual neuron-type input *δh*_{α} away from the modes *dh** of the inferred distributions at *y* = 0.1.

Here *δx*_{α} is the change in steady state response of the system with input *h* = *b* + *dh** + *δh*_{α}*û*_{α} compared to *h* = *b* + *dh**, where *û*_{α} is a unit vector in the dimension of *α*. The EPI-generated hypotheses are confirmed (for details, see Section 5.2.2):

H1: the neuron-type responses are sensitive to their direct inputs (Fig. 3A black, 3B blue, 3C red, 3D green);

H2: the E- and P-populations are not affected by *δh*_{V} (Fig. 3A green, 3B green);

H3: the S-population is not affected by *δh*_{P} (Fig. 3C blue);

H4: the V-population exhibits a nonmonotonic response to *δh*_{E} (Fig. 3D black), and is in fact the only population to do so (Fig. 3A-C black).

These hypotheses were in stark contrast to what was available to us via traditional analytical linear prediction (Fig. 2C, magenta, see Section 5.2.2). To this point, we have shown the utility of EPI on relatively low-level emergent properties like network syncing and differential neuron-type population responses. In the remainder of the study, we focus on using EPI to understand models of more abstract cognitive function.

### 3.4 Identifying neural mechanisms of flexible task switching

In a rapid task switching experiment [49], rats were explicitly cued on each trial to either orient towards a visual stimulus in the Pro (P) task or orient away from a visual stimulus in the Anti (A) task (Fig. 4A). Neural recordings in the midbrain superior colliculus (SC) exhibited two populations of neurons that simultaneously represented both task context (Pro or Anti) and motor response (contralateral or ipsilateral to the recorded side): the Pro/Contra and Anti/Ipsi neurons [25]. Duan et al. proposed a model of SC that, like the V1 model analyzed in the previous section, is a four-population dynamical system. We analyzed this model, in which the neuron-type populations are functionally defined as the Pro- and Anti-populations in each hemisphere (left (L) and right (R)) and their connectivity is parameterized geometrically (Fig. 4B). The input-output function of this model is chosen such that the population responses *x* = [*x*_{LP}, *x*_{LA}, *x*_{RP}, *x*_{RA}]^{⊤} are bounded from 0 to 1, giving rise to high (1) or low (0) responses at the end of the trial:
where *ϵ* = 0.05 and *ζ* = 0.5. The dynamics evolve with timescale *τ* = 0.09 via an internal variable *u* governed by connectivity weights *W*
with Gaussian noise of variance *σ*^{2} = 1. The input *h* is comprised of a cue-dependent input to the Pro or Anti populations, a stimulus orientation input to either the Left or Right populations, and a choice-period input to the entire network (see Section 5.2.3). Here, we use EPI to determine the changes in network connectivity *z* = [*sW*_{P}, *sW*_{A}, *vW*_{PA}, *vW*_{AP}, *dW*_{PA}, *dW*_{AP}, *hW*_{P}, *hW*_{A}] resulting in greater levels of rapid task switching accuracy.

To quantify the emergent property of rapid task switching at various levels of accuracy, we considered the requirements of this model in this behavioral paradigm. At the end of successful trials, the response of the Pro population in the hemisphere of the correct choice must have a value near 1, while the Pro population in the opposite hemisphere must have a value near 0. Constraining a population response *x*_{α} ∈ [0, 1] to be either 0 or 1 can be achieved by requiring that it has Bernoulli variance (see Section 5.2.3). Thus, we can formulate rapid task switching at a level of accuracy *p* ∈ [0, 1] in both tasks in terms of the average steady response of the Pro population of the correct choice, the error in Bernoulli variance of that Pro neuron, and the average difference in Pro neuron responses *d* in both Pro and Anti trials:

Thus, this emergent property denotes Bernoulli, winner-take-all responses between Pro neurons in a model executing rapid task switching near accuracy level *p*.
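The role of the Bernoulli variance constraint can be checked with a short worked example. A response bounded in [0, 1] with mean *p* has variance at most *p*(1 − *p*), and it attains that value only when every response is exactly 0 or 1. The Python sketch below computes the gap between the empirical variance and the Bernoulli variance for two hypothetical sets of end-of-trial responses; the function name and the example values are illustrative assumptions.

```python
import numpy as np


def bernoulli_variance_error(x):
    """Empirical variance of x minus the Bernoulli variance p_hat * (1 - p_hat).

    For responses in [0, 1] this is zero only when all responses are 0 or 1,
    and negative whenever some responses take intermediate values.
    """
    x = np.asarray(x, dtype=float)
    p_hat = x.mean()
    return x.var() - p_hat * (1.0 - p_hat)


# Both response sets have mean 0.75, i.e. the same nominal accuracy.
binary_responses = [1.0, 1.0, 1.0, 0.0]      # winner-take-all outcomes
graded_responses = [0.75, 0.75, 0.75, 0.75]  # intermediate responses on every trial

err_binary = bernoulli_variance_error(binary_responses)
err_graded = bernoulli_variance_error(graded_responses)
```

Both response sets have the same mean, but only the binary one has zero error, so driving this error to zero excludes networks that reach the target average with graded responses.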

We used EPI to learn distributions of the SC weight matrix parameters *z* conditioned on emergent properties at various levels of rapid task switching accuracy for *p* ∈ {50%, 60%, 70%, 80%, 90%}. To make sense of these inferred distributions, we followed the approach of Duan et al. by decomposing the connectivity matrix *W* = *V*Λ*V*^{−1} in such a way (the Schur decomposition) that the basis vectors *v*_{i} are the same for all *W* (Fig. 4C). These basis vectors have intuitive roles in processing for this task, and are accordingly named the *all* mode (all neurons co-fluctuate), the *side* mode (one side dominates the other), the *task* mode (the Pro or Anti populations dominate the other), and the *diag* mode (Pro- and Anti-populations of opposite hemispheres dominate the opposite pair). The corresponding eigenvalues (e.g. *λ*_{task}, which change according to *W*) indicate the degree to which activity along that mode is increased or decreased by *W*.
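To illustrate why a shared set of basis vectors can exist, the sketch below considers a simplified, fully symmetric connectivity with only four free weights (self, vertical, horizontal, diagonal). This reduction and the names `symmetric_sc_connectivity` and `MODES` are assumptions for illustration, not the eight-parameter model analyzed here. For any such matrix, with populations ordered [LP, LA, RP, RA], the four named modes are exact eigenvectors, and only the eigenvalues change with the weights.

```python
import numpy as np

# Mode vectors for populations ordered [LP, LA, RP, RA].
MODES = {
    "all":  np.array([1.0,  1.0,  1.0,  1.0]),   # all populations co-fluctuate
    "side": np.array([1.0,  1.0, -1.0, -1.0]),   # left versus right
    "task": np.array([1.0, -1.0,  1.0, -1.0]),   # Pro versus Anti
    "diag": np.array([1.0, -1.0, -1.0,  1.0]),   # LP and RA versus LA and RP
}


def symmetric_sc_connectivity(s, v, h, d):
    """Hypothetical symmetric connectivity.

    s: self weight, v: same hemisphere (Pro <-> Anti),
    h: same type across hemispheres, d: opposite type across hemispheres.
    """
    return np.array([[s, v, h, d],
                     [v, s, d, h],
                     [h, d, s, v],
                     [d, h, v, s]])


def mode_eigenvalues(W):
    """Rayleigh quotient of W along each named mode."""
    return {name: float(vec @ W @ vec) / float(vec @ vec)
            for name, vec in MODES.items()}


W = symmetric_sc_connectivity(s=0.5, v=-0.4, h=-0.8, d=0.1)  # illustrative values
eigenvalues = mode_eigenvalues(W)
# In this symmetric case: all = s+v+h+d, side = s+v-h-d,
# task = s-v+h-d, diag = s-v-h+d.
```

Changing the four weights changes `eigenvalues` but leaves the mode vectors untouched, which is the property that lets distributions over connectivity be summarized as distributions over a handful of eigenvalues.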

We found that for greater task accuracies, the task mode eigenvalue increases, indicating the importance of *W* to the task representation (Fig. 4D, purple; adjacent distributions from 60% to 90% have *p* < 10^{−4}, Mann-Whitney test with 50 estimates and 100 samples). Stepping from random chance (50%) networks to marginally task-performing (60%) networks, there is a marked decrease of the side mode eigenvalues (Fig. 4D, orange; *p* < 10^{−4}). Such side mode suppression relative to 50% remains in the models achieving greater accuracy, revealing its importance towards task performance. There were no interesting trends with task accuracy in the all or diag mode (hence not shown in Fig. 4). Importantly, we can conclude from our methodology that side mode suppression in *W* allows rapid task switching, and that greater task-mode representations in *W* increase accuracy. These hypotheses are confirmed by forward simulation of the SC model (Fig. 4E, see Section 5.2.3) suggesting experimentally testable predictions: increase in rapid task switching performance should be correlated with changes in effective connectivity corresponding to an increase in task mode and decrease in side mode eigenvalues.

### 3.5 Linking RNN connectivity to error

So far, each model we have studied was designed from fundamental biophysical principles or from genetically- or functionally-defined neuron types. At a more abstract level of modeling, recurrent neural networks (RNNs) are high-dimensional dynamical models of computation that are becoming increasingly popular in neuroscience research [50]. In theoretical neuroscience, RNN dynamics usually follow the equation

$$\frac{dx}{dt} = -x + W\phi(x) + h$$

where *x* is the network activity, *W* is the network connectivity, *ϕ*(·) = tanh(·), and *h* is the input to the system. Such RNNs are trained to do a task from a systems neuroscience experiment, and then the unit activations of the trained RNN are compared to recorded neural activity. Fully-connected RNNs with tens of thousands of parameters are challenging to characterize [51], and statistical inference about their parameterization is especially difficult. Alternatively, we considered a rank-1, *N*-neuron RNN with connectivity consisting of the sum of a random and a structured component:

$$W = g\chi + \frac{1}{N} m n^\top$$

The random component *gχ* has strength *g*, and the random component weights *χ*_{ij} are Gaussian distributed. The structured component has entries of *m* and *n* drawn from Gaussian distributions with means *M*_{m} and *M*_{n}, respectively. Recent theoretical work derives the low-dimensional response properties of low-rank networks from statistical parameterizations of their connectivity, such as *z* = [*g, M*_{m}, *M*_{n}] [26]. We used EPI to infer the parameterizations of rank-1 RNNs solving an example task, enabling discovery of properties of connectivity that result in different types of error in the computation.
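A finite-size realization of such a network can be sampled directly from the statistical parameters *z* = [*g*, *M*_{m}, *M*_{n}]. The sketch below assumes the common convention in which the random weights have variance 1/*N*, the structured component is the outer product of *m* and *n* scaled by 1/*N*, and the entries of *m* and *n* have unit variance; these scalings, the function name, and the example parameter values are assumptions for illustration.

```python
import numpy as np


def sample_rank1_connectivity(g, M_m, M_n, N=200, seed=0):
    """Sample W = g * chi + (1/N) * outer(m, n) for an N-neuron network.

    Assumed conventions: chi_ij ~ N(0, 1/N), m_i ~ N(M_m, 1), n_i ~ N(M_n, 1).
    Returns the full connectivity and its structured (rank-1) part.
    """
    rng = np.random.default_rng(seed)
    chi = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))
    m = rng.normal(M_m, 1.0, size=N)
    n = rng.normal(M_n, 1.0, size=N)
    structured = np.outer(m, n) / N
    return g * chi + structured, structured


W, structured = sample_rank1_connectivity(g=0.8, M_m=1.5, M_n=2.0)
```

Only three numbers parameterize the whole *N* × *N* matrix, which is what makes inference over *z* tractable compared with inference over every individual weight.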

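The task we consider is Gaussian posterior conditioning: calculate the parameters of a posterior distribution induced by a Gaussian prior and a Gaussian likelihood, given a single observation *y*. Conjugacy offers the result analytically: the posterior is Gaussian with mean *μ*_{post} and variance *σ*_{post}^{2}, where, for a prior with mean *μ*_{0} and variance *σ*_{0}^{2} and a likelihood with variance *σ*_{y}^{2}:

$$\mu_{post} = \frac{\frac{\mu_0}{\sigma_0^2} + \frac{y}{\sigma_y^2}}{\frac{1}{\sigma_0^2} + \frac{1}{\sigma_y^2}}, \qquad \sigma_{post}^2 = \frac{1}{\frac{1}{\sigma_0^2} + \frac{1}{\sigma_y^2}}$$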

To solve this Gaussian posterior conditioning task, the RNN response to a constant input *h* = *yr* + (*n* − *M*_{n}) must equal the posterior mean along readout vector *r*, where

$$\kappa_r = \frac{1}{N} \sum_{j=1}^{N} r_j \phi(x_j)$$

Additionally, the amount of chaotic variance Δ_{T} must equal the posterior variance. Theory for low-rank RNNs allows us to express *κ*_{r} and Δ_{T} in terms of each other through a solvable system of nonlinear equations (see Section 5.2.4) [26]. This theory facilitates the mathematical formalization of task execution into an emergent property, where the emergent property statistics of the RNN activity are *κ*_{r} and Δ_{T}, and the emergent property values are the ground truth posterior mean *μ*_{post} and variance *σ*_{post}^{2}:

We chose a substantial amount of variance in these emergent property statistics, so that the inferred distribution resulted in RNNs with a variety of errors in their solutions to the Gaussian posterior conditioning problem.
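The ground truth that the RNN must reproduce follows from the standard conjugate update for a Gaussian prior and a Gaussian likelihood with known variance. The sketch below implements that update; the prior and likelihood values in the example are hypothetical, chosen only so that an observation of *y* = 2 gives a posterior mean of 3, and the function name is illustrative.

```python
def gaussian_posterior(y, mu_0, var_0, var_y):
    """Posterior mean and variance for prior N(mu_0, var_0) and
    likelihood y ~ N(mu, var_y), given a single observation y."""
    precision = 1.0 / var_0 + 1.0 / var_y          # precisions add
    var_post = 1.0 / precision
    mu_post = var_post * (mu_0 / var_0 + y / var_y)  # precision-weighted mean
    return mu_post, var_post


# Hypothetical prior and likelihood parameters (assumptions for illustration).
mu_post, var_post = gaussian_posterior(y=2.0, mu_0=4.0, var_0=1.0, var_y=1.0)
```

With equal prior and likelihood variances the posterior mean is the midpoint between the prior mean and the observation, and the posterior variance is half of either variance. An RNN that over- or under-estimates these two quantities is making the kinds of error analyzed below.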

EPI was used to learn distributions of RNN connectivity properties *z* = [*g*, *M*_{m}, *M*_{n}] executing Gaussian posterior conditioning given an input of *y* = 2, where the true posterior is *μ*_{post} = 3 and *σ*_{post} = 0.5 (Fig. 5A). We examined the nature of the over- and under-estimation of the posterior means (Fig. 5B left) and variances (Fig. 5B right) in the inferred distributions (300 samples). The symmetry in the *M*_{m}-*M*_{n} plane suggests a degeneracy in the product of *M*_{m} and *M*_{n} (Fig. 5B). Indeed, *M*_{m}*M*_{n} strongly determines the posterior mean (*r* = 0.62, *p* < 10^{−4}). Furthermore, the random strength *g* strongly determines the chaotic variance (*r* = 0.56, *p* < 10^{−4}). Neither of these observations was obvious from what mathematical analysis is available in networks of this type (see Section 5.2.4). While the link between random strength *g* and chaotic variance Δ_{T} (and, as a result, posterior variance in this problem) is well-known [3], the distribution admits a novel hypothesis: the estimation of the posterior mean by the RNN increases with *M*_{m}*M*_{n}.

We tested this prediction by taking parameters *z*_{1} and *z*_{2} as representative samples from the positive and negative *M*_{m}-*M*_{n} quadrants, respectively. Instead of using the theoretical predictions shown in Figure 5B, we simulated finite-size realizations of these networks with 2,000 neurons (e.g. Fig. 5C). We perturbed the product *M*_{m}*M*_{n} of these parameter choices, clarifying that the posterior mean can be directly controlled in this way (Fig. 5D; *p* < 10^{−4}; see Section 5.2.4). Thus, EPI confers a clear picture of error in this computation: the product of the low-rank vector means *M*_{m} and *M*_{n} modulates the estimated posterior mean, while the random strength *g* modulates the estimated posterior variance. This novel procedure of inference on reduced parameterizations of RNNs conditioned on the emergent property of task execution is generalizable to other settings modeled in [26], like noisy integration and context-dependent decision making (Fig. S5).

## 4 Discussion

### 4.1 EPI is a general tool for theoretical neuroscience

Biologically realistic models of neural circuits are comprised of complex nonlinear differential equations, making traditional theoretical analysis and statistical inference intractable. We advance the capabilities of statistical inference in theoretical neuroscience by presenting EPI, a deep inference methodology for learning parameter distributions of theoretical models performing neural computation. We have demonstrated the utility of EPI on biological models (STG), intermediate-level models of interacting genetically- and functionally-defined neuron-types (V1, SC), and the most abstract of models (RNNs). We are able to condition both deterministic and stochastic models on low-level emergent properties like spiking frequency of membrane potentials, as well as high-level cognitive function like posterior conditioning. Technically, EPI is tractable when the emergent property statistics are continuously differentiable with respect to the model parameters, which is very often the case; this emphasizes the general applicability of EPI.

In this study, we have focused on applying EPI to low dimensional parameter spaces of models with low dimensional dynamical states. These choices were made to present the reader with a series of interpretable conclusions, which is more challenging in high dimensional spaces. In fact, EPI should scale reasonably to high dimensional parameter spaces, as the underlying technology has produced state-of-the-art performance on high-dimensional tasks such as texture generation [20]. Of course, increasing the dimensionality of the dynamical state of the model makes optimization more expensive, and there is a practical limit there as with any machine learning approach. However, theoretical approaches (e.g. [26]) can be used to reason about the holistic activity of such high dimensional systems by introducing some degree of additional structure into the model.

### 4.2 Novel hypotheses from EPI

In neuroscience, machine learning has primarily been used to reveal structure in large-scale neural datasets [52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62] (see review, [15]). Careful inference procedures have been developed for these statistical models, allowing precise, quantitative reasoning that clarifies the way data informs beliefs about the model parameters. However, these statistical models lack resemblance to the underlying biology, making it unclear how to go from the structure revealed by these methods to the neural mechanisms giving rise to it. In contrast, theoretical neuroscience has focused on careful mechanistic modeling and the production of emergent properties of computation. The careful steps of *i*.) model design and *ii*.) emergent property definition are followed by *iii*.) practical inference methods, resulting in an opaque characterization of the way model parameters govern computation. In this work, we replaced this opaque procedure of parameter identification in theoretical neuroscience with emergent property inference, opening the door to careful inference in careful models of neural computation.

Biologically realistic models of neural circuits often prove formidable to analyze. Two main factors contribute to the difficulty of this endeavor. First, in most neural circuit models, the number of parameters scales quadratically with the number of neurons, limiting analysis of their parameter spaces. Second, even in low dimensional circuits, the structure of the parametric regimes governing emergent properties is intricate. For example, these circuit models can support more than one steady state [63] and non-trivial dynamics on strange attractors [64].

In Section 3.3, we advanced the tractability of low-dimensional neural circuit models by showing that EPI offers insights about cell-type specific input-responsivity that cannot be afforded through the available linear analytical methods [24, 45, 46]. By flexibly conditioning this V1 model on different emergent properties, we performed an exploratory analysis of a *model* rather than a dataset, generating a set of testable hypotheses, which were then confirmed. Furthermore, exploratory analyses can be directed towards formulating hypotheses of a specific form. For example, model parameter dependencies on behavioral performance can be assessed by using EPI to condition on various levels of task accuracy (see Section 3.4). This analysis identified experimentally testable predictions (confirmed *in-silico*) of patterns of effective connectivity in SC that should be correlated with increased performance.

In our final analysis, we presented a novel procedure for doing statistical inference on interpretable parameterizations of RNNs executing simple tasks. Specifically, we analyzed RNNs solving a posterior conditioning problem in the spirit of [65, 66]. This methodology relies on recently extended theory of responses in random neural networks with low-rank structure [26]. While we focused on rank-1 RNNs, which were sufficient for solving this task, this inference procedure generalizes to RNNs of greater rank necessary for more complex tasks. The ability to apply the probabilistic model selection toolkit to RNNs should prove invaluable as their use in neuroscience increases.

EPI leverages deep learning technology for neuroscientific inquiry in a categorically different way than approaches focused on training neural networks to execute behavioral tasks [67]. These works focus on examining optimized deep neural networks while considering the objective function, learning rule, and architecture used. This endeavor efficiently obtains sets of parameters that can be reasoned about with respect to such considerations, but lacks the careful probabilistic treatment of parameter inference in EPI. These approaches can be used complementarily to enhance the practice of theoretical neuroscience.

## Data availability statement

The datasets generated during and/or analysed during the current study are available from the corresponding author upon reasonable request.

## Code availability statement

The software written for the current study is available from the corresponding author upon reasonable request.

## 5 Methods

### 5.1 Emergent property inference (EPI)

Emergent property inference (EPI) learns distributions of theoretical model parameters that produce emergent properties of interest by combining ideas from maximum entropy flow networks (MEFNs) [20] and likelihood-free variational inference (LFVI) [21]. Consider a model parameterization *z* and data *x* with an intractable likelihood *p*(*x* | *z*), defined by a model simulator from which samples *x* ~ *p*(*x* | *z*) are available. EPI optimizes a distribution *q*_{θ}(*z*) (itself parameterized by *θ*) of model parameters *z* to produce an emergent property of interest:

$$\mathbb{E}_{z \sim q_\theta}\left[\mathbb{E}_{x \sim p(x \mid z)}\left[T(x)\right]\right] = \mu$$

Precisely, the emergent property statistics *T* (*x*) must equal the emergent property values *μ*, in expectation over the EPI distribution of parameters *q*_{θ}(*z*) and the distribution of simulated activity *p*(*x* | *z*). This is a viable way to represent emergent properties in theoretical models, as we have demonstrated in the main text, and enables the EPI optimization.

With EPI, we use deep probability distributions to learn flexible approximations to model parameter distributions *q*_{θ}(*z*). In deep probability distributions, a simple random variable *w* ~ *q*_{0}(*w*) is mapped deterministically via a sequence of deep neural network layers (*f*_{1}, …, *f*_{l}) parameterized by weights and biases *θ* to the support of the distribution of interest:

$$z = f_l \circ \cdots \circ f_1(w)$$

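Given a simulator defined by a theoretical model *x* ~ *p*(*x* | *z*) and some emergent property of interest, *q*_{θ}(*z*) is optimized via the neural network parameters *θ* to find a maximally entropic distribution within the deep variational family producing the emergent property:

$$q_\theta^*(z) = \underset{q_\theta}{\mathrm{argmax}} \; H(q_\theta(z)) \quad \text{s.t.} \quad \mathbb{E}_{z \sim q_\theta}\left[\mathbb{E}_{x \sim p(x \mid z)}\left[T(x)\right]\right] = \mu$$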

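Since we are optimizing parameters *θ* of our deep probability distribution with respect to the entropy *H*(*q*_{θ}(*z*)), we must take gradients with respect to the log probability density of samples from the deep probability distribution. Entropy of *q*_{θ}(*z*) can be expressed as an expectation of the negative log density of parameter samples *z* over the randomness in the parameterless initial distribution *q*_{0}:

$$H(q_\theta(z)) = \int -q_\theta(z) \log q_\theta(z) \, dz = \mathbb{E}_{z \sim q_\theta}\left[-\log q_\theta(z)\right] = \mathbb{E}_{w \sim q_0}\left[-\log q_\theta\left(f_l \circ \cdots \circ f_1(w)\right)\right]$$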

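Thus, the gradient of the entropy of the deep probability distribution can be estimated as an average of gradients of the log density of samples *z*:

$$\nabla_\theta H(q_\theta(z)) = \mathbb{E}_{w \sim q_0}\left[-\nabla_\theta \log q_\theta\left(f_l \circ \cdots \circ f_1(w)\right)\right]$$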
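As a hedged numerical illustration of this Monte Carlo view of entropy, the sketch below uses a single affine layer *z* = *m* + *Lw* as a stand-in for the deep network (an assumption made only so that the exact entropy is known). The function names and the values of `m` and `L` are hypothetical and are not part of the EPI software.

```python
import numpy as np

def sample_affine_flow(m, L, n, rng):
    """Draw z = m + L w with w ~ N(0, I); return samples and their log density."""
    d = m.shape[0]
    w = rng.standard_normal((n, d))
    z = m + w @ L.T
    # Log density of w under the parameterless base distribution q_0.
    log_q0 = -0.5 * np.sum(w ** 2, axis=1) - 0.5 * d * np.log(2.0 * np.pi)
    # Change of variables for a bijective layer: log q(z) = log q_0(w) - log|det L|.
    log_q = log_q0 - np.log(np.abs(np.linalg.det(L)))
    return z, log_q

def entropy_estimate(log_q):
    """Entropy as the average negative log density of the samples."""
    return -np.mean(log_q)

rng = np.random.default_rng(0)
m = np.array([0.0, 0.0])                 # hypothetical shift
L = np.array([[1.0, 0.0], [0.5, 2.0]])   # hypothetical invertible layer
z, log_q = sample_affine_flow(m, L, 100_000, rng)

H_mc = entropy_estimate(log_q)
# Exact entropy of a Gaussian with covariance L L^T, for comparison.
H_exact = 0.5 * 2 * (1.0 + np.log(2.0 * np.pi)) + np.log(np.abs(np.linalg.det(L)))
```

In a deep flow the same average is differentiated with respect to *θ* by automatic differentiation; the affine case is used here only because it can be checked against a closed form.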

In EPI, MEFNs are repurposed for variational learning of model parameter distributions. A closely related methodology, variational inference, uses optimization to approximate posterior distributions [68]. Standard methods like stochastic gradient variational Bayes [6] or black box variational inference [69] simply do not work for inference in theoretical models of neural circuits, since they require tractable likelihoods *p*(*x* | *z*). Work on likelihood-free variational inference (LFVI) [21], which like EPI seeks to do inference in models with intractable likelihoods, employs an additional deep neural network as a ratio estimator, enabling an estimation of the optimization objective for variational inference. Like LFVI, EPI can be framed as variational inference (see Section 5.1.4). However, unlike LFVI, EPI uses a single deep network to learn a distribution and is optimized to produce an emergent property, rather than to condition on data points. Optimizing the EPI objective is a technological challenge, the details of which we elaborate in Section 5.1.2. Before going through those details, we ground this optimization in a toy example.

#### 5.1.1 Example: 2D LDS

To gain intuition for EPI, consider a two-dimensional linear dynamical system (2D LDS) model (Fig. S1A):

$$\tau \frac{dx}{dt} = Ax$$

with

$$A = \begin{bmatrix} a_1 & a_2 \\ a_3 & a_4 \end{bmatrix}$$

To run EPI with the dynamics matrix elements as the free parameters *z* = [*a*_{1}, *a*_{2}, *a*_{3}, *a*_{4}] (fixing *τ* = 1), the emergent property statistics *T* (*x*) were chosen to contain the first and second moments of the oscillatory frequency, imag(*λ*_{1})/(2*π*), and the growth/decay factor, real(*λ*_{1}), of the oscillating system. *λ*_{1} is the eigenvalue of greatest real part when the imaginary component is zero, and alternatively of positive imaginary component when the eigenvalues are complex conjugate pairs. To learn the distribution of real entries of *A* that produce a band of oscillating systems around 1 Hz, we formalized this emergent property as real(*λ*_{1}) having mean zero with variance 0.25^{2}, and the oscillation frequency imag(*λ*_{1})/(2*π*) having mean *ω* = 1 Hz with variance (0.1 Hz)^{2}:

Unlike the models we presented in the main text, this model admits an analytical form for the mean emergent property statistics given parameter *z*, since the eigenvalues can be calculated using the quadratic formula:

$$\lambda = \frac{(a_1 + a_4) \pm \sqrt{(a_1 + a_4)^2 + 4(a_2 a_3 - a_1 a_4)}}{2\tau}$$
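The closed-form statistics can be sketched in a few lines. This is an illustrative implementation, not the study's code: the function name `lds_statistics` is hypothetical, and the frequency in Hz is taken to be imag(*λ*_{1})/(2*π*), which is an assumption chosen to agree with the constraint imag(*λ*_{1})^{2} = (2*π*)^{2} used for *ω* = 1 later in this section.

```python
import cmath
import numpy as np

def lds_statistics(z, tau=1.0):
    """Return (real(lambda_1), frequency in Hz) for tau * dx/dt = A x.

    z = [a1, a2, a3, a4] holds the row-major entries of the 2x2 matrix A.
    """
    a1, a2, a3, a4 = z
    trace = a1 + a4
    det = a1 * a4 - a2 * a3
    # Quadratic formula for the eigenvalues of A / tau.
    root = cmath.sqrt(complex(trace ** 2 - 4.0 * det))
    # The "+" branch has the greatest real part when the eigenvalues are real,
    # and the positive imaginary part when they form a complex conjugate pair.
    lam1 = (trace + root) / (2.0 * tau)
    return lam1.real, lam1.imag / (2.0 * np.pi)

# A pure 1 Hz oscillator: a1 + a4 = 0 and a1*a4 - a2*a3 = (2*pi)^2.
growth, freq_hz = lds_statistics([0.0, 2.0 * np.pi, -2.0 * np.pi, 0.0])
```

Because the statistics are a differentiable closed-form function of *z*, no simulation is needed for this model when computing the constraint terms.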

Importantly, even though the mean emergent property statistics are calculable directly via a closed-form function and do not require simulation, we cannot derive the maximum entropy distribution directly. This fact is due to the formally hard problem of the backward mapping: finding the natural parameters *η* from the mean parameters *μ* of an exponential family distribution [70]. Instead, we used EPI to approximate this distribution (Fig. S1B). We used a real NVP normalizing flow architecture with four masks, two neural network layers of 15 units per mask, and batch normalization momentum 0.99, mapped onto a support of *z*_{i} ∈ [−10, 10] (see Section 5.1.3).

Even this relatively simple system has nontrivial (though intuitively sensible) structure in the parameter distribution. To validate our method, we analytically derived the contours of the probability density from the emergent property statistics and values. In the *a*_{1}-*a*_{4} plane, the black line at , dotted black line at the standard deviation , and the dotted gray line at twice the standard deviation follow the contour of probability density of the samples (Fig. S2A). The distribution precisely reflects the desired statistical constraints and model degeneracy in the sum of *a*_{1} and *a*_{4}. Intuitively, the parameters equivalent with respect to emergent property statistic real(*λ*_{1}) have similar log densities.

To explain the bimodality of the EPI distribution, we examined the imaginary component of *λ*_{1}. When , we have

When *τ* = 1 and *a*_{1}*a*_{4} > *a*_{2}*a*_{3} (center of distribution above), we have the following equation for the other two dimensions:

Since we constrained (with *ω* = 1), we can plot contours of the equation imag(*λ*_{1})^{2} = *a*_{1}*a*_{4} −*a*_{2}*a*_{3} = (2*π*)^{2} for various *a*_{1}*a*_{4} (Fig. S2B). With , we show the contours as *a*_{1}*a*_{4} = 0 (black), *a*_{1}*a*_{4} = −*σ*_{1,4} (black dotted), and *a*_{1}*a*_{4} = −2*σ*_{1,4} (grey dotted). This validates the curved structure of the inferred distribution learned through EPI. We took steps in negative standard deviation of *a*_{1}*a*_{4} (dotted and gray lines), since there are few positive values *a*_{1}*a*_{4} in the learned distribution. Subtler combinations of model and emergent property will have more complexity, further motivating the use of EPI for understanding these systems. As we expect, the distribution results in samples of two-dimensional linear systems oscillating near 1Hz (Fig. S3).

#### 5.1.2 Augmented Lagrangian optimization

To optimize *q*_{θ}(*z*) in Equation 17, the constrained optimization is executed using the augmented Lagrangian method. The following objective is minimized:

$$L(\theta; \eta, c) = -H(q_\theta) + \eta^\top R(\theta) + \frac{c}{2} \lVert R(\theta) \rVert^2$$

where $R(\theta) = \mathbb{E}_{z \sim q_\theta}\left[\mathbb{E}_{x \sim p(x \mid z)}\left[T(x) - \mu\right]\right]$ is the vector of constraint violations, *η* ∈ ℝ^{m} are the Lagrange multipliers where *m* = |*μ*| = |*T* (*x*)|, and *c* is the penalty coefficient. These Lagrange multipliers are closely related to the natural parameters of exponential families (see Section 5.1.4). Deep neural network weights and biases *θ* of the deep probability distribution are optimized according to Equation 26 using the Adam optimizer with its standard parameterization [71]. *η* is initialized to the zero vector and adapted following each augmented Lagrangian epoch, which is a period of optimization with fixed (*η*, *c*) for a given number of stochastic optimization iterations. A low value of *c* is used initially, and conditionally increased after each epoch based on constraint error reduction. For example, the initial value of *c* was *c*_{0} = 10^{−3} during EPI with the oscillating 2D LDS (Fig. S1C). The penalty coefficient is updated based on the result of a hypothesis test regarding the reduction in constraint violation. The p-value of this test is computed, and *c*_{k+1} is updated to *βc*_{k} with probability 1 − *p*. The Lagrange multipliers *η* are updated using a batch of size *n*. Throughout the study, *β* = 4.0, *γ* = 0.25, and the batch size was a hyperparameter, which varied according to the application of EPI.

The intention is that *c* and *η* start at values encouraging entropic growth early in optimization. With each training epoch in which the update rule for *c* is invoked by unsatisfactory constraint error reduction, the constraint satisfaction terms are increasingly weighted, resulting in a decreased entropy. This encourages the discovery of suitable regions of parameter space, and the subsequent refinement of the distribution to produce the emergent property. In the oscillating 2D LDS example, each augmented Lagrangian epoch ran for 2,000 iterations (Fig. S1C-D). Notice the initial entropic growth, and subsequent reduction upon each update of *η* and *c*. The momentum parameters of the Adam optimizer were reset at the end of each augmented Lagrangian epoch.
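The bookkeeping of one augmented Lagrangian epoch can be sketched as follows. This assumes the textbook form of the method (negative entropy, plus a linear multiplier term, plus a quadratic penalty) and the textbook multiplier update *η* ← *η* + *cR*; the function names are hypothetical, and the hypothesis-test rule for increasing *c* is deliberately not reproduced.

```python
import numpy as np

def constraint_violation(T_x, mu):
    """Sample estimate of R(theta): mean statistics minus target values."""
    return T_x.mean(axis=0) - mu

def augmented_lagrangian(log_q, T_x, mu, eta, c):
    """Sample estimate of -H(q) + eta^T R + (c / 2) * ||R||^2."""
    R = constraint_violation(T_x, mu)
    neg_entropy = np.mean(log_q)  # -H(q) is approximated by the mean log density
    return neg_entropy + eta @ R + 0.5 * c * (R @ R)

def update_multipliers(eta, c, T_x, mu):
    """Standard multiplier update applied after an epoch with fixed (eta, c)."""
    return eta + c * constraint_violation(T_x, mu)

# Tiny worked example with two samples and two statistics.
log_q = np.array([-1.0, -1.0])
T_x = np.array([[1.0, 2.0], [3.0, 4.0]])   # sample mean is [2, 3]
mu = np.array([2.0, 2.0])                  # so R = [0, 1]
eta = np.zeros(2)
c = 2.0

objective = augmented_lagrangian(log_q, T_x, mu, eta, c)
eta_next = update_multipliers(eta, c, T_x, mu)
```

With *η* = 0 and small *c*, the objective is dominated by the entropy term, which matches the intended early entropic growth; as *η* and *c* grow, the constraint terms take over.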

Rather than starting optimization from some *θ* drawn from a randomized distribution, we found that initializing *q*_{θ}(*z*) to approximate an isotropic Gaussian distribution conferred more stable, consistent optimization. The parameters of the Gaussian initialization were chosen on an application-specific basis. Throughout the study, we chose isotropic Gaussian initializations with mean *μ*_{init} at the center of the distribution support and some standard deviation *σ*_{init}, except for one case, where an initialization informed by random search was used (see Section 5.2.2).

To assess whether the EPI distribution *q*_{θ}(*z*) produces the emergent property, we defined a hypothesis-testing convergence criterion. The algorithm has converged when the null hypothesis that the constraint violation *R*(*θ*)_{i} is zero is not rejected for all constraints *i* ∈ {1, …, *m*} at a significance threshold *α* = 0.05. This significance threshold is adjusted through Bonferroni correction according to the number of constraints *m*. The p-values for each constraint are calculated according to a two-tailed nonparametric test, where 200 estimations of the sample mean *R*(*θ*)_{i} are made from *k* resamplings of *z* from a finite sample of size *n* taken at the end of the augmented Lagrangian epoch. *k* is determined by a fraction of the batch size *ν*, which varies according to the application. In the linear two-dimensional system example, we used a batch size of *n* = 1000 and set *ν* = 0.1, resulting in convergence after the ninth epoch of optimization (Fig. S1C-D black dotted line).
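One way such a resampling test could be implemented is sketched below. The exact form of the two-tailed p-value (twice the smaller tail fraction of bootstrap means on either side of zero) is an assumption, as is the function name `epi_converged`; only the 200 resamplings, the fraction *ν*, *α* = 0.05, and the Bonferroni adjustment are taken from the description above.

```python
import numpy as np

def epi_converged(R_samples, nu=0.1, n_boot=200, alpha=0.05, rng=None):
    """Bootstrap test that every constraint violation is consistent with zero.

    R_samples has shape (n, m): per-sample values of T(x) - mu for m constraints.
    Returns (converged, p_values).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, m = R_samples.shape
    k = max(1, int(nu * n))                       # resample size as a fraction of n
    idx = rng.integers(0, n, size=(n_boot, k))    # resampling with replacement
    boot_means = R_samples[idx].mean(axis=1)      # shape (n_boot, m)
    frac_pos = (boot_means > 0).mean(axis=0)
    frac_neg = (boot_means < 0).mean(axis=0)
    # Two-tailed p-value: how often the resampled mean falls on either side of 0.
    p_values = np.minimum(1.0, 2.0 * np.minimum(frac_pos, frac_neg))
    # Bonferroni correction: compare each p-value with alpha / m.
    return bool(np.all(p_values > alpha / m)), p_values

rng = np.random.default_rng(0)
half = rng.standard_normal((500, 2))
centered = np.concatenate([half, -half])          # sample mean is exactly zero
ok_centered, _ = epi_converged(centered, rng=rng)
ok_shifted, _ = epi_converged(centered + 5.0, rng=rng)
```

In this toy check the centered sample passes and the shifted sample fails, which is the behavior the convergence criterion is meant to capture.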

When assessing the suitability of EPI for a particular modeling question, there are some important technical considerations. First and foremost, as in any optimization problem, the defined emergent property should always be appropriately conditioned (constraints should not have wildly different units). Furthermore, if the program is underconstrained (not enough constraints), the distribution grows (in entropy) unstably unless mapped to a finite support. If overconstrained, there is no parameter set producing the emergent property, and EPI optimization will fail (appropriately). Next, one should consider the computational cost of the gradient calculations. In the best circumstance, there is a simple, closed form expression (e.g. Section 5.1.1) for the emergent property statistic given the model parameters. On the other end of the spectrum, many forward simulation iterations may be required before a high quality measurement of the emergent property statistic is available (e.g. Section 5.2.1). In such cases, optimization will be expensive.

#### 5.1.3 Normalizing flows

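Deep probability models typically consist of several layers of fully connected neural networks. When each neural network layer is restricted to be a bijective function, the sample density can be calculated using the change of variables formula at each layer of the network. For *z*′ = *f* (*z*),

$$q(z') = q\left(f^{-1}(z')\right) \left| \det \frac{\partial f^{-1}(z')}{\partial z'} \right| = q(z) \left| \det \frac{\partial f(z)}{\partial z} \right|^{-1}$$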

However, this computation has cubic complexity in dimensionality for fully connected layers. By restricting our layers to normalizing flows [17], which are bijective functions with fast log determinant Jacobian computations, we can tractably optimize deep generative models with objectives that are a function of sample density, like entropy. Most of our analyses use either a planar flow [17] or real NVP [72], which have proven effective in our architecture searches. Planar flow architectures are specified by the number of planar bijection layers used, while real NVP architectures are specified by the number of masks, neural network layers per mask, units per layer, and batch normalization momentum parameter.
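To make the "fast log determinant" point concrete, here is a minimal sketch of a single planar layer, *f*(*z*) = *z* + *u* tanh(*w*ᵀ*z* + *b*), whose log determinant Jacobian costs only a dot product. The parameter values are hypothetical, and the brute-force finite-difference check is included only to confirm the shortcut on one point.

```python
import numpy as np

def planar_flow(z, u, w, b):
    """Apply one planar layer to a batch z of shape (n, d).

    Returns the transformed batch and log|det Jacobian| for each sample.
    """
    a = z @ w + b                                    # pre-activation, shape (n,)
    f = z + np.outer(np.tanh(a), u)
    psi = (1.0 - np.tanh(a) ** 2)[:, None] * w       # tanh'(a) * w, shape (n, d)
    # Matrix determinant lemma: det(I + u psi^T) = 1 + psi^T u, an O(d) computation.
    log_det = np.log(np.abs(1.0 + psi @ u))
    return f, log_det

def numerical_log_det(z0, u, w, b, eps=1e-6):
    """Brute-force log|det Jacobian| at a single point via central differences."""
    d = z0.shape[0]
    J = np.zeros((d, d))
    for j in range(d):
        dz = np.zeros(d)
        dz[j] = eps
        f_plus, _ = planar_flow((z0 + dz)[None, :], u, w, b)
        f_minus, _ = planar_flow((z0 - dz)[None, :], u, w, b)
        J[:, j] = (f_plus[0] - f_minus[0]) / (2.0 * eps)
    return np.log(np.abs(np.linalg.det(J)))

u = np.array([0.5, -0.3])    # hypothetical parameters with w @ u > -1 (invertible)
w = np.array([1.0, 0.4])
b = 0.1
z0 = np.array([0.2, -0.7])

_, fast = planar_flow(z0[None, :], u, w, b)
slow = numerical_log_det(z0, u, w, b)
```

Stacking several such layers and summing their `log_det` terms yields the sample log density needed for the entropy term of the objective.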

#### 5.1.4 Emergent property inference as variational inference in an exponential family

Now that we have fully described the EPI method, we consider its broader contextualization as a statistical method and its relation to Bayesian inference. In Bayesian inference, a prior belief about model parameters *z* is formalized into a prior distribution *p*(*z*), and the statistical model capturing the effect of *z* on observed data points *x* is formalized in the likelihood distribution *p*(*x* | *z*). Using Bayes’ rule, we obtain a posterior distribution *p*(*z* | *x*), which captures how the data inform our knowledge of model parameters:

$$p(z \mid x) \propto p(x \mid z) \, p(z)$$

The posterior distribution is analytically available when the prior is conjugate with the likelihood. However, conjugacy is rare in practice, and alternative methods, such as variational inference [68], are utilized.

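As we compare EPI to variational inference, it is important to consider that EPI is a maximum entropy method, and that maximum entropy methods have a fundamental relationship with exponential family distributions. A maximum entropy distribution of form:

$$p^*(z) = \underset{p}{\mathrm{argmax}} \; H(p(z)) \quad \text{s.t.} \quad \mathbb{E}_{z \sim p}\left[T(z)\right] = \mu$$

will have probability density in the exponential family:

$$p^*(z) \propto \exp\left(\eta^\top T(z)\right)$$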

The mappings between the mean parameterization *μ* and the natural parameterization *η* are formally hard to identify [70].

Now, consider the goal of doing variational inference with an exponential family posterior distribution *p*(*z* | *x*). We use the following abbreviated notation to collect the base measure *b*(*z*) and sufficient statistics *T* (*z*) into and likewise concatenate a 1 onto the end of the natural parameter . The log normalizing constant *A*(*η*(*x*)) remains unchanged:

Variational inference with an exponential family posterior distribution uses optimization to minimize the following divergence [68]:
*q*_{θ}(*z*) is the variational approximation to the posterior with variational parameters *θ*. We can write this KL divergence in terms of entropy of the variational approximation:

As far as the variational optimization is concerned, the log normalizing constant is independent of *q*_{θ}(*z*), so it can be dropped

Further, we can write the objective in terms of the first moment of the sufficient statistics : which simplifies to

In comparison, in emergent property inference (EPI), we solve the following problem:

The Lagrangian objective (without augmentation) is

Thus, as the optimization proceeds, should converge to the natural parameter through its adaptations in each epoch (see Section 5.1.2).
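The steps of this comparison can be written out explicitly. The following is a sketch under assumed notation: the tilde quantities, the use of log *b*(*z*) as the appended statistic, and the symbol *η*_{EPI} for the EPI Lagrange multipliers are labels chosen here for illustration and may differ from the original equations.

$$
\begin{aligned}
p(z \mid x) &= b(z) \exp\left(\eta(x)^\top T(z) - A(\eta(x))\right) = \exp\left(\tilde{\eta}(x)^\top \tilde{T}(z) - A(\eta(x))\right), \\
\tilde{T}(z) &= \begin{bmatrix} T(z) \\ \log b(z) \end{bmatrix}, \qquad \tilde{\eta}(x) = \begin{bmatrix} \eta(x) \\ 1 \end{bmatrix}, \\
\mathrm{KL}\left(q_\theta \,\|\, p(z \mid x)\right) &= -H(q_\theta) - \mathbb{E}_{z \sim q_\theta}\left[\tilde{\eta}(x)^\top \tilde{T}(z)\right] + A(\eta(x)), \\
\theta^* &= \underset{\theta}{\mathrm{argmin}} \; -H(q_\theta) - \tilde{\eta}(x)^\top \mathbb{E}_{z \sim q_\theta}\left[\tilde{T}(z)\right], \\
L_{\mathrm{EPI}}(\theta, \eta_{\mathrm{EPI}}) &= -H(q_\theta) + \eta_{\mathrm{EPI}}^\top \left(\mathbb{E}_{z \sim q_\theta}\left[\mathbb{E}_{x \sim p(x \mid z)}\left[T(x)\right]\right] - \mu\right).
\end{aligned}
$$

Both the variational objective and the EPI Lagrangian trade off entropy against a linear function of expected statistics, which is why the converged multipliers play the role of a natural parameter, up to the sign convention chosen for the multipliers and a term that is constant in *θ*.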

We have shown that there is indeed a clear relationship between Bayesian inference and EPI. Specifically, EPI is executing variational inference in an exponential family posterior, whose sufficient statistics are the emergent property statistics and mean parameterization are the emergent property values. However, in EPI we have not specified a prior distribution, or collected data, which can inform us about model parameters. Instead we have a mathematical specification of an emergent property, which the model must produce, and a maximum entropy selection principle. Accordingly, we replace the notation of *p*(*z* | *x*) with conceptualizing an inferred distribution that obeys emergent property (see Section 5.1).

### 5.2 Theoretical models

In this study, we used emergent property inference to examine several models relevant to theoretical neuroscience. Here, we provide the details of each model and the related analyses.

#### 5.2.1 Stomatogastric ganglion

We analyze how the parameters *z* = [*g*_{el}, *g*_{synA}] govern the emergent phenomenon of network syncing in a model of the stomatogastric ganglion (STG) [23] shown in Figure 1A with activity *x* = [*x*_{f1}, *x*_{f2}, *x*_{hub}, *x*_{s1}, *x*_{s2}], using the same hyperparameter choices as Gutierrez et al. Each neuron’s membrane potential *x*_{α}(*t*) for *α* ∈ {f1, f2, hub, s1, s2} is the solution of the following differential equation:

The membrane potential of each neuron is affected by the leak, calcium, potassium, hyperpolarization, electrical and synaptic currents, respectively, which are functions of all membrane potentials and the conductance parameters *z*. The capacitance of the cell membrane was set to *C*_{m} = 1*nF*. Specifically, the currents are the difference in the neuron’s membrane potential and that current type’s reversal potential multiplied by a conductance:

The reversal potentials were set to *V*_{leak} = −40*mV*, *V*_{Ca} = 100*mV*, *V*_{K} = −80*mV*, *V*_{hyp} = −20*mV*, and *V*_{syn} = −75*mV*. The leak conductance was fixed to *g*_{leak} = 1 × 10^{−4}*μS*. The conductances *g*_{Ca}, *g*_{K}, and *g*_{hyp} took different values depending on whether the neuron was fast, intermediate (hub), or slow. The fast conductances had values *g*_{Ca} = 1.9 × 10^{−2}, *g*_{K} = 3.9 × 10^{−2}, and *g*_{hyp} = 2.5 × 10^{−2}. The intermediate conductances had values *g*_{Ca} = 1.7 × 10^{−2}, *g*_{K} = 1.9 × 10^{−2}, and *g*_{hyp} = 8.0 × 10^{−3}. Finally, the slow conductances had values *g*_{Ca} = 8.5 × 10^{−3}, *g*_{K} = 1.5 × 10^{−2}, and *g*_{hyp} = 1.0 × 10^{−2}.

Furthermore, the calcium, potassium, and hyperpolarization channels have time-dependent gating dynamics dependent on steady-state gating variables *M*_{∞}, *N*_{∞} and *H*_{∞}, respectively:
where we set *v*_{1} = 0*mV*, *v*_{2} = 20*mV*, *v*_{3} = 0*mV*, *v*_{4} = 15*mV*, *v*_{5} = 78.3*mV*, *v*_{6} = 10.5*mV*, *v*_{7} = −42.2*mV*, *v*_{8} = 87.3*mV*, *v*_{9} = 5*mV*, and *v*_{th} = −25*mV*.

Finally, there is a synaptic gating variable as well:

When the dynamic gating variables are considered, this is actually a 15-dimensional nonlinear dynamical system.

In order to measure the frequency of the hub neuron during EPI, the STG model was simulated for *T* = 200 time steps of *dt* = 25*ms*. In EPI, since gradients are taken through the simulation process, the number of time steps is kept modest if possible. The chosen *dt* and *T* were the most computationally convenient choices yielding accurate frequency measurement. Because the discrete Fourier transform affords poor frequency resolution, we instead measured spiking frequency using a basis of complex exponentials with frequencies from 0.0 to 1.0 Hz at 0.01 Hz resolution, Φ = [0.0, 0.01, *…,* 1.0]^{⊤}.

Another consideration was that the frequency spectra of the neuron membrane potentials had several peaks. High-frequency sub-threshold activity obscured the maximum frequency measurement in the complex exponential basis. Accordingly, subthreshold activity was set to zero, and the whole signal was low-pass filtered with a moving average window of length 20. The signal was subsequently mean centered. After this preprocessing, the maximum frequency in the filter bank accurately reflected the firing frequency.

Finally, to differentiate through the maximum frequency identification, we used a sum-of-powers normalization. Let be the complex exponential filter bank dot products with the signal *x*_{α} ∈ ℝ^{N}, where *α* ∈ {f1, f2, hub, s1, s2}. The “frequency identification” vector is

The frequency is then calculated as with *β* = 100.
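The preprocessing and differentiable frequency readout described above can be sketched as follows. This is an illustration under stated assumptions rather than the study's implementation: the function name `estimate_frequency` is hypothetical, thresholding is taken to mean zeroing samples at or below *v*_{th}, and the sum-of-powers normalization is taken to be the filter bank magnitudes raised to the power *β* and normalized to sum to one, with the frequency read out as the weighted sum over Φ.

```python
import numpy as np

def estimate_frequency(x, dt, freqs, beta=100.0, window=20, v_th=None):
    """Soft-argmax frequency estimate from a complex exponential filter bank."""
    x = np.asarray(x, dtype=float)
    if v_th is not None:
        x = np.where(x > v_th, x, 0.0)               # zero out subthreshold activity
    # Low-pass filter with a moving average, then mean-center the signal.
    x = np.convolve(x, np.ones(window) / window, mode="same")
    x = x - x.mean()
    t = np.arange(x.shape[0]) * dt
    bank = np.exp(-2j * np.pi * np.outer(freqs, t))  # one complex exponential per row
    mag = np.abs(bank @ x)                           # filter bank response magnitudes
    # Sum-of-powers normalization, computed in log space to avoid overflow.
    log_p = beta * np.log(mag + 1e-12)
    weights = np.exp(log_p - log_p.max())
    weights /= weights.sum()
    return float(weights @ freqs)

freqs = np.linspace(0.0, 1.0, 101)                   # 0.0 to 1.0 Hz in 0.01 Hz steps
dt = 0.025                                           # 25 ms time step, in seconds
t = np.arange(2000) * dt
freq_hat = estimate_frequency(np.sin(2.0 * np.pi * 0.5 * t), dt, freqs)
```

With a large exponent such as *β* = 100, the weights concentrate on the strongest filter, so the readout behaves like an argmax while remaining differentiable with respect to the signal.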

Network syncing, like all other emergent properties in this work, is defined by the emergent property statistics and values. The emergent property statistics are the first and second moments of the firing frequencies. The first moments were set to 0.53 Hz, and the second moments were set to 0.025 Hz^{2}:
for the EPI distribution shown in Fig. 1B. Throughout optimization, the augmented Lagrangian parameters *η* and *c* were updated after each epoch of 2,500 iterations (see Section 5.1.2). The optimization converged after four epochs (Fig. S4).

For EPI in Fig 2C, we used a real NVP architecture with four masks and two layers of 10 units per mask, and batch normalization momentum of 0.99 mapped onto a support of *z* = [*g*_{el}, *g*_{synA}] ∈ [4, 8] × [0, 4]. We used an augmented Lagrangian coefficient of *c*_{0} = 10^{2}, a batch size *n* = 300, set *ν* = 0.1, and initialized *q*_{θ}(*z*) to produce an isotropic Gaussian with mean *μ*_{init} = [6, 2] with standard deviation *σ*_{init} = 0.5.

We calculated the Hessian at the mode of the inferred EPI distribution. The Hessian of a probability model is the second order gradient of the log probability density log *q*_{θ}(*z*) with respect to the parameters *z*. With EPI, we can examine the Hessian, which is analytically available throughout the distribution, to indicate the dimensions of parameter space that are sensitive (high magnitude eigenvalue) and those that are degenerate (low magnitude eigenvalue) with respect to the emergent property produced. In Figure 1B, the eigenvectors of the Hessian *v*_{1} and *v*_{2} are shown evaluated at the mode of the distribution. The length of the arrows is inversely proportional to the square root of the absolute value of their eigenvalues *λ*_{1} = −10.8 and *λ*_{2} = −2.27. We quantitatively measured the sensitivity of the model with respect to network syncing along the eigenvectors of the Hessian (Fig. 1B, inset). Sensitivity was measured as the slope coefficient of a linear regression fit to network syncing error (the sum of squared differences of each neuron’s frequency from 0.53 Hz) as a function of parametric perturbation magnitude (maximum 0.25) away from the mode along both orientations indicated by the eigenvector, with 100 equally spaced samples. The sensitivity slope coefficient of eigenvector *v*_{1} with respect to network syncing was significant (*β* = 4.82 × 10^{−2}, *p* < 10^{−4}). In contrast, eigenvector *v*_{2} did not identify a dimension of parameter space significantly sensitive to network syncing (*β* = 8.65 × 10^{−4} with *p* = .67). These sensitivities were compared to all other dimensions of parameter space (100 equally spaced angles from 0 to *π*), revealing that the Hessian eigenvectors indeed identified the directions of greatest sensitivity and degeneracy (Fig. 1B, inset). The contours of Figure 1 were calculated as error in *T* (*x*) from *μ* in both the first and second moment emergent property statistics.
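The eigen-analysis of the Hessian can be illustrated on a toy density. In this sketch, a Gaussian log density with a hypothetical precision matrix stands in for log *q*_{θ}(*z*), and finite differences stand in for the analytic Hessian available from the flow; both substitutions, and the function names, are assumptions made to keep the example self-contained.

```python
import numpy as np

def hessian_fd(log_q, z0, eps=1e-3):
    """Central finite-difference Hessian of a scalar log density at z0."""
    d = z0.shape[0]
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d)
            ej = np.zeros(d)
            ei[i] = eps
            ej[j] = eps
            H[i, j] = (log_q(z0 + ei + ej) - log_q(z0 + ei - ej)
                       - log_q(z0 - ei + ej) + log_q(z0 - ei - ej)) / (4.0 * eps ** 2)
    return 0.5 * (H + H.T)                      # symmetrize against rounding error

def sensitivity_axes(H):
    """Eigenpairs sorted from most sensitive to most degenerate direction."""
    evals, evecs = np.linalg.eigh(H)
    order = np.argsort(-np.abs(evals))
    evals, evecs = evals[order], evecs[:, order]
    arrow_len = 1.0 / np.sqrt(np.abs(evals))    # arrow length ~ 1 / sqrt(|eigenvalue|)
    return evals, evecs, arrow_len

P = np.array([[4.0, 1.0], [1.0, 2.0]])          # hypothetical precision matrix
mode = np.array([6.0, 2.0])                     # hypothetical mode location

def toy_log_q(z):
    diff = z - mode
    return -0.5 * diff @ P @ diff

H = hessian_fd(toy_log_q, mode)
evals, evecs, arrow_len = sensitivity_axes(H)
```

The first column of `evecs` is the direction along which the log density falls off fastest (the sensitive dimension), and the last column is the direction along which it falls off slowest (the degenerate dimension).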

#### 5.2.2 Primary visual cortex

The dynamics of each neural population's average rate *x* = [*x*_{E}, *x*_{P}, *x*_{S}, *x*_{V}]^{⊤} are given by:

By consolidating information from many experimental datasets, Billeh et al. [48] produce estimates of the synaptic strength (in mV) and connection probability

Multiplying these connection probabilities and synaptic efficacies gives us an effective connectivity matrix:

Theoretical work on these systems considers a subset of the effective connectivities [24, 45, 46]

Consistent with this work, we only keep the entries of *W*_{full} corresponding to parameters in Equation 61.

We look at how this four-dimensional nonlinear dynamical model of V1 responds to different inputs, and compare the predictions of the linear response to the approximate posteriors obtained through EPI. The input to the system is the sum of a baseline input *b* = [1, 1, 1, 1]^{⊤} and a differential input *dh*:

All simulations of this system had *T* = 100 time points, a time step *dt* = 5ms, and time constant *τ* = 20ms. The system was initialized to a random draw .

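We can describe the dynamics of this system more generally by

$$\tau \frac{dx_i}{dt} = -x_i + f(u_i)$$

where the input to each neuron is

$$u_i = \sum_j W_{ij} x_j + h_i$$

Let *F*_{ij} = *γ*_{i}*δ*(*i, j*), where *γ*_{i} = *f*′(*u*_{i}). Then, the linear response is

$$\frac{dx_{ss}}{dh} = F\left(W \frac{dx_{ss}}{dh} + I\right)$$

which is calculable by

$$\frac{dx_{ss}}{dh} = \left(F^{-1} - W\right)^{-1}$$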

This calculation is used to produce the magenta lines in Figure 2C, which show the linearly predicted inputs that generate a response from two standard deviations (of ) below and above *y*.
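The linear response calculation can be checked numerically on a toy network. The sketch below assumes rate dynamics of the form *τ* d*x*/d*t* = −*x* + *f*(*Wx* + *h*), uses a softplus nonlinearity and a small hypothetical weight matrix in place of the V1 model's actual *f* and *W*, and compares (*F*^{−1} − *W*)^{−1} with finite differences of the simulated steady state.

```python
import numpy as np

def f(u):
    """Hypothetical smooth rate nonlinearity (softplus), not the V1 model's f."""
    return np.log1p(np.exp(u))

def f_prime(u):
    return 1.0 / (1.0 + np.exp(-u))

def steady_state(W, h, tau=0.02, dt=0.005, steps=2000):
    """Euler-integrate tau * dx/dt = -x + f(W x + h) until it settles."""
    x = np.zeros(h.shape[0])
    for _ in range(steps):
        x = x + (dt / tau) * (-x + f(W @ x + h))
    return x

def linear_response(W, h):
    """dx_ss/dh = (F^{-1} - W)^{-1}, with F = diag(f'(u)) at the steady state."""
    x_ss = steady_state(W, h)
    gamma = f_prime(W @ x_ss + h)
    return np.linalg.inv(np.diag(1.0 / gamma) - W)

def finite_difference_response(W, h, eps=1e-5):
    """Column j is the change in steady state per unit change in input h_j."""
    d = h.shape[0]
    J = np.zeros((d, d))
    for j in range(d):
        dh = np.zeros(d)
        dh[j] = eps
        J[:, j] = (steady_state(W, h + dh) - steady_state(W, h - dh)) / (2.0 * eps)
    return J

# Hypothetical weak connectivity among E, P, S, V populations (illustrative signs only).
W = np.array([[0.3, -0.2, -0.1, 0.0],
              [0.3, -0.2, -0.1, 0.0],
              [0.2, 0.0, 0.0, -0.1],
              [0.2, 0.0, -0.1, 0.0]])
h = np.ones(4)

analytic = linear_response(W, h)
numeric = finite_difference_response(W, h)
```

The agreement holds only for small input changes around the chosen baseline; for larger differential inputs the nonlinear steady state departs from this prediction, which is the regime the EPI analysis is meant to probe.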

The emergent property we considered was the first and second moments of the change in steady state rate *dx*_{ss} between the baseline input *h* = *b* and *h* = *b* + *dh*. We use the following notation to indicate that the emergent property statistics were set to the following values:

In the final analysis for this model, we sweep the input one neuron at a time away from the mode of each inferred distribution . The differential responses *δx*_{α,ss} are examined at perturbed inputs where is a unit vector in the dimension of *α* and *δx* is evaluated at 101 equally spaced samples of *δh*_{α} from −15 to 15.

We measured the linear regression slope between neuron-types of *δx* and *δh* to confirm the hypotheses H1-H3 (H4 is simply observing the nonmonotonicity) and report the p values for tests of non-zero slope.

H1: the neuron-type responses are sensitive to their direct inputs. E-population: *β* = 1.62, *p* < 10^{−4} (Fig. 3A black); P-population: *β* = 1.06, *p* < 10^{−4} (Fig. 3B blue); S-population: *β* = 6.80, *p* < 10^{−4} (Fig. 3C red); V-population: *β* = 6.41, *p* < 10^{−4} (Fig. 3D green).

H2: the E-population (*β* = 0, *p* = 1) and P-population (*β* = 0, *p* = 1) are not affected by *δh*_{V} (Fig. 3A green, 3B green).

H3: the S-population is not affected by *δh*_{P} (*β* = 0, *p* = 1) (Fig. 3C blue).
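The slope tests above can be sketched as an ordinary least-squares fit over the 101-point input sweep. The response curves below are synthetic placeholders (a noisy linear response and an exactly flat one), not model output, and `regression_slope` is an illustrative helper. It returns the slope and its t-statistic; the p-value for a non-zero slope would follow from a Student-t distribution with *n* − 2 degrees of freedom.

```python
import numpy as np

def regression_slope(x, y):
    """OLS slope of y on x and the t-statistic for H0: slope = 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    sxx = np.sum(xc ** 2)
    beta = np.sum(xc * (y - y.mean())) / sxx
    resid = (y - y.mean()) - beta * xc
    se = np.sqrt(np.sum(resid ** 2) / (x.size - 2) / sxx)
    if se > 0:
        t_stat = beta / se
    else:
        # Perfect fit: a flat response gives t = 0, otherwise the t is unbounded.
        t_stat = 0.0 if beta == 0 else np.sign(beta) * np.inf
    return beta, t_stat

rng = np.random.default_rng(0)
dh = np.linspace(-15.0, 15.0, 101)            # sweep of the differential input

# Synthetic placeholder responses (NOT simulated from the V1 model).
dx_direct = 1.5 * dh + rng.normal(0.0, 1.0, size=dh.size)
dx_unaffected = np.zeros_like(dh)

beta_direct, t_direct = regression_slope(dh, dx_direct)
beta_null, t_null = regression_slope(dh, dx_unaffected)
```

A response that is identically zero yields *β* = 0 and a t-statistic of 0, which corresponds to the *p* = 1 entries reported for H2 and H3.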

For each with *α* ∈ {*E, P, S, V*} and *y* ∈ {0.1, 0.5}, we ran EPI using a real NVP architecture of four masks, each with two hidden layers of 10 units, mapped to a support of *z*_{i} ∈ [−5, 5] with no batch normalization. We used an augmented Lagrangian coefficient of *c*_{0} = 10^{5}, a batch size of *n* = 1000, and *ν* = 0.5. The EPI distributions shown in Fig. 2 are the converged distributions with maximum entropy across random seeds.

We set the parameters of the Gaussian initialization, *μ*_{init} and Σ_{init}, to the mean and covariance of random samples that produced an emergent property statistic *dx*_{α,ss} within a bound *ϵ* of the emergent property value *y*. We set *ϵ* = 0.01, which is one standard deviation of the emergent property statistic given the variance emergent property value of 0.01^{2}.
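This initialization amounts to a simple accept/reject pass followed by moment estimation. The sketch below assumes uniform proposals on the support [−5, 5] and uses a hypothetical linear function `toy_statistic` in place of the simulated steady-state response; only the filtering and the mean/covariance step reflect the procedure described.

```python
import numpy as np

def gaussian_init_from_samples(z_samples, stats, y, eps):
    """Mean and covariance of the samples whose statistic lies within eps of y."""
    keep = np.abs(stats - y) < eps
    z_keep = z_samples[keep]
    mu_init = z_keep.mean(axis=0)
    sigma_init = np.cov(z_keep, rowvar=False)
    return mu_init, sigma_init, int(keep.sum())

def toy_statistic(z):
    """Hypothetical stand-in for the emergent property statistic dx_ss."""
    return 0.1 * z[:, 0] - 0.05 * z[:, 1]

rng = np.random.default_rng(0)
z_samples = rng.uniform(-5.0, 5.0, size=(20000, 2))  # assumed proposal distribution
stats = toy_statistic(z_samples)

y, eps = 0.1, 0.01
mu_init, sigma_init, n_accepted = gaussian_init_from_samples(z_samples, stats, y, eps)
```

Starting the flow from this Gaussian places the initial distribution near parameters that already approximately produce the emergent property.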

#### 5.2.3 Superior colliculus

In the model of Duan et al. [25], there are four total units: two in each hemisphere corresponding to the Pro/Contra and Anti/Ipsi populations. They are denoted as left Pro (LP), left Anti (LA), right Pro (RP) and right Anti (RA). Each unit has an activity (*x*_{α}) and internal variable (*u*_{α}) related by
where *α* ∈ {*LP, LA, RA, RP*}, and *ϵ* = 0.05 and *ζ* = 0.5 control the position and shape of the nonlinearity, respectively.

We order the elements of *x* and *u* in the following manner

The internal variables follow dynamics:
with time constant *τ* = 0.09 s and Gaussian noise *σdB* controlled by the magnitude of *σ* = 1.0. The weight matrix has 8 parameters *sW*_{P}, *sW*_{A}, *vW*_{PA}, *vW*_{AP}, *hW*_{P}, *hW*_{A}, *dW*_{PA}, and *dW*_{AP} (Fig. 4B):

The system receives five inputs throughout each trial, which has a total length of 1.8s.

There are rule-based inputs depending on the condition, a choice-period input, and an input to the right or left-side depending on where the light stimulus is delivered.

The input parameterization was fixed to *I*_{P,rule} = 10, *I*_{A,rule} = 10, *I*_{choice} = 2, and *I*_{light} = 1.
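To make the structure of this model concrete, the sketch below builds a hemisphere-symmetric weight matrix from the 8 parameters and runs one noisy trial. Much of it is assumed rather than taken from the text: the unit ordering [LP, LA, RP, RA], the placement of the self (s), vertical (v), horizontal (h) and diagonal (d) weights, the sigmoid *x* = ½ tanh((*u* − *ϵ*)/*ζ*) + ½, the stochastic dynamics *τ* d*u* = (−*u* + *Wx* + *I*) d*t* + *σ* d*B*, the step size, and a constant input in place of the five time-windowed task inputs. All numerical weight values are made up.

```python
import numpy as np

def sc_weight_matrix(sW_P, sW_A, vW_PA, vW_AP, hW_P, hW_A, dW_PA, dW_AP):
    """Assumed layout for units ordered [LP, LA, RP, RA]."""
    return np.array([
        [sW_P,  vW_PA, hW_P,  dW_PA],
        [vW_AP, sW_A,  dW_AP, hW_A],
        [hW_P,  dW_PA, sW_P,  vW_PA],
        [dW_AP, hW_A,  vW_AP, sW_A],
    ])

def activity(u, eps=0.05, zeta=0.5):
    """Assumed sigmoid mapping internal variable u to activity in (0, 1)."""
    return 0.5 * np.tanh((u - eps) / zeta) + 0.5

def simulate_trial(W, I, rng, T=1.8, dt=0.01, tau=0.09, sigma=1.0):
    """Euler-Maruyama integration of tau du = (-u + W x + I) dt + sigma dB."""
    n_steps = int(round(T / dt))
    u = np.zeros(4)
    for _ in range(n_steps):
        x = activity(u)
        noise = rng.standard_normal(4)
        u = u + (dt / tau) * (-u + W @ x + I) + (sigma / tau) * np.sqrt(dt) * noise
    return activity(u)

# Made-up parameter values, for illustration only.
W = sc_weight_matrix(sW_P=1.0, sW_A=1.0, vW_PA=-0.5, vW_AP=-0.5,
                     hW_P=-0.5, hW_A=-0.5, dW_PA=0.0, dW_AP=0.0)
I_const = np.array([1.0, 0.0, 0.0, 0.0])  # placeholder for the task inputs
rng = np.random.default_rng(0)
x_final = simulate_trial(W, I_const, rng)
```

Because the matrix is unchanged when the left and right units are swapped, statistics measured in the Left stimulus condition carry over to the Right condition.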

To produce an accuracy rate of *p*_{LP} in the Left, Pro condition, let be the empirical average steady state response (final *x*_{LP} at end of task) over M=500 Gaussian noise draws for a given SC model parameterization *z*_{i}:
where stimulus *s* ∈ {*L, R*}, cue *c* ∈ {*P, A*}, and *σdB*_{j} is the Gaussian noise on trial *j*. As with the V1 model, we only consider steady state responses of *x*, so *x*_{α} is used from here on to denote the steady state activity at the end of the trial. For the first emergent property statistic, the average over EPI samples (from *q*_{θ}(*z*)) is set to the desired value *p*_{LP}:

For the next emergent property statistic, we ask that the variance of the steady state responses across Gaussian draws, is the Bernoulli variance for the empirical rate : where the Bernoulli variance error for the Pro task, left condition is

We have an additional constraint that the Pro neuron on the opposite hemisphere should have the opposite value (0 and 1). We can enforce this with another constraint:
where the distance between Pro neuron steady states *d*_{P} in the Pro condition is

The emergent property statistics only need to be measured during the Left stimulus condition of the Pro and Anti tasks, since the network is symmetrically parameterized. In total, the emergent property of rapid task switching at accuracy level *p* was defined as

Since the maximum variance of a random variable bounded from 0 to 1 is the Bernoulli variance , and the maximum squared difference between two variables bounded from 0 to 1 is 1, we do not need to control the second moment of these test statistics. These variables are dynamical system states and can only exponentially decay (or saturate) to 0 (or 1), so the Bernoulli variance error and squared difference constraints cannot be satisfied exactly in simulation. This is important to keep in mind when evaluating the convergence criteria. Instead of using our usual hypothesis testing criteria for convergence to the emergent property, we set a slack variable threshold of 0.05 only for these technically infeasible emergent property values.
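The statistics discussed above can be written compactly for one parameterization. The sketch assumes the Bernoulli variance error is the variance of *x*_{LP} across noise draws minus *p̂*(1 − *p̂*), and that *d*_{P} is the mean squared difference between the two Pro units; the responses are synthetic, not simulated from the SC model.

```python
import numpy as np

def task_statistics(x_LP, x_RP):
    """Empirical rate, Bernoulli variance error, and Pro-unit distance
    across noise draws for a single parameterization (assumed definitions)."""
    p_hat = x_LP.mean()
    bern_var_err = x_LP.var() - p_hat * (1.0 - p_hat)
    d_P = np.mean((x_LP - x_RP) ** 2)
    return p_hat, bern_var_err, d_P

rng = np.random.default_rng(0)
M = 500  # number of Gaussian noise draws, as in the text

# Case 1: idealised binary responses at rate 0.8 with opposite Pro units.
x_LP_binary = (rng.random(M) < 0.8).astype(float)
x_RP_binary = 1.0 - x_LP_binary
p_bin, err_bin, d_bin = task_statistics(x_LP_binary, x_RP_binary)

# Case 2: a graded response stuck at 0.8 on every trial.
x_LP_graded = np.full(M, 0.8)
x_RP_graded = np.full(M, 0.2)
p_grad, err_grad, d_grad = task_statistics(x_LP_graded, x_RP_graded)
```

Binary responses give a variance error of zero and a distance of one, the upper bounds mentioned above. The graded case matches the target rate in its mean but has a negative variance error, so the variance statistic is what distinguishes trial-to-trial switching from a constant intermediate response.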

Using EPI to learn distributions of dynamical systems producing Bernoulli responses at a given rate (with small variance around that rate) was more challenging than expected. There is a pathology in this optimization setup, where the learned distribution of weights is bimodal, attributing a fraction *p* of the samples to an expansive mode (which always sends *x*_{LP} to 1), and a fraction 1 − *p* to a decaying mode (which always sends *x*_{LP} to 0). This pathology was avoided using an inequality constraint prohibiting parameter samples that resulted in low variance of responses across noise.

For each accuracy level *p*, we ran EPI for 10 different random seeds using an architecture of 10 planar flows with a support of *z* ∈ ℝ^{8}. We used an augmented Lagrangian coefficient of *c*_{0} = 10^{2}, a batch size *n* = 300, and set *ν* = 0.5, and initialized *q*_{θ}(*z*) to produce an isotropic Gaussian of zero mean with standard deviation *σ*_{init} = 1. The EPI distributions shown in Fig. 4 are the converged distributions with maximum entropy across random seeds.

We report significant correlations *r* and their p-values from Figure 4E in Table 1. Correlations were measured from 5,000 samples of and p-values are reported for one-tailed tests, since we hypothesized a positive correlation between task accuracies *p*_{P} or *p*_{A} and *λ*_{task}, and a negative correlation between task accuracies *p*_{P} and *p*_{A} and *λ*_{side}.

#### 5.2.4 Rank-1 RNN

Extensive research on random fully-connected recurrent neural networks has resulted in foundational theories of their activity [3, 73]. Furthermore, independent research on training these models to perform computations suggests that learning occurs through low-rank perturbations to the connectivity (e.g. [74, 75]). Recent theoretical work extends theory for random neural networks [3] to those with added low-rank structure [26]. In Section 3.5, we used this theory to enable EPI on RNN parameters conditioned on the emergent property of task execution.

Such RNNs have the following dynamics:
where *x* is network activity, *W* is the connectivity weight matrix, *ϕ*(·) = tanh(·) is the input-output function, and *h* is the input to the system. In a rank-1 RNN (which was sufficiently complex for the Gaussian posterior conditioning task), *W* is the sum of a random component with strength *g* and a structured component determined by the outer product of vectors *m* and *n*:
where , and the entries of *m* and *n* are distributed as and . For EPI, we consider *z* = [*g, M*_{m}, *M*_{n}], which are the parameters governing the connectivity properties of the RNN.
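A sketch of sampling and simulating such a network is given below. It assumes the conventions of the low-rank RNN theory cited in the text: dynamics d*x*/d*t* = −*x* + *Wϕ*(*x*) + *h*, random entries with variance 1/*N*, a structured term *mn*^{⊤}/*N*, and unit-variance Gaussian entries of *m* and *n* with means *M*_{m} and *M*_{n}. The network size, step size and parameter values are illustrative choices.

```python
import numpy as np

def sample_rank1_connectivity(g, M_m, M_n, N, rng):
    """W = g * chi + m n^T / N under the assumed distributions."""
    chi = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))  # variance 1/N
    m = rng.normal(M_m, 1.0, size=N)
    n = rng.normal(M_n, 1.0, size=N)
    structured = np.outer(m, n) / N
    return g * chi + structured, structured, m, n

def simulate_rnn(W, h, n_steps=500, dt=0.1):
    """Forward-Euler integration of dx/dt = -x + W tanh(x) + h."""
    x = np.zeros(W.shape[0])
    for _ in range(n_steps):
        x = x + dt * (-x + W @ np.tanh(x) + h)
    return x

rng = np.random.default_rng(0)
N = 200                                  # illustrative network size
g, M_m, M_n = 0.5, 1.0, 1.0              # example value of z = [g, M_m, M_n]
W, structured, m, n = sample_rank1_connectivity(g, M_m, M_n, N, rng)

h = np.zeros(N)                          # no external input in this example
x_final = simulate_rnn(W, h)
kappa_m = np.mean(m * np.tanh(x_final))  # overlap of network output with m
```

Only three numbers parameterize the whole *N* × *N* matrix, which is what allows EPI to operate on *z* while the theory summarizes the response of any network sampled this way.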

From such a parameterization *z*, the theory of Mastrogiuseppe et al. produces solutions for variables describing the low dimensional response properties of the RNN. These “dynamic mean field” (DMF) variables (e.g. the activity along a vector *κ*_{v}, the total variance Δ_{0}, structured variance Δ_{∞}, and the chaotic variance Δ_{T}) are derived to be functions of one another and connectivity parameters *z*. The collection of these derived functions results in a system of equations, whose solution must be obtained through a nonlinear system of equations solver. The iterative steps of this system of equations solver are differentiable, so we take gradients through this solve process. The DMF variables provide task-relevant information about the RNN’s response to task inputs.

In the Gaussian posterior conditioning example, *κ*_{r} and Δ_{T} are DMF variables used as task-relevant emergent property statistics *μ*_{post} and . Specifically, we solve for the DMF variables *κ*_{r}, *κ*_{n}, Δ_{0} and Δ_{∞}, where the readout is nominally chosen to point in the unit orthant *r* = [1, …, 1]^{⊤}. The consistency equations for these variables in the presence of a constant input *h* = *yr* − (*n* − *M*_{n}) can be derived following [26]:
where here *z* is a Gaussian integration variable. We can solve these equations by simulating the following Langevin dynamical system to a steady state:

Then, the chaotic variance, which is necessary for the Gaussian posterior conditioning example, is simply calculated via Δ_{T} = Δ_{0} − Δ_{∞}.
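The solve-by-relaxation idea can be illustrated on a toy self-consistency equation. The equation below, *κ* = *c*⟨tanh(*κ* + *sz*)⟩ + *h* with *z* a standard Gaussian, is a hypothetical one-variable example and not one of the DMF equations; the constants `c`, `s`, `h` are made up. Gaussian averages are computed with Gauss-Hermite quadrature, and the fixed point is found by integrating d*κ*/d*t* = −*κ* + *G*(*κ*) to steady state.

```python
import numpy as np

# Probabilists' Gauss-Hermite rule: weight function exp(-z^2 / 2).
nodes, weights = np.polynomial.hermite_e.hermegauss(50)
weights = weights / np.sqrt(2.0 * np.pi)   # normalise to a standard Gaussian

def gauss_avg(fn):
    """Approximate E[fn(z)] for z ~ N(0, 1)."""
    return np.sum(weights * fn(nodes))

def relax_to_fixed_point(G, x0, dt=0.1, n_steps=1000):
    """Integrate dx/dt = -x + G(x) with forward Euler until steady state."""
    x = float(x0)
    for _ in range(n_steps):
        x = x + dt * (-x + G(x))
    return x

# Toy consistency equation (hypothetical constants).
c, s, h = 0.5, 1.0, 0.3

def G(kappa):
    return c * gauss_avg(lambda z: np.tanh(kappa + s * z)) + h

kappa_star = relax_to_fixed_point(G, x0=0.0)
residual = abs(kappa_star - G(kappa_star))
```

Every step is a composition of smooth array operations, so the same loop written in an automatic differentiation framework would yield gradients of the solution with respect to the constants, which is the property the text relies on.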

We ran EPI using a real NVP architecture of two masks and two layers per mask with 10 units mapped to a support of *z* = [*g, M*_{m}, *M*_{n}] ∈ [0, 5] × [−5, 5] × [−5, 5] with no batch normalization. We used an augmented Lagrangian coefficient of *c*_{0} = 1, a batch size *n* = 300, set *ν* = 0.15, and initialized *q*_{θ}(*z*) to produce an isotropic Gaussian with mean *μ*_{init} = [2.5, 0, 0] and standard deviation *σ*_{init} = 2.0. The EPI distribution shown in Fig. 5 is the converged distribution with maximum entropy across five random seeds.

To examine the effect of the product *M*_{m}*M*_{n} on the posterior mean *μ*_{post}, we took perturbations in *M*_{m}*M*_{n} away from two representative parameters *z*_{1} and *z*_{2} in 21 equally spaced increments from −1 to 1. For each perturbation, we sampled 10 2,000-neuron RNNs and measured the calculated posterior means. In Fig. 5D, we plot the product *M*_{m}*M*_{n} in the perturbation versus the average posterior mean across 10 network realizations with standard error bars. The correlation between perturbation product *M*_{m}*M*_{n} and *μ*_{post} was measured over all simulations. For perturbations away from *z*_{1} the correlation was 0.995 with *p* < 10^{−4}, and for perturbations away from *z*_{2} the correlation was 0.983 with *p* < 10^{−4}.
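The bookkeeping for this perturbation analysis (per-perturbation mean, standard error over realizations, and a correlation over all simulations) is sketched below. The function `toy_posterior_mean` is a hypothetical noisy linear stand-in for sampling an RNN and computing its posterior mean, so the numbers it produces carry no meaning beyond illustrating the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
perturbations = np.linspace(-1.0, 1.0, 21)  # 21 equally spaced increments
n_networks = 10                             # realizations per perturbation

def toy_posterior_mean(product, rng):
    """Hypothetical stand-in for one sampled network's posterior mean."""
    return 0.8 * product + rng.normal(0.0, 0.05)

# Rows index perturbations, columns index network realizations.
mu_post = np.array([[toy_posterior_mean(p, rng) for _ in range(n_networks)]
                    for p in perturbations])

mean_mu = mu_post.mean(axis=1)
sem_mu = mu_post.std(axis=1, ddof=1) / np.sqrt(n_networks)  # standard error

# Correlation over all individual simulations, not over the averages.
products_all = np.repeat(perturbations, n_networks)
r = np.corrcoef(products_all, mu_post.ravel())[0, 1]
```

Computing the correlation over all 210 simulations rather than the 21 averages keeps the realization-to-realization variability in the estimate.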

In addition to the Gaussian posterior conditioning example in Section 3.5, we modeled two tasks from Mastrogiuseppe et al.: noisy detection and context-dependent discrimination. We used the same theoretical equations and task setups described in their study.

## Acknowledgements

This work was funded by NSF Graduate Research Fellowship, DGE-1644869, McKnight Endowment Fund, NIH NINDS 5R01NS100066, Simons Foundation 542963, NSF NeuroNex Award, DBI-1707398, The Gatsby Charitable Foundation, Simons Collaboration on the Global Brain Postdoctoral Fellowship, Chinese Postdoctoral Science Foundation, and International Exchange Program Fellowship. Helpful conversations were had with Francesca Mastrogiuseppe, Srdjan Ostojic, James Fitzgerald, Stephen Baccus, Dhruva Raman, Liam Paninski, and Larry Abbott.

## References

- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].
- [58].
- [59].
- [60].
- [61].
- [62].
- [63].
- [64].
- [65].
- [66].
- [67].
- [68].
- [69].
- [70].
- [71].
- [72].
- [73].
- [74].
- [75].