## Abstract

Two facts about cortex are widely accepted: neuronal responses show large spiking variability with near Poisson statistics and cortical circuits feature abundant recurrent connections between neurons. How these spiking and circuit properties combine to support sensory representation and information processing is not well understood. We build a theoretical framework showing that these two ubiquitous features of cortex combine to produce optimal sampling-based Bayesian inference. Recurrent connections store an internal model of the external world, and Poissonian variability of spike responses drives flexible sampling from the posterior stimulus distributions obtained by combining feedforward and recurrent neuronal inputs. We illustrate how this framework for sampling-based inference can be used by cortex to represent latent multivariate stimuli organized either hierarchically or in parallel. A neural signature of such network sampling is internally generated differential correlations whose amplitude is determined by the prior stored in the circuit, which provides an experimentally testable prediction for our framework.

## Introduction

In an uncertain and changing world, it is imperative for the brain to reliably represent and interpret external stimuli. The cortex is essential for the representation of the sensory world, and it is believed that populations of neurons collectively code for richly structured sensory scenes [1]. However, two central characteristics of cortical circuits remain to be properly integrated into population coding frameworks. First, neuronal activity in sensory cortices is often noisy, showing significant variability of spiking responses evoked by the same stimulus [2, 3]. In many traditional coding frameworks such spiking variability degrades the representation of stimuli by cortical activity [4]. Why cortical responses display large spiking variability while isolated cortical neurons can respond reliably remains a mystery. Second, the primary source of synaptic inputs to cortical neurons does not come from upstream centers which convey sensory signals, but rather from recurrent pathways between cortical neurons [5–7]. While such recurrent connections are often organized about a stimulus feature axis [8, 9], it is not obvious whether or how their presence improves overall representation. We propose a biologically motivated inference coding scheme where these two ubiquitous cortical circuit features, variability in spike generation and recurrent connections, together support a probabilistic representation of stimuli in rich sensory scenes.

Numerous studies have framed sensory processing in the cortex in terms of Bayesian inference (e.g., [10–16]). Specifically, the ‘Bayesian brain’ hypothesis posits that sensory cortex infers and synthesizes a posterior distribution of the latent stimuli which describes the probability of possible stimuli that could have given rise to the sensory inputs. Performing Bayesian inference requires cortex to store an internal model that represents how sensory inputs and external stimuli are generated. Once a sensory input is received, cortical dynamics inverts this internal model in a process termed ‘analysis-by-synthesis’ [12], and represents the posterior distributively across neurons and/or across time [15, 16]. In this study, we propose that recurrent connections in cortical circuits store the prior of latent stimuli to produce the posterior distribution when combined with evidence from sensory inputs. Moreover, we posit that Poisson spiking variability provides a source of fluctuations needed for generating random samples from the inferred posterior.

To test these hypotheses we consider a recurrent circuit model where neurons receive stochastic feedforward inputs which carry information about the external world, and respond with Poisson-distributed spiking activity. We find that such Poissonian spiking provides the variability that allows the network to generate samples from posterior stimulus distributions with differing uncertainties. We use this sampling framework to illustrate circuit-based Bayesian inference given two distinct generative models of stimuli in the external world: one organized hierarchically with a stimulus variable that depends on a latent context variable, and a second where a pair of latent stimuli are organized in parallel. In both cases a recurrent circuit is able to generate samples from the joint posterior, and infer the values of the latent variables. We show through both analytic derivation and simulations that recurrent connections represent the correlation structure of these models, and the weight of these connections can be tuned to optimally capture the prior distribution of stimuli in the external world. The stronger the correlation between the latent variables, the stronger the recurrent connections need to be for the network to generate samples from the correct posterior distribution.

Finally, a neural signature of this circuit-based sampling mechanism is internally generated population noise correlations aligned with the stimulus response direction, often referred to as “differential correlations” [4, 17]. In our framework, the amplitude of internally generated differential correlations is determined by the recurrent connection strength, which also determines the prior stored by the circuit. Since optimal inference requires a specific magnitude of recurrent connectivity, differential correlations resulting from such recurrent connectivity are a potential signature of optimal coding. This is in contrast to the deleterious impact of externally generated differential correlations. We thus predict that the correlation structure of the external world shapes recurrent wiring in neural circuits, and is reflected in the pattern of differential noise correlations. We use this logic to provide testable predictions from our framework for sampling-based Bayesian inference by recurrent, stochastic cortical circuits.

## Results

### Recurrent circuitry and spiking variability do not improve conventional neural codes

We start with the classic example of a sensory stimulus, *s*, encoded in neuronal population activity, **r**, from which a stimulus estimate *ŝ* can be decoded (Fig. 1A, top) [18]. It is reasonable to expect that neuronal circuitry is adapted to accurately represent ethologically relevant stimuli. However, as we will show next, in simple coding schemes two ubiquitous features of cortical circuits – internal spiking variability and recurrent connectivity – are at best irrelevant to, and in many cases degrade, the accuracy of these representations.

In population coding frameworks stimuli are encoded by a neuronal population with individual neurons tuned to a preferred stimulus value. The preferred values of all neurons cover the whole range of stimuli [18–20] (Fig. 1B, bottom); if *s* ranges over a periodic domain (such as the orientation of a bar in a visual scene, or the direction of an arm reach) then it is commonly assumed that the neurons’ preferred stimuli are distributed on a ring (Fig. 1B, top). To generate neuronal responses from such a population we simulate a network of neurons whose spiking activity, **r**_{t}, at time *t* is Poissonian with instantaneous firing rate *λ*_{t} (Eq. 11). For simplicity we assume linear (or linearized) neuronal transfer and synaptic interactions (Eqs. 10-11), so that the firing rate is a linear function of the feedforward and recurrent inputs. We couple excitatory (E) neurons with similar stimulus preferences more strongly [8, 9] to one another, compared to neuron pairs with dissimilar tuning. In this way the recurrent E connectivity has the same circular symmetry as the stimulus (Fig. 1B). In contrast, connections between inhibitory (I) neurons are unstructured, and inhibitory activity acts to stabilize network activity [21]. A stimulus, e.g. *s* = 0, results in elevated activity of E neurons with the corresponding preference (Fig. S1A). As expected, an increase in the strength of recurrent excitatory connections increases both the firing rates and the trial-to-trial pairwise covariability (i.e. noise correlations) in the responses [2] (Fig. S2A). This canonical network model has been widely used to explain cortical network dynamics and neural coding [21–23].

We use linear Fisher Information (LFI) to quantify the impact of recurrent connectivity and internal spiking variability on the accuracy of the stimulus estimate, *ŝ*_{t}, from the activity vector **r**_{t} (see details in Eq. S39 in Supplemental Information). The inverse of LFI provides a lower bound on the expected square of the difference between the true value, *s*, and the estimate, *ŝ*_{t}, made by a linear decoder [1, 4, 17–19, 24]. In the limit of an infinite number of neurons available to the decoder LFI is unaffected by recurrent connectivity strength, *w*_{E} (Fig. 1D, dashed line). This is because the mean response of the network is linear in its inputs, and an (invertible) linear transformation can neither increase nor decrease LFI (see Eq. S38 in Supplemental Information). For networks with a finite number of neurons, the variability from spike generation is shared between neurons via recurrent interactions. Consequently an increase in coupling strength, *w*_{E}, reduces LFI in finite networks (Fig. 1D, colored lines).
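The invariance of LFI under an invertible linear transformation can be verified directly. The sketch below uses invented toy values (the tuning derivative, covariance, and transform are illustrative, not the paper's network parameters): for LFI defined as J = f′ᵀ Σ⁻¹ f′, mapping responses through an invertible matrix A transforms f′ → Af′ and Σ → AΣAᵀ and leaves J unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 8
fp = rng.normal(size=N)                           # tuning-curve derivative f'(s) (toy)
Sigma = np.diag(rng.uniform(1.0, 2.0, size=N))    # response covariance (toy)

def lfi(fp, Sigma):
    """Linear Fisher information J = fp^T Sigma^{-1} fp."""
    return fp @ np.linalg.solve(Sigma, fp)

A = rng.normal(size=(N, N))                       # invertible linear map, e.g. a
                                                  # linearized recurrent transformation
J_before = lfi(fp, Sigma)
J_after = lfi(A @ fp, A @ Sigma @ A.T)            # LFI of the transformed responses
# J_before and J_after agree up to numerical precision.
```

This is the infinite-network statement; in a finite network shared spiking variability injected by the recurrence is not an invertible transformation of the input noise, which is why LFI decreases with coupling strength in Fig. 1D.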

In sum, recurrent connectivity and spiking variability do not improve, and often degrade, stimulus representation in the network (as measured by LFI). Since synaptic coupling is biologically expensive, a network that most accurately and cheaply represents a stimulus is then one with no recurrent connections (i.e., *w*_{E} = 0) and minimal spiking variability. Nevertheless, connectivity in mammalian cortex is highly recurrent [5–7, 9], and neural responses are highly variable [2, 3]. What is then the purpose of these extensive recurrent connections between cortical neurons, and why are their responses so noisy?

While classical population code theory often explains how to generate point estimates of a stimulus (Fig. 1A), numerous studies suggest that the brain performs Bayesian inference to synthesize and estimate the probability distribution of latent stimuli from sensory inputs (e.g., [10–15, 25, 26]). To compute this posterior a neural circuit needs to combine a stored representation of the prior distribution of the stimulus with the likelihood conveyed by feedforward inputs. We propose that recurrent connectivity can be used to represent the prior and spiking variability can generate samples from this posterior distribution. Before we present our full model we first show how sampling-based inference can be implemented in a population of spiking neurons.

### Internally generated Poisson spiking variability drives sampling-based Bayesian inference

Many studies suggest that neuronal response variability is a signature of sampling in neural circuits (e.g., [16, 27–32]). In these studies the instantaneous population responses, **r**_{t}, represent a sample of a latent stimulus, and the empirical distribution of stimulus samples collected over time is an approximation of the posterior distribution. Furthermore, response variability is typically modeled using a continuous (e.g., Gaussian) distribution [27, 29–33]. However, spike trains from cortical neurons are often Poissonian, and spike counts are discrete [3, 34]. It is unclear if discrete Poissonian variability can generate samples from stimuli with continuous probability distributions (e.g., orientation, moving direction) with the flexibility needed to represent different stimulus uncertainties.

We address this question using a theory based on a simple model network composed of excitatory (E) Poissonian neurons (Eqs. 10-11), and subsequently support our findings by simulating a network containing both E and inhibitory (I) neurons (e.g. Fig. 1B). We start by showing that Poissonian spiking in a population of tuned neurons can drive sampling from a well–defined distribution. We assume that the instantaneous firing rates of a population of E neurons, *λ*_{t}, have a bell-shaped (Gaussian) profile (Fig. 2B), so that for the *j*^{th} neuron, *λ*_{jt} = *R* exp[−(*μ*_{t} − *θ*_{j})²/(2*a*²)] (see Eq. 12 in Methods). Here *θ*_{j} is the preferred stimulus of neuron *j, a* is the width of the tuning curve, and *μ*_{t} is the location of the peak of the firing rate profile, *λ*_{t}, in stimulus space (x-axis in Fig. 2B). Note that the value of *μ*_{t} is arbitrary here, but we will later relate it to the input to the population. The (smooth) Gaussian tuning curves simplify the analysis, but are not essential for the argument. Finally, the preferred stimuli of the E neurons, *θ*_{j}, are uniformly distributed over the stimulus range (Fig. 1B). In each time interval the population activity is given by a vector of independent Poisson random variables, **r**_{t}, with means determined by the instantaneous firing rate vector *λ*_{t} (Fig. 2B-C). At each time, *t*, this spiking activity produces a stimulus sample, *s*_{t}, from the probability distribution determined by the instantaneous firing rates, *λ*_{t} (Fig. 2D, see Methods),

*p*(*s*|*λ*_{t}) ∝ exp[−Λ_{t}(*s* − *μ*_{t})²/2], with Λ_{t} ∝ Σ_{j} *λ*_{jt}.  (1)

With the Gaussian firing rate profile we use here, the stimulus sample, *s*_{t}, can be read out as *s*_{t} = Σ_{j} *θ*_{j}*r*_{jt} / Σ_{j} *r*_{jt} (Eq. 14 and Fig. 2D), which can be thought of as the location of the response, **r**_{t}, in stimulus space (y-axis in Fig. 2C). The collection of stimulus samples across time ({*s*_{t}}_{t=1,…,T}; Fig. 2E) determines the sampling distribution *q*(*s*) = (1/*T*) Σ_{t} *δ*(*s* − *s*_{t}), which approximates the distribution *p*(*s*|*λ*_{t}), i.e., *p*(*s*|*λ*_{t}) ≈ *q*(*s*) [16, 35]. Here *δ*(·) is the Dirac delta function and *T* is the number of samples.
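This sampling mechanism can be simulated directly: draw independent Poisson spike counts from a Gaussian rate profile and read out the location of each response vector in stimulus space. All parameter values below (neuron count, tuning width, peak rate) are invented for illustration and are not the paper's Methods values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters (illustration only)
N = 180                                                        # number of E neurons
theta = np.linspace(-np.pi / 2, np.pi / 2, N, endpoint=False)  # preferred stimuli
a = 0.3                                                        # tuning width
R = 5.0                                                        # peak rate per time bin
mu_t = 0.0                                                     # rate-profile location

lam = R * np.exp(-(theta - mu_t) ** 2 / (2 * a ** 2))  # Gaussian firing-rate profile

T = 5000
samples = np.empty(T)
for t in range(T):
    r = rng.poisson(lam)                         # independent Poisson spike counts
    samples[t] = (theta * r).sum() / r.sum()     # location of r in stimulus space

# The empirical distribution of `samples` is centered on mu_t with a variance
# that shrinks as the total rate lam.sum() grows, i.e. a sampling precision
# proportional to the summed firing rate.
```

Histogramming the collected samples approximates the Gaussian read-out distribution, with a width set by the total spike count rather than by single-neuron variability.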

To use this mechanism to produce samples from the posterior distribution of a stimulus, we must define a generative model for the feedforward inputs evoked by a stimulus. We take the feedforward input to the neural population, **u**^{f}, to be a vector of independent Poisson spike counts with Gaussian tuning over the stimulus, *s*. Following assumptions widely used in previous studies of probabilistic population codes (PPC) [36, 37], we assume that the mean input spike count to the *j*^{th} excitatory neuron in the population is proportional to exp[−(*s* − *θ*_{j})²/(2*a*²)]. A single realization of the input, **u**^{f}, in a time interval encodes the whole likelihood function over the stimulus, *p*(**u**^{f}|*s*) [36]. This likelihood is proportional to a Gaussian due to the Gaussian profile of feedforward input (Eq. 19),

*p*(**u**^{f}|*s*) ∝ exp[−Λ_{f}(*s* − *μ*_{f})²/2].  (2)

Here the likelihood mean, *μ*_{f}, is determined by the location of **u**^{f} in stimulus space, and the precision, Λ_{f}, is proportional to the spike count (or height) of **u**^{f} (Eq. 20). Since a realization of the feedforward input encodes the whole likelihood function, we present a fixed **u**^{f} to the network over time (dropping the time index *t*), and describe how samples from the posterior *p*(*s*|**u**^{f}) are generated by the network.
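A single draw of such a feedforward input can be decoded into the likelihood mean and precision. The sketch below follows the PPC logic described above; the tuning parameters and the `gain` argument are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

N = 180
theta = np.linspace(-np.pi / 2, np.pi / 2, N, endpoint=False)
a = 0.3          # tuning width (hypothetical)
s_true = 0.2     # stimulus generating the input

def decode_likelihood(gain):
    """Draw one Poisson input vector and decode the Gaussian likelihood it encodes."""
    mean_input = gain * np.exp(-(theta - s_true) ** 2 / (2 * a ** 2))
    u = rng.poisson(mean_input)          # single realization of the input u^f
    mu_f = (theta * u).sum() / u.sum()   # likelihood mean: location of u^f
    Lambda_f = u.sum() / a ** 2          # precision grows with total spike count
    return mu_f, Lambda_f

mu_lo, Lambda_lo = decode_likelihood(gain=10.0)
mu_hi, Lambda_hi = decode_likelihood(gain=100.0)
# Increasing the input gain sharpens the decoded likelihood without moving its mean.
```

This is the same manipulation used in the network simulation below: raising the feedforward input rate narrows the likelihood, and the sampling precision should track Λ_{f}.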

A simple example of inference via sampling is provided by a population of E neurons without recurrent connections and instantaneous firing rates equal to the feedforward input, *λ*_{t} = **u**^{f} (Eq. 10), and hence constant in time (Fig. 2A). In this feedforward network Poisson spike generation produces samples from the normalized likelihood, i.e., *q*(*s*) ∝ *p*(**u**^{f}|*s*), and consequently the network represents a uniform stimulus prior (i.e., *p*(*s*) is a constant).

To test our theory, we simulated the response of a network of tuned excitatory (E) and untuned inhibitory (I) neurons (Fig. 2A,C) to a fixed but randomly generated feedforward input (Eq. 18). While the E neurons shared no recurrent connections, the E and I neurons were connected to maintain stable network activity. To confirm that the overall firing rate dictated the sampling variability (Eq. 1), we increased the feedforward input rate, which reduced the width of the likelihood (Eq. 2). As a result, the sampling precision (inverse of the sampling variance) increased and matched the precision of the likelihood (Fig. 2G, H), even as the normalized response variability (measured by the Fano factor) of single neurons remained unchanged.

While the above analysis introduces the key components of a sampling-based theory of inference, stimulus sampling using a feedforward network is unnecessary: A single observation of the response **r** in a deterministic feedforward network (**r** = **u**^{f} after removing spike generation in Eq. 11) would also represent the whole likelihood [36], avoiding the costly process of collecting samples across time. We next consider more interesting cases, and show that spiking variability in recurrent networks can drive sampling from more complex posterior distributions.

### Recurrent cortical circuit samples a hierarchical generative model

Recurrent networks can store a variety of generative model structures; to demonstrate the generality of our sampling framework we provide two example generative models which serve as building blocks for more complex models. We first consider a two-stage hierarchical model of feedforward inputs received by the cortical circuit (Fig. 3A). The first stage of our model consists of a stimulus, *s*, and a context, *z*, both of which are one dimensional for simplicity. The structure of the world is described by the joint distribution, *p*(*s, z*). Using the visual system as motivation, *s*, could be the orientation of the visual texture within a classical receptive field (local information) of a hypercolumn of V1 neurons, while the orientation within a non-classical receptive field of these cells could describe the corresponding context, *z* (Fig. 3A). The likelihood of the stimulus given the context, *p*(*s*|*z*), is Gaussian with mean *z* and precision Λ_{s}. For simplicity, we assume that the context prior, *p*(*z*), is uniform, which implies that the marginal prior of *s*, *p*(*s*), is also uniform (Fig. 3B). This assumption is not essential for our main conclusions but does simplify the analysis. Importantly, the joint prior of stimulus and context, *p*(*s, z*), can have non-trivial structure with the density concentrated around the diagonal *s* = *z* (Fig. 3B). The precision Λ_{s} measures how strongly the context, *z*, and the stimulus, *s*, are related, and thus determines how strongly their joint distribution is concentrated around the diagonal.

The second stage of the generative model describes how the feedforward input depends on the stimulus, *s*; this is identical to our prior treatment (See Eq. 2). Combining these two stages provides a complete description of the generative model for the feedforward input received by neurons in the population,

*p*(**u**^{f}, *s*, *z*) = *p*(**u**^{f}|*s*) *p*(*s*|*z*) *p*(*z*).  (3)

Given this hierarchical model we can show that the joint posterior over stimulus and context features, *p*(*s, z*|**u**^{f}), is a bivariate normal distribution (see Eq. 24), and we next use it to evaluate the accuracy of the sampling distribution.

### Gibbs sampling of the joint stimulus and context posterior

One approach to approximate the joint distribution over stimulus and context is Gibbs sampling [31, 35, 40, 41] which starts with an initial guess for the value of the two latent variables, and proceeds by alternately generating samples of one variable from the distribution conditioned on the value of the second variable. More precisely, to approximate the joint posterior of *s* and *z* (Eq. 3), Gibbs sampling proceeds by generating a sequence of samples, indexed by time *t*, through recursive iteration of the following steps (Fig. 3C and Eq. 25),

*s*_{t} ∼ *p*(*s*|*z*_{t−Δt}, **u**^{f}),  (4a)
*z*_{t} ∼ *p*(*z*|*s*_{t}).  (4b)

Here Δ*t* is the time increment between successive samples. The samples (red dots in Fig. 3D) are generated by alternately fixing the values of the two variables, so that sampling trajectories alternate between horizontal and vertical jumps (cyan lines in Fig. 3D). The empirical distribution of samples, *q*(*s*, *z*|**u**^{f}) = (1/*T*) Σ_{t} *δ*((*s*, *z*)^{T} − (*s*_{t}, *z*_{t})^{T}), with the superscript T denoting vector transpose, approximates the joint posterior *p*(*s, z*|**u**^{f}) (blue contour map in Fig. 3D, Eq. 24) [35]. To approximate *p*(*s*|**u**^{f}), the marginal posterior distribution of *s*, we can use only the stimulus samples, {*s*_{t}}, to obtain the approximating distribution *q*(*s*|**u**^{f}) (compare the two green lines at the margin in Fig. 3D). The same is true for the marginal posterior over *z*.
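The alternating conditional updates can be sketched numerically. Assuming a Gaussian likelihood with mean `mu_f` and precision `Lambda_f`, a Gaussian conditional p(*s*|*z*) with precision `Lambda_s`, and a flat context prior (all parameter values below are invented for illustration), the two Gibbs steps are:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical likelihood and prior parameters
mu_f, Lambda_f = 0.5, 4.0   # p(u^f|s) is proportional to a Gaussian in s
Lambda_s = 2.0              # p(s|z): Gaussian with mean z and precision Lambda_s

T = 20000
s, z = 0.0, 0.0
samples = np.empty((T, 2))
for t in range(T):
    # Sample s given z and the input: product of two Gaussians in s
    prec = Lambda_f + Lambda_s
    mean = (Lambda_f * mu_f + Lambda_s * z) / prec
    s = rng.normal(mean, 1 / np.sqrt(prec))
    # Sample z given s: with a flat prior on z, p(z|s) is Gaussian around s
    z = rng.normal(s, 1 / np.sqrt(Lambda_s))
    samples[t] = (s, z)

# The marginal of s matches the normalized likelihood (mean mu_f, precision
# Lambda_f), while the z samples are broader because they add the conditional
# variance 1/Lambda_s.
```

Because the marginal prior of *s* is uniform in this model, the marginal *s* samples recover the normalized likelihood, while the joint samples carry the diagonal correlation imposed by Λ_{s}.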

### Implementing Gibbs sampling of stimulus and context in a recurrently coupled cortical circuit

An implementation of Gibbs sampling in a recurrent E circuit can be intuitively understood by comparing the recurrent network dynamics (Fig. 4A) with the dynamics described by the Gibbs sampling algorithm (Fig. 3C). In the recurrent network a stimulus sample, *s*_{t}, is represented by the activity of E cells, **r**_{t}, while a context sample, *z*_{t}, is represented by recurrent inputs, **u**^{r}_{t}. To generate correct samples we require that the conditional distribution that is represented by the instantaneous firing rate, *λ*_{t} (Eq. 1), matches the conditional distribution used in the Gibbs sampling algorithm (Eq. 4a), so that *p*(*s*|*λ*_{t}) = *p*(*s*|*z*_{t}, **u**^{f}). Equating the two distributions (see Eqs. 4a and 10) yields the relation,

*λ*_{t} = **u**^{f} + **u**^{r}_{t}.  (5)

This equation holds when two constraints are satisfied: First, the firing rate vector, *λ*_{t}, needs to have a Gaussian profile peaked at the mean of *p*(*s*|*z*_{t}, **u**^{f}) (Eq. 4a). Second, the peak firing rate, *R*, needs to be proportional to the precision of *p*(*s*|*z*_{t}, **u**^{f}), i.e., *R* ∝ Λ (see Fig. 2F-G). In a neural circuit one way for *λ*_{t} to satisfy these constraints is for feedforward inputs, **u**^{f}, and recurrent inputs, **u**^{r}_{t}, to both have Gaussian profiles with the same width, *a*, as that of *λ*_{t} (Eqs. 5 and 12). This is because the sum of two Gaussian-profile inputs with the same width, *a*, gives a firing rate, *λ*_{t}, with the same tuning, as long as the difference of the locations of the two inputs is much smaller than the width, *a*. Our generative model (Eq. 3) produces feedforward input, **u**^{f}, with a Gaussian profile that encodes the likelihood function *p*(**u**^{f}|*s*). The recurrent input, **u**^{r}_{t}, then needs to represent the conditional distribution *p*(*s*|*z*_{t}). Hence, to satisfy Eq. (5) the recurrent input should have the same Gaussian profile as **u**^{f} (Eq. 29), with its location and magnitude determined by the mean and precision of *p*(*s*|*z*_{t}), respectively.

If recurrent interactions are absent (setting **u**^{r}_{t} = 0), then network activity, **r**_{t}, generates samples from the normalized likelihood, *p*(**u**^{f}|*s*), as we showed previously when describing feedforward networks (Fig. 2). When neurons only receive recurrent inputs (setting **u**^{f} = 0), the network generates samples from the conditional distribution *p*(*s*|*z*_{t}). Driven by a sum of recurrent and feedforward inputs the network generates samples from a distribution given by the product of the conditional distributions encoded by both inputs respectively (Fig. 4B-C).

The recurrent weights must be adjusted so that the recurrent input has the appropriate magnitude and width to encode the likelihood *p*(*s*|*z*). To simplify the exposition we first assume that E neurons are only self-connected, so that the width of the recurrent input trivially matches that of the feedforward input (otherwise recurrence will broaden the profile of the firing rate activity *λ*_{t} over the network). To constrain the magnitude of the recurrent weights we require that the sum of the recurrent inputs is proportional to the precision Λ_{s}. Since the recurrent input to neuron *j* is *u*^{r}_{jt} = *w*_{E}*r*_{jt}, and the widths of **u**^{r}_{t} and **r**_{t} are equal, the magnitude of the recurrent weights that result in samples from the correct posterior must satisfy:

*w*_{E}* ∝ Λ_{s}/(Λ_{s} + Λ_{f}),  (6)

where Λ_{s} and Λ_{f} are the precision of likelihood *p*(*s*|*z*) and *p*(**u**^{f}|*s*) respectively (Eq. 3). The optimal recurrent weight, *w*_{E}*, thus encodes the correlation between the stimulus *s* and the context *z*. An increase in correlation between *s* and *z*, resulting in a narrower diagonal band in *p*(*s, z*) (Fig. 3B), requires an increase in the recurrent weight for optimal sampling. When context and stimulus are uncorrelated so that Λ_{s} = 0, the hierarchical generative model (Fig. 3A) is equivalent to the generative model without context (Fig. 2A) and recurrent interactions are not needed for sampling (i.e., *w*_{E}* = 0). Our framework (Eq. 6) thus predicts that optimal Bayesian inference is achieved with recurrent synaptic weights which depend on the correlative structure of the external world. We numerically test this prediction in the next section.

### A stochastic E-I spiking network jointly samples stimulus and context

To confirm the predictions of this analysis, we simulated a full recurrent network consisting of both E and I neurons with Poisson spike train statistics (see details in Eqs. 47-50). The E neurons were synaptically connected to each other (Eq. 49, see Fig. 1A), in contrast to the simple network of self-connected E neurons we described above. While recurrent E to E coupling broadens the tuning of excitatory recurrent input, lateral inhibition can sharpen the Gaussian firing rate profile so that it matches that of the feedforward inputs (as required by Eq. 5).

The activity of the recurrent network in response to a fixed but randomly generated feedforward input (Eq. 3) can be decoded to produce samples from the bivariate posterior distribution of the stimulus and context. As above, samples from the conditional stimulus distribution are represented by the activity of E neurons (Eq. 14), while samples from the conditional context distribution are represented by recurrent inputs received by E neurons (Eq. 29; black curves overlaid on top of population responses in Fig. 4D and E respectively). To update recurrent inputs we only used neuronal activity at the previous time step. Thus, the activities of E neurons and their recurrent inputs were updated in alternation, consistent with Gibbs sampling. The trajectory obtained by plotting the stimulus sample read out from the network activity on one axis, and plotting the context sample read out from recurrent E inputs on another axis then exhibits the characteristics of Gibbs sampling (Fig. 4F, cyan line). The resulting sampling distribution provides a good approximation to the joint posterior of stimulus and context (compare red dots and blue contour in Fig. 4F). Inhibitory neurons again did not respond selectively to either the stimulus or the context.

For the network to generate samples from the joint posterior, the recurrent connectivity should depend on the correlation between the stimulus and the context (Eq. 6). To verify this prediction, we fixed the generative model (Eq. 3) and changed only the recurrent weights in the network. For simplicity, we only varied the peak E weight, *w*_{E} (Eq. 49), and maintained network stability by fixing the ratio between E and I synaptic weights. While increasing *w*_{E} did not change the sampling mean, it did increase the variance of the context sampling distribution, and increased the correlation between stimulus and context samples (Fig. 5A).

We use Kullback-Leibler (KL) divergence to measure the distance between the sampling distribution, *q*(*s, z*|**u**^{f}), and the true posterior, *p*(*s, z*|**u**^{f}) (Eq. 24). The KL divergence quantifies the loss of mutual information, measured in bits, between the latent variables (*s* and *z*) and the feedforward inputs, **u**^{f}, when the true posterior, *p*, is approximated by the distribution, *q* (Eq. 42) [35]. The mutual information loss in the network is minimized at a unique value of the recurrent weight, *w*_{E}*, at which the sampling distribution, *q*, best matches the posterior, *p* (Fig. 5B, black circle). To confirm that this optimal recurrent weight, *w*_{E}*, increases with the correlation in the prior (precision Λ_{s}, Eq. 6), we numerically obtained the recurrent weight that minimizes the mutual information loss for each value of Λ_{s} in the generative model. These results confirmed the predictions of our theory (Eq. 6, Fig. 5C): When Λ_{s} = 0, i.e. when context and stimulus are uncorrelated, a network with no interactions performs best (*w*_{E}* = 0), while for small Λ_{s} (relative to Λ_{f}) the optimal weight is positive and increases with Λ_{s}. In total, we have described a potential mechanism for a recurrent network of spiking neurons to perform sampling-based Bayesian inference.
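The mismatch penalty can be reproduced in closed form. For a bivariate Gaussian posterior of the hierarchical type, replacing the true prior precision Λ_{s} in the sampling distribution with an effective value (playing the role of the stored recurrent weight) gives a KL divergence with a unique minimum at the true value. The precision values below are invented for illustration:

```python
import numpy as np

def kl_gauss(mu_q, Sig_q, mu_p, Sig_p):
    """KL(q || p) between two multivariate Gaussians, in nats."""
    d = len(mu_q)
    iSp = np.linalg.inv(Sig_p)
    dm = mu_p - mu_q
    return 0.5 * (np.trace(iSp @ Sig_q) + dm @ iSp @ dm - d
                  + np.log(np.linalg.det(Sig_p) / np.linalg.det(Sig_q)))

Lambda_f, Lambda_s = 4.0, 2.0    # hypothetical likelihood and prior precisions

def posterior_cov(lam):
    """Covariance of the joint Gaussian posterior when the prior precision is lam."""
    P = np.array([[Lambda_f + lam, -lam], [-lam, lam]])  # joint precision matrix
    return np.linalg.inv(P)

mu = np.zeros(2)
Sig_true = posterior_cov(Lambda_s)       # posterior under the true prior
lams = np.linspace(0.5, 4.0, 36)         # candidate stored prior precisions
kls = [kl_gauss(mu, posterior_cov(lam), mu, Sig_true) for lam in lams]
best = lams[int(np.argmin(kls))]         # divergence vanishes at lam = Lambda_s
```

Scanning the stored precision and plotting `kls` reproduces the U-shaped information loss of Fig. 5B: too weak a stored prior under-correlates the samples, too strong a prior over-correlates them.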

### Generating samples from multi-dimensional posteriors with coupled neural circuits

To demonstrate the generality of the proposed neural code we next consider a world described by a broad, rather than deep (hierarchical) generative model. Information about each of two latent stimuli, **s** = (*s*_{1}, *s*_{2}), is relayed by corresponding feedforward inputs received by a neural circuit (Fig. 6A). We assume the prior is a bivariate Gaussian distribution (Fig. 6B), i.e., *p*(**s**) ∝ exp[−Λ_{s}(*s*_{1} − *s*_{2})²/2], so that Λ_{s} (Λ_{s} ≥ 0) characterizes the correlation between *s*_{1} and *s*_{2}. Furthermore, each stimulus, *s*_{m}, independently generates feedforward spiking inputs, **u**^{f}_{m}, each of which is received by a separate network and produces responses **r**_{m} for *m* = 1, 2 (Fig. 6A).

Thus, the full generative model of the input has the form,

*p*(**u**^{f}, **s**) = *p*(**u**^{f}_{1}|*s*_{1}) *p*(**u**^{f}_{2}|*s*_{2}) *p*(*s*_{1}, *s*_{2}).  (7)

The likelihood is the same as that given previously (Eq. 2), where the feedforward inputs, **u**^{f}_{m}, are again described by conditionally independent Poisson spike counts with Gaussian tuning over stimulus *s*_{m}. As a concrete example, the two stimuli, *s*_{m}, could represent orientations of local edges falling in the central receptive fields of a V1 hypercolumn (Fig. 6A, bottom), with each V1 hypercolumn modeled by a network producing the response **r**_{m} (Fig. 6A, top). Then Λ_{s} characterizes *a priori* tendency of the stimuli to share similar orientations, and determines how likely two local edges are to be part of a global line, as in the case of contour integration [42, 43]. However, the generative model defined by Eq. (7) is quite general and has been also used to explain multisensory cue integration [10] and sensorimotor learning [13].

The posterior is a bivariate Gaussian distribution (Fig. 6D, Eq. 34) whose mean is shifted from the likelihood mean (Fig. 6C) toward the diagonal line, because of the correlations between the stimuli in the prior (Fig. 6B). We can again use Gibbs sampling to approximate the posterior *p*(**s**|**u**^{f}) using the following steps,

*p*(*s*_{1}|*s*_{2,t−Δt}, **u**^{f}_{1}) ∝ *p*(**u**^{f}_{1}|*s*_{1}) *p*(*s*_{1}|*s*_{2,t−Δt}),  (8a)
*s*_{1t} ∼ *p*(*s*_{1}|*s*_{2,t−Δt}, **u**^{f}_{1}),  (8b)

where *s*_{1t} and *s*_{2t} are instantaneous samples at time *t* of stimuli *s*_{1} and *s*_{2} respectively. We only give the steps needed to produce samples from the conditional distribution of *s*_{1}, as samples from the conditional distribution of *s*_{2} can be obtained using the same steps after exchanging indices.
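These steps can again be sketched as a scalar Gibbs sampler. With Gaussian likelihoods for the two inputs and the coupled prior above (all means and precisions below are invented for illustration), the posterior mean of each stimulus is pulled toward the diagonal:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical likelihood parameters for the two feedforward inputs
mu1, L1 = -0.4, 6.0   # likelihood of s_1: Gaussian with mean mu1, precision L1
mu2, L2 = 0.4, 6.0    # likelihood of s_2
Ls = 4.0              # prior coupling: p(s_1, s_2) ~ exp(-Ls * (s_1 - s_2)^2 / 2)

T = 20000
s1, s2 = 0.0, 0.0
out = np.empty((T, 2))
for t in range(T):
    # Conditional of s_1: product of its likelihood and the prior coupling
    prec1 = L1 + Ls
    s1 = rng.normal((L1 * mu1 + Ls * s2) / prec1, 1 / np.sqrt(prec1))
    # Same step with indices exchanged for s_2
    prec2 = L2 + Ls
    s2 = rng.normal((L2 * mu2 + Ls * s1) / prec2, 1 / np.sqrt(prec2))
    out[t] = (s1, s2)

# The sample means sit between the likelihood means and the diagonal s_1 = s_2,
# and the two chains are positively correlated through the prior coupling.
```

The shift of each sample mean away from its own likelihood mean, and the positive correlation between the two chains, are the scalar analogues of the posterior shift in Fig. 6D.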

These sampling steps can be implemented distributively in a coupled neural circuit using a mechanism similar to the one we described in the case of a hierarchical generative model. The activity of each network, **r**_{m}, individually represents samples from the (marginal) posterior of *s*_{m} (Fig. 6A, top). The joint posterior is then approximated as the collection of samples represented by the activity pairs (**r**_{1}, **r**_{2}). Taking network *m* = 1 as an example, the spike response **r**_{1t} produces a stimulus sample, *s*_{1t}, as long as the instantaneous firing rate *λ*_{1t} represents the conditional distribution *p*(*s*_{1}|*s*_{2,t−Δt}, **u**^{f}_{1}) (Eq. 8a). Since the feedforward input, **u**^{f}_{1}, represents the likelihood *p*(**u**^{f}_{1}|*s*_{1}), to obtain the appropriate firing rates, *λ*_{1t}, the recurrent input from network 2 to network 1, **u**^{r}_{21}, must encode the correct conditional distribution, *p*(*s*_{1}|*s*_{2t}). As in the mechanism we proposed to implement sampling as described by Eq. (5), **u**^{r}_{21} needs to have the same Gaussian profile as the firing rate *λ*_{1t}, the position of **u**^{r}_{21} in stimulus space should match the mean of *p*(*s*_{1}|*s*_{2t}), i.e., *s*_{2t}, and the magnitude of **u**^{r}_{21} must be proportional to the prior precision, Λ_{s} (Eq. 39). Hence, each network can sum the feedforward input and the recurrent input from its counterpart to obtain an update to the instantaneous conditional distribution given by Eq. (8a), and generate independent Poisson spikes to produce a sample from the instantaneous conditional distribution (Eq. 8b). Notably, the sample of each stimulus can be locally read out from the corresponding network (Eq. 41, Fig. 6A), even if the activities of the two networks are correlated.

Since the recurrent input strength represents the stimulus correlation in the prior, determined by the precision Λ_{s}, the coupling between the two networks needs to be tuned to generate the appropriate recurrent input. Indeed, in a network with only E neurons, and connections only between neurons with the same preferred stimulus value but in different networks, the optimal homogeneous connection strength is given by Eq. (40). This mirrors the result obtained with the hierarchical model presented earlier in Eq. (6).

### Coupled E-I spiking networks sample bivariate posteriors

To test the feasibility of the proposed mechanisms for generating samples from a bivariate posterior we simulated a pair of bidirectionally coupled circuits consisting of E and I neurons (Fig. 7A).

This neural circuit model can be extended to generate samples from higher dimensional posterior distributions (see Discussion). Each circuit receives feedforward input generated by one of the two stimuli. On every time step the sample of each stimulus can be individually and linearly read out from the response of the corresponding network, **r**_{mt} (Eq. 41). Jointly, the two stimulus samples, one from each network, provide a sample from the joint posterior of the two latent stimuli (Fig. 7B). We assumed that the synaptic connections between the networks, *w*_{mn} (*m, n* = 1, 2; *m* ≠ *n*), are excitatory, but target both E and I neurons, while inhibitory connections are local to each network. We also adjusted network parameters so that the inputs across networks (e.g., the inputs from network 2 to 1) have the same tuning profile as the feedforward inputs (see Methods). Since we assumed uniform marginal priors (see Eq. 32), recurrent connections between E neurons within a circuit were absent, while E and I neurons within a circuit were recurrently connected to ensure network stability. For simplicity, we chose parameters so that the two circuits were symmetric, but the strength of the feedforward inputs to each could differ.

We asked whether the activity of the two coupled circuits can generate samples from bivariate posteriors, and how the sampling distribution depends on the coupling, *w*_{mn}, between the two circuits. An increase in synaptic coupling between the two networks caused the sampling distribution to shift from the likelihood mean towards the diagonal (Fig. 7B), resulting in samples of the two stimuli that were more similar. This is consistent with an increase in stimulus correlation in the multivariate prior, Λ_{s} (Eq. 7). To confirm our prediction that the optimal coupling strength between the two networks increases with the stimulus correlation in the prior, Λ_{s}, we numerically obtained the coupling weight that minimizes the loss of mutual information between latent stimuli and feedforward inputs (Fig. 7C). The optimal synaptic weight between the circuits increased with stimulus correlation in the prior. At the optimal weight the sampling distribution was close to the true posterior, showing that a properly tuned circuit can generate samples from the correct distribution (Fig. 7D).

We next asked how the sampling distribution in the network depends on network and feedforward input parameters. As the coupling between the two circuits increased, the sample means of the two stimuli converged (Fig. 7E, top) and the sampling precision of both stimuli increased as well (Fig. 7E, bottom), in agreement with a more correlated stimulus prior. We also tested whether a network with fixed parameters can generate samples from a family of posteriors with different uncertainties. To do so, we changed the uncertainty of the likelihood of *s*_{1} by changing the firing rate of the feedforward input received by network 1. We observed that with a narrower likelihood of *s*_{1}, the sample means of both stimuli shifted towards the mean of the likelihood of *s*_{1} (−10°), and the sampling precision increased, consistent with a change in the posterior distribution (Fig. 7F). Lastly, to demonstrate the robustness of this network implementation of sampling-based inference, we compared the sampling distributions to the true posteriors under different combinations of input and network parameters (Fig. 7G-H), in each case setting the recurrent coupling to the optimal value obtained numerically. Across different parameter values we observed excellent agreement in both the mean (Fig. 7G) and precision (Fig. 7H) of the two densities. In sum, our recurrent network of spiking neuron models can be extended to support sampling-based Bayesian inference with multi-dimensional stimuli.
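A toy spiking version of this scheme can be sketched as follows: each of two populations fires Poisson spikes from a Gaussian rate bump whose center combines its feedforward position with the sample read out from its partner, and each stimulus sample is the spike-weighted mean of the preferred stimuli. This is a minimal sketch of the coupled-circuit idea, not the full E-I model of Fig. 7; the tuning parameters, precisions, and the conditional-mean construction of the bump position are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.linspace(-60.0, 60.0, 121)   # preferred stimuli (deg)
a, R, dt = 10.0, 50.0, 0.002            # tuning width, peak rate (Hz), time step (s)
mu_f = np.array([-10.0, 10.0])          # feedforward input positions of the two nets
lam_f, lam_s = 0.05, 0.05               # likelihood / prior precisions (assumed)

def bump(center):
    return R * np.exp(-(theta - center) ** 2 / (2 * a ** 2))

s_hat = mu_f.copy()                     # current stimulus samples (s1, s2)
samples = []
for t in range(30000):
    for m in (0, 1):                    # sequential update, as in Gibbs sampling
        # bump center = conditional mean combining feedforward and recurrent input
        center = (lam_f * mu_f[m] + lam_s * s_hat[1 - m]) / (lam_f + lam_s)
        spikes = rng.poisson(bump(center) * dt)
        if spikes.sum() > 0:            # linear readout of the sample
            s_hat[m] = (spikes * theta).sum() / spikes.sum()
    samples.append(s_hat.copy())
samples = np.array(samples[2000:])
```

As the coupling term `lam_s` grows relative to `lam_f`, the two sample means move toward each other, mirroring Fig. 7E; the spike-based readout adds extra variance on top of the true posterior width.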

### A signature of stimulus sampling: internally generated differential noise correlations

A central prediction of our circuit framework for sampling-based Bayesian inference is that an increase in the correlation between stimuli in the sensory world should result in stronger synapses between neurons whose activities represent these stimuli (see Eq. 6). This is a difficult prediction to test since measuring synaptic connectivity along a functional axis is already challenging [44], let alone measuring a change in synaptic strength owing to a change in stimulus statistics. Here we outline a testable prediction of our theory by identifying a measurable, population-level signature of changes in functionally related recurrent synaptic strengths.

In response to a fixed feedforward input the responses of a recurrent circuit implementing stimulus sampling will fluctuate. The alignment of the recurrent circuitry and neuronal stimulus tuning causes a portion of these activity fluctuations to align with the subspace in which stimuli are coded. As an example, consider the sampling implemented by a single recurrent network (Fig. 4A), and suppose the population response fluctuates around its mean position (0° in the example of Fig. 8A), ignoring fluctuations along other directions in neuronal response space. The activity of neuron pairs with stimulus preferences both above or both below the mean position are positively correlated (the black and blue neurons in Fig. 8A), while the activity of neuron pairs with preferences straddling the mean are negatively correlated (the black and red neurons in Fig. 8A). Such stimulus sampling generates a covariance component which is proportional to the outer product of the derivative of the neuronal tuning (Fig. 8B), i.e., **f**′(*s*)**f**′(*s*)^{T}, where **f**′(*s*) denotes the derivative of the tuning **f**(*s*) = ⟨*λ*_{t}⟩ (mean firing rate) with respect to the stimulus *s*. Such noise correlations have been referred to as *differential correlations* [4, 17], and are generally viewed as deleterious to stimulus coding. Stochastic sampling in coupled networks (Fig. 6A) produces similar differential noise correlations (see Supplemental Information).
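To make the geometry concrete, the following sketch jitters the position of a Gaussian rate bump around 0° and compares the resulting spike-count covariance with the outer product of the tuning derivative; all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
theta = np.linspace(-40.0, 40.0, 81)    # preferred stimuli (deg)
a, R, dt = 10.0, 50.0, 0.02             # tuning width, peak rate (Hz), window (s)
sigma_mu = 4.0                          # std of bump-position fluctuations (assumed)

def rate(center):
    return R * np.exp(-(theta - center) ** 2 / (2 * a ** 2))

# Poisson spike counts with a randomly jittered bump position on each draw
centers = rng.normal(0.0, sigma_mu, size=50000)
spikes = rng.poisson(rate(centers[:, None]) * dt)
cov = np.cov(spikes.T)

# Differential-correlation prediction: sigma^2 * f'(s) f'(s)^T at s = 0
f = rate(0.0) * dt
fp = np.gradient(f, theta)
pred = sigma_mu ** 2 * np.outer(fp, fp)
```

Off-diagonal covariances track `pred`: pairs on the same flank of the bump (e.g., +10° and +15°) are positively correlated, while pairs straddling the mean (−10° and +10°) are negatively correlated.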

In our network implementation of sampling, the amplitude of internally generated differential correlations is not arbitrary, but is determined by the recurrent connection strength. Here the differential covariance matrix of population responses has the form given in Eq. (44),
where the variance in question is that of the mean of the instantaneous conditional distribution (Eq. 4a), taken in equilibrium over time; this mean is represented by the position of the instantaneous firing rate *λ*_{t} (Fig. 2B). Importantly, the amplitude of differential correlations increases with the recurrent weight, *w*_{E}, which is set by the prior precision Λ_{s} (Eq. 6; Fig. 8C). Thus, in our framework internally generated differential correlations are a by-product of inference by sampling from posterior distributions of stimuli in a structured world.

### Distinguishing external and internal differential correlations

The previous analysis of internally generated differential correlations in a circuit implementing sampling-based inference was based on the assumption of a fixed feedforward input (Eq. 9). However, in typical neurophysiology experiments an external stimulus, *s*, is fixed, while the feedforward input, **u**^{f}, fluctuates due to variability in sensory acquisition and transmission noise (Eqs. 3 and 7). Hence, differential correlations of neuronal population responses are a combination of correlations inherited from the feedforward input [45] and correlations generated by recurrent network interactions that align with the population stimulus tuning [24]. When the feedforward input is described by a hierarchical generative model (Eq. 2), the total magnitude of differential correlations in the evoked response is given by Eq. (46), where the second term reflects differential correlations inherited from the feedforward input (compare with Eq. 9). Although the two sources of differential correlations are intertwined in the neuronal response, they impact the information content differently, thus offering a potential way to distinguish between them in neural data.

Externally generated differential correlations decrease with the feedforward input rate, which can be modulated by visual stimulus strength such as contrast (Fig. 8D, red curve). As a consequence, the mutual information (the information between feedforward inputs **u**^{f} and the latent variables, i.e., *s* and *z*, sampled by the recurrent network in Fig. 4A, Eq. 42) increases with feedforward input intensity (Fig. 8A, blue curve). We therefore have a monotonic, decreasing relationship between externally generated differential correlations and mutual information. This is expected, since such inherited correlations always impair information processing, as observed previously [4, 17]. In contrast, an increase in recurrent weights, *w*_{E}, increases internally generated differential correlations, but results in a non-monotonic change in mutual information (Fig. 8B). Hence there is a non-monotonic relation between internally generated differential correlations and the mutual information between stimulus and feedforward inputs. In sum, the impact of external and internal differential correlations on stimulus coding can be distinguished by their respective monotonic and non-monotonic relations with the mutual information between stimulus and response.
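The monotonic cost of inherited differential correlations can be verified directly from the linear Fisher information: adding a component ε**f**′**f**′^{T} to the response covariance caps the information at 1/ε [4, 17]. A short check, with an arbitrary Gaussian tuning curve and Poisson-like private variability (all values illustrative):

```python
import numpy as np

theta = np.linspace(-40.0, 40.0, 81)
a, R = 10.0, 5.0
f = R * np.exp(-theta ** 2 / (2 * a ** 2))   # mean counts at s = 0 (illustrative)
fp = np.gradient(f, theta)                   # tuning derivative f'(s)

def linear_fisher(eps):
    # covariance = private Poisson-like part + differential part eps * f' f'^T
    cov = np.diag(f) + eps * np.outer(fp, fp)
    return fp @ np.linalg.solve(cov, fp)

I0 = linear_fisher(0.0)
I1 = linear_fisher(1.0)
# Sherman-Morrison gives I(eps) = I0 / (1 + eps * I0): information decreases
# monotonically with eps and saturates at 1/eps, however many neurons are added.
```

This is why inherited (external) differential correlations can only hurt: their amplitude ε maps monotonically onto an information ceiling, unlike the internally generated component discussed above.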

## Discussion

We have presented a framework in which neuronal response variability and recurrent synaptic connections, two ubiquitous features of cortex, are jointly used to implement sampling-based Bayesian inference in neuronal circuit models. Combining mathematical analysis and network simulations, we established that stereotypical Poisson variability of discrete spike counts can drive flexible sampling from a family of continuous distributions. The sampling statistics are determined by the structure of recurrent coupling, which stores information about the stimulus prior, and by feedforward inputs, which convey the stimulus likelihood. Sampling-based inference is implemented in two steps: the instantaneous firing rate, determined by the sum of feedforward and recurrent inputs, represents the instantaneous conditional distribution of the latent stimulus, while Poissonian variability in spike generation produces a random stimulus sample from this conditional distribution. A simple circuit model is able to generate samples from multi-dimensional posteriors of latent variables organized hierarchically or in parallel, a computation that forms the basis of a wide range of perceptual and cognitive processes [46].

### Comparison with other neural coding frameworks

The neural code we described shares some features with codes described in previous studies, including parametric representations in probabilistic population codes (PPCs) [15, 36, 37] and sampling-based codes (SBCs) [16, 27–32]. In our framework the conditional distributions of latent variables are represented by instantaneous firing rates which linearly encode the logarithms of these conditional distributions, and have a mathematical form similar to that used in past studies describing PPCs (e.g., Eq. 5). Further, the posterior is represented by stimulus samples generated through a random process, a feature of all SBCs. Despite these similarities, there are fundamental differences between the neural code we described and previously proposed PPCs and SBCs.

PPCs are generally implemented in networks with no internally generated variability, with stochasticity inherited from the stimulus. In contrast, our proposed network is doubly stochastic: The Poisson variability in the feedforward input allows a single realization of the feedforward input to represent the whole stimulus likelihood [36], while internally generated Poisson variability drives stimulus sampling. Further, in PPCs the posterior is represented parametrically by a one-shot neuronal response, while in our proposed network the joint posterior is approximated by a sequence of samples, each obtained as a linear readout from the instantaneous neuronal responses. Although it takes time to collect sufficient samples to approximate the posterior, a computational benefit compared with PPCs is that inference of a multivariate posterior can be implemented by linearly coupled networks (Fig. 6), while in PPCs nonlinear coupling between networks is required [47].

Conventional SBCs are used to generate samples directly in a neural space whose dimension is given by the number of neurons in the population [16, 27, 28, 30–33], where a neuronal response, **r**_{t}, is interpreted directly as a sample from the (marginal) posterior of neuronal responses, *p*(**r**). Hence the posterior mean is the temporally averaged population response, and the covariance of population responses is the posterior covariance. In contrast, our proposed network generates samples in a low dimensional stimulus subspace embedded in the high dimensional neural activity space. The linear projection of network activity, **r**_{t}, onto the stimulus subspace represents a sample from the stimulus posterior, similar to a previous study [29]. A computational benefit of sampling in a low dimensional stimulus subspace is convergence speed, as the volume of the stimulus subspace is significantly smaller than that of the neural activity space. Indeed, in our examples sequences of samples generated by a single recurrent network (Fig. 4) and by coupled networks (Fig. 6) both converge to an equilibrium distribution in less than 20 ms, which is fast enough to complete inference on a behaviorally relevant time scale (Fig. S6). Furthermore, the multiplication of probability distributions of the latent stimulus, which is central to Bayesian inference (e.g., cue combination, decision making; see review in [15]), can be implemented by summing the inputs to a neuronal population (Eq. 5). This follows from the fact that the instantaneous population input (or firing rate) linearly encodes the logarithm of a probability distribution (Eqs. 1 and 5). In contrast, producing samples in neural activity space using conventional SBCs requires nonlinear operations in neural circuits in order to multiply probability distributions (or histograms) of the samples [15].
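The multiplication-by-summation property can be illustrated in a few lines: two population inputs that log-linearly encode Gaussian likelihoods are simply added, and the position readout of the summed input recovers the precision-weighted mean of the product of Gaussians. The tuning width, gains, and positions below are arbitrary.

```python
import numpy as np

theta = np.linspace(-40.0, 40.0, 81)   # preferred stimuli (deg)
a = 10.0                               # tuning width

def pop_input(mu, gain):
    # population input whose log-linear readout encodes N(s; mu, a^2 / sum(u))
    return gain * np.exp(-(theta - mu) ** 2 / (2 * a ** 2))

u1 = pop_input(-10.0, 20.0)            # likelihood 1: mean -10, lower precision
u2 = pop_input(10.0, 40.0)             # likelihood 2: mean +10, higher precision
u = u1 + u2                            # summing inputs multiplies the distributions

mu_hat = (u * theta).sum() / u.sum()   # position readout of the combined input
# Product of Gaussians: precision-weighted mean (n1*mu1 + n2*mu2) / (n1 + n2)
mu_prod = (u1.sum() * -10.0 + u2.sum() * 10.0) / (u1.sum() + u2.sum())
```

Because the summed input has twice the gain at +10° as at −10°, the combined readout sits at the precision-weighted compromise (here near +10/3), with no nonlinear operation required.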

A recent study demonstrated that an E-I recurrent network of rate-based neurons can be numerically optimized for sampling-based Bayesian inference [32]. In contrast, we used a theoretical approach to derive a network model of simplified spiking neurons which implements sampling-based inference. This allowed us to explicitly describe the putative neural mechanisms needed for such sampling. Although the two studies use different generative models and neural representations, the network models in both studies share some common characteristics: ring structure, Poisson-like response variability, and tuning-dependent noise correlation (Fig. S1D). This implies that the seemingly different generative models and neural representations in the two studies reflect more general principles, as suggested in [48]. It will be interesting to extend our theoretical approach to dynamical spiking neurons to determine how the timescales of neuronal dynamics and neuronal oscillations impact inference in rich, dynamic sensory scenes (see below).

### Testing the prediction that recurrent synaptic strength is determined by correlations between latent stimuli

Differential noise correlations generated by recurrent network interactions are a signature of network sampling in our framework (Fig. 5C and 8C). This is in contrast to earlier studies where differential correlations were inherited from feedforward inputs [17, 49]. While internally generated differential correlations could also emerge from a recurrent circuit which is not implementing inference [22, 24, 49–52] or implementing inference via other algorithms [53], in our framework the relation between the magnitude of internally generated differential correlations, the posterior uncertainty, and the strength of the recurrent synaptic weight (Eq. 9) provides a clear test which can be used to verify our proposed circuit mechanism of sampling-based inference. One possible experimental approach would modulate the functional recurrent strength by using a perceptual learning task. Specifically, after using a reference stimulus set with a prescribed correlation between latent stimuli to fully train an animal, we expect that recurrent synaptic weights will strengthen or weaken to improve inference (Fig. 8E, dashed line). This will result in a fixed value of differential noise correlations in the population response due to the recurrent circuitry. Re-training with a stimulus set that has more (less) correlated latent stimuli compared to the reference set will cause the recurrent weights to increase (decrease) (Fig. 8E, red line). When the reference stimulus set is again used to drive task behavior, then performance (as a proxy of mutual information) will decrease, regardless of whether differential correlations have increased or decreased compared to those resulting from the reference stimulus set (Fig. 8E, arrows). In brief, the non-monotonic relationship between differential noise correlations and the mutual information between stimulus and responses which support Bayesian inference offers a clear (and falsifiable) experimental prediction.

### Extensions of circuit-based Bayesian inference

Implementing sampling-based inference in our proposed network requires that feedforward and recurrent inputs have the same tuning profile over the stimulus (Eq. 5). This assumption is supported by experiments in layers 4 and 2/3 of mouse V1 [8]. Moreover, the recurrent connections in our network model are translation-invariant in the stimulus subspace, an assumption widely used in continuous attractor networks (CANs) [22, 51, 54, 55]. Translation-invariant connections simplify the mathematical analysis, but are not required for a circuit to implement sampling. Adding randomness to the recurrent connectivity only increases the variance of the sampling distribution. In the past, CANs have been shown to achieve maximum likelihood estimation (a point estimate) via template matching [15, 55, 56]. Here we have shown that a network with CAN-like structure and internal Poisson spiking variability is able to perform sampling-based Bayesian inference. In our network, correlations in the stimulus prior are represented by the strength of recurrent synaptic activity, which implies that the (subjective) prior precision in the network increases with the feedforward input strength. To maintain a fixed prior, the recurrent weights need to decrease as the feedforward input strength, which encodes the likelihood precision, Λ_{f}, increases (Eq. 6). Therefore, the (subjective) prior stored in a network with fixed recurrent weights may differ from the objective stimulus prior in the world (Λ_{s} in Eqs. 3 and 7) when feedforward inputs have different strengths. This could be solved by short-term synaptic depression, which decreases synaptic efficacy at increased neuronal firing rates [57]. On the other hand, since the proposed recurrent circuit is general, this result may explain the origin of inductive bias [58] or confirmation bias [59] in cortical processing.
Another possibility is that the recurrent circuit represents a more complex generative model which better captures the statistical structure of natural stimuli [30, 32, 60]. We only considered sampling driven by spiking variability with a Fano factor of 1, while cortical responses often have Fano factors that differ from 1 [61, 62]. In the latter case, our theory can still work by changing the feedforward connection weight to compensate for the change in Fano factor, as suggested in a recent study [63].

To keep our exposition transparent we only presented models with minimal complexity. Our proposed network mechanism of sampling-based inference can be generalized to more complex generative models, since the assumption of Gaussianity (Eqs. 21 and 22) and the analytical expression in Eq. (24) are not essential, and several relaxed frameworks may be explored. First, similar networks can generate samples from other multi-dimensional distributions whose conditional distributions of each latent variable belong to the linear exponential family [35, 36]. This could be done by changing the tuning functions of neurons to another appropriate profile, as the logarithm of the tuning determines the type of sampling distribution (Eq. 1). When sampling from non-Gaussian distributions, the stimulus samples can be linearly read out with weights determined by the tuning profile (i.e., **h**(*s*) in Eq. 1, [36]). Second, the tuning of recurrent inputs does not need to be the same as that of feedforward inputs. Instead, the logarithm of the recurrent input tuning can take the form of the conjugate prior to the likelihood conveyed by feedforward inputs. Third, the network model could also be used to infer latent variables with a non-uniform marginal prior if, for example, the preferred stimuli of neurons in the population are not distributed uniformly in the stimulus subspace [64]. Lastly, we considered only non-structured inhibition for simplicity. Structured inhibitory connections could modulate the position of excitatory responses in the stimulus subspace, i.e., the mean of the conditional distribution. Such interplay between E and I neurons with structured inhibition has the potential to implement Hamiltonian sampling, where the I neurons represent samples of auxiliary variables [33, 35].

In conclusion, we have shown that a recurrent circuit of neurons with Poisson spiking statistics can implement sampling from a family of multivariate posterior distributions, with internal spiking variability driving the generation of stimulus samples and the recurrent connections representing the stimulus prior. The proposed neural code may help us understand the structure of neuronal activity, and provides building blocks for more complex population computations.

## Methods

### A linear network of excitatory neurons

We study how a generic recurrent network model consisting solely of *N*_{E} excitatory (E) neurons with Poisson spiking statistics (no inhibitory neurons) can implement sampling-based Bayesian inference to approximate the stimulus posterior. We describe neuronal activity using a time-discretized Hawkes process (a type of multivariate, inhomogeneous Poisson process [65]). The instantaneous firing rates of the neurons in the network at time *t*, *λ*_{t}, obey the following recurrent equations:
where **u**^{f} is the feedforward Poisson spiking input (described below; Eq. 18), **u**^{r}_{t} is the continuous-valued recurrent input at time *t*, and *ξ*_{t} is an *N*_{E} dimensional independent Gaussian white noise. Hence, over each time interval [*t* − Δ*t, t*] the activity of the neurons in the network is modeled by a vector of independently generated Poisson spike counts, **r**_{t}, with means determined by the rates *λ*_{t}. The parameters *w*_{E} and *σ*_{r} determine the excitatory recurrent weight and recurrent variability, respectively.
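A direct discretization of these dynamics, with only self-excitation (as derived below for the E-only network) and omitting the noise term for brevity, can be simulated in a few lines. The rates and weight are illustrative; with a dimensionless self-weight *w*_{E} < 1 applied to the previous spike count, linear theory gives a stationary mean count of *u*^{f}Δ*t*/(1 − *w*_{E}) and a Fano factor of 1/(1 − *w*_{E}²), slightly above 1.

```python
import numpy as np

rng = np.random.default_rng(4)
theta = np.linspace(-40.0, 40.0, 81)
dt, T = 0.002, 20000                      # time step (s), number of steps
u_f = 20.0 * np.exp(-theta ** 2 / 200.0)  # feedforward rate profile (Hz, assumed)
w_E = 0.3                                 # dimensionless self-excitation weight (< 1)

r = np.zeros(len(theta))
counts = np.empty((T, len(theta)))
for t in range(T):
    lam_dt = u_f * dt + w_E * r          # expected count this step (Hawkes update)
    r = rng.poisson(lam_dt)              # conditionally independent Poisson spikes
    counts[t] = r
```

Because all terms are non-negative, the rectification implicit in a rate never binds here, so the linear predictions for the mean and Fano factor apply.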

### Poisson spike generation samples stimulus

Independent Poisson spike generation in the network whose activity is described by Eq. (11) can drive sampling across time or across trials from a conditional stimulus distribution determined by the instantaneous firing rate *λ*_{t}. Below we compute the distribution of stimulus samples given *λ*_{t}. We assume that the instantaneous firing rate, *λ*_{t}, has a smooth bell-shaped profile and can be parameterized as,
where the position parameter characterizes the location of the population firing rate in the stimulus subspace (Fig. 1B, x-axis), while *R* and *a* denote the height and width of the population firing rate, respectively. Further, *θ*_{j} is the preferred stimulus value of neuron *j*, and the preferred stimuli of all neurons, {*θ*_{j}}, are uniformly distributed over the range of the stimulus *s* (Fig. 1B).

To simplify the analysis, we first assume that the instantaneous firing rate is fixed over time. When generating Poisson spikes **r**_{t} from *λ*_{t}, the probability of observing a stimulus sample (embedded in **r**_{t}) can be derived as (see details in Supplemental Information),
where *n*_{r} = ∑_{j} **r**_{tj} is the number of emitted spikes across the whole neural population, and *n*_{λ} = ∑_{j} ⟨*λ*_{j}⟩ Δ*t* is the sum of the population firing rate. Here 𝒩(*s*|*μ, σ*^{2}) denotes a Gaussian distribution with mean *μ* and variance *σ*^{2}, and the rate vector has *j*^{th} element as shown in Eq. (12). The logarithm of the firing rate profile determines how the stimulus sample and its mean can be read out from **r**_{t} and *λ*_{t}, respectively,
where the two read-out values characterize the positions of **r**_{t} and *λ*_{t} in the stimulus subspace.

The sampling variability of a stimulus sample in a single time step depends on the number of emitted spikes, *n*_{r}. When the fixed rates, *λ*_{t}, repeatedly generate spikes over time, the sampling distribution can be calculated by marginalizing the likelihood (Eq. 13, last line) over different values of *n*_{r}, since *n*_{r} varies across time (a detailed calculation using the Laplace approximation can be found in the Supplemental Information),
Each stimulus sample is thus drawn from a conditional distribution determined by the instantaneous firing rate, and can be written as
The last proportionality in the above equation is satisfied by a Gaussian profile of the firing rate (a more general derivation can be found in the Supplemental Information). Introducing Λ = *a*^{−2}*n*_{λ} gives Eq. (1) shown in the main text.
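Eq. (16) can be checked empirically: draw Poisson spikes from a fixed Gaussian rate profile, read out the sample position as the spike-weighted mean of the preferred stimuli, and compare the spread of the samples with *a*²/*n*_{λ}. Parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
theta = np.linspace(-40.0, 40.0, 81)   # preferred stimuli theta_j (deg)
a, R, dt, mu = 10.0, 100.0, 0.01, 5.0  # width, peak rate (Hz), window (s), position

lam = R * np.exp(-(theta - mu) ** 2 / (2 * a ** 2)) * dt   # fixed rate profile
n_lam = lam.sum()                       # sum of population firing rate

spikes = rng.poisson(lam, size=(50000, len(theta)))
n_r = spikes.sum(axis=1)
keep = n_r > 0                          # readout is undefined with zero spikes
s_hat = (spikes[keep] * theta).sum(axis=1) / n_r[keep]

# Per Eq. (16), the samples are approximately Gaussian with mean mu
# and variance a^2 / n_lam.
```

The empirical variance sits slightly above *a*²/*n*_{λ} because the readout is a ratio of noisy counts; the correction is of order 1/*n*_{λ}.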

Eq. (16) shows that the type of sampling distribution (the conditional distribution) obtained from spike generation variability is determined by the profile of the instantaneous firing rate (Eq. 12). Although the sampling distribution belongs to the linear exponential family of distributions, as in probabilistic population codes (PPCs) [36], the two frameworks represent these distributions differently. In PPCs the likelihood is parametrically represented by a single realization of the independent neuronal response **r** (Eq. 13), while in our work the distribution is approximated by a sequence of samples, effectively generated by conditionally independent Poisson spike discharges.

The above analysis can be extended to the case where the instantaneous firing rate, *λ*_{t}, in a time step deviates from a smooth Gaussian profile (Eq. 12), which is the case in actual network simulations. In general, *λ*_{t} can be expressed as,
where *δ*_{⊥}*λ*_{t} denotes the deviation from a smooth Gaussian profile. Note that the sampling distribution only depends on the position and the sum of the instantaneous firing rate, *n*_{λ} (Eq. 16), which correspond to two perpendicular directions in the *N*_{E} dimensional space of *λ*_{t}. For any instantaneous firing rate vector, *λ*_{t}, we can always find a position and a height *R*_{t} that make the deviation *δ*_{⊥}*λ*_{t} perpendicular to these two directions, i.e., ∑_{j} *δ*_{⊥}*λ*_{tj}*θ*_{j} = 0 and ∑_{j} *δ*_{⊥}*λ*_{tj} = 0. This observation implies that deviations from Gaussian firing rate profiles do not affect our theory.

### Feedforward spiking input conveys the likelihood of stimulus

We model the feedforward inputs to the E neurons in the network, **u**^{f}, as independent Poisson spikes, with Gaussian tuning over stimulus *s*,
Here *u*^{f}_{j} denotes the feedforward input received by the *j*^{th} E neuron, and each input has Gaussian tuning over the stimulus. This mathematical description of the feedforward input is the same as the one used in the definition of typical PPCs [15, 36, 37]. Since the preferred stimulus values of all feedforward inputs are uniformly distributed in stimulus space, the likelihood of *s* given a single observation of the input, **u**^{f}, satisfies [36, 37],
The logarithm of the tuning, **h**(*s*), determines the type of likelihood [15]. Specifically, the Gaussian tuning leads to a Gaussian likelihood (Eq. 19), whose mean, *μ*_{f}, and precision, Λ_{f}, are both linear functions of the inputs,
The mean, *μ*_{f}, represents the position of **u**^{f} in the stimulus subspace, and the precision, Λ_{f}, is proportional to the total feedforward spike count, *n*_{f}.
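This PPC-style readout can be verified numerically: generate one realization of Poisson feedforward spikes with Gaussian tuning, evaluate the likelihood over a stimulus grid from the log-linear code, and compare its mean and precision with the linear readouts *μ*_{f} and Λ_{f}. Gains and widths below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
theta = np.linspace(-40.0, 40.0, 81)     # preferred stimuli of the inputs (deg)
a, s_true = 10.0, 0.0                    # tuning width, true stimulus

mean_counts = 30.0 * np.exp(-(theta - s_true) ** 2 / (2 * a ** 2)) * 0.02
u = rng.poisson(mean_counts)             # one realization of feedforward spikes u^f

# Likelihood from the log-linear code: log p(u|s) = sum_j u_j h_j(s) + const,
# with h_j(s) the log of the Gaussian tuning (quadratic in s)
s_grid = np.linspace(-20.0, 20.0, 401)
log_l = (u * (-(s_grid[:, None] - theta) ** 2 / (2 * a ** 2))).sum(axis=1)
lik = np.exp(log_l - log_l.max())
lik /= lik.sum()

mu_f = (u * theta).sum() / u.sum()       # linear readout of the likelihood mean
lam_f = u.sum() / a ** 2                 # precision proportional to spike count n_f
```

Because the log tuning is quadratic in *s*, the likelihood built this way is exactly Gaussian, so the grid-based mean and variance match the two linear readouts.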

### A recurrent network samples hierarchical latent variables

#### A hierarchical generative model

We consider a hierarchical generative model for which inference can be implemented in a recurrent circuit of Poisson neurons. We extend the simple generative model of feedforward input (Eq. 19) by considering the stimulus *s* to depend on a one dimensional context variable, *z*. For simplicity, we assume that *z* follows a uniform distribution (Fig. 3B, marginal plots)
where 𝒰(*a, b*) denotes a uniform distribution over [*a, b*]. The assumption of a uniform prior, *p*(*z*), simplifies our model significantly, as it implies the spatial homogeneity of the network model as given by Eqs. (18-19). However, this assumption is not essential for our main results. Due to the differences between the stimulus (local) and context (global) aspects of the sensory scene, the stimulus, *s*, is not identical to the context *z*, but we assume that the two are correlated, so that
In sum, the whole generative model is determined by,
where *p*(**u**^{f} |*s*) is the same as in Eq. (19).

#### Approximate Bayesian inference via Gibbs sampling

The joint posterior of stimulus and context can be analytically derived given the generative model (Eq. 23). We will use this expression to verify that the samples produced by our algorithm converge to the true posterior.

We use the stochastic response of our recurrent network (Eqs. 10-11) as a basis for Gibbs sampling [31, 35, 41] (a type of Monte Carlo method) to approximate the joint posterior of the stimulus, *s*, and context, *z*. To describe the iterative Gibbs algorithm, we assume that a context sample, *z*_{t}, is provided at time *t*, which is then combined with the feedforward input to update the conditional distribution of the stimulus *s* (step 1 in Fig. 3C),
The next step in the algorithm is to draw a stimulus sample, *s*_{t}, from this conditional distribution (step 2 in Fig. 3C),
Next, the conditional distribution of the context, *z*, is updated given this new stimulus sample, and a new context sample is drawn (step 3 in Fig. 3C),
These three steps of the Gibbs sampling algorithm (Eqs. 25-26) are performed iteratively until sufficiently many samples of *s* and *z* are generated to approximate the true posterior distribution with sufficient accuracy (Fig. 3D; compare the red dots with the blue contour map).
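These three steps can be sketched directly, treating the feedforward input as a Gaussian likelihood summary. This is a minimal sketch of the algorithm, not the circuit; the likelihood mean/precision and the *s*–*z* coupling precision are illustrative, and the uniform prior on *z* is taken wide enough to act as flat.

```python
import numpy as np

rng = np.random.default_rng(6)
mu_f, lam_f = 5.0, 0.1     # likelihood mean and precision (assumed)
lam_s = 0.05               # precision of p(s | z) coupling stimulus and context

z, out = 0.0, []
for t in range(30000):
    # Step 1: update the conditional p(s | z_t, u^f): precision-weighted combination
    prec = lam_f + lam_s
    mean = (lam_f * mu_f + lam_s * z) / prec
    # Step 2: draw a stimulus sample s_t from the conditional
    s = rng.normal(mean, prec ** -0.5)
    # Step 3: update p(z | s_t) and draw a context sample (flat prior on z)
    z = rng.normal(s, lam_s ** -0.5)
    out.append((s, z))
out = np.array(out[2000:])
# With a flat prior on z, the marginal of s converges to the likelihood
# N(mu_f, 1/lam_f), while z is broader: variance 1/lam_f + 1/lam_s.
```

The broader spread of the context samples relative to the stimulus samples reflects the extra uncertainty of the *s*-to-*z* link, as in Fig. 3D.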

#### Implementing the Gibbs sampling in a recurrent circuit model

Gibbs sampling of the stimulus (Eq. 4b) can be implemented via independent Poisson spike generation, as long as the conditional distribution encoded in *λ*_{t} (Eq. 16) is the same as the conditional distribution in the Gibbs sampling algorithm (Eq. 4a). This condition can be realized in the recurrent circuit by relating the expressions describing the neural dynamics (Eq. 10) to those describing the Gibbs sampling distribution (Eq. 4a) to yield,
The generative model for the feedforward input **u**^{f} (Eq. 19) fixes the form of the feedforward term. Hence, to satisfy Eq. (27) we require
which implies that the recurrent input should approximately have a Gaussian profile,
whose position on the stimulus subspace is set by the current context sample, and whose summed input (height) is determined by Λ_{s}, the precision of the conditional distribution. In a similar fashion to Eq. (17), the residual term denotes the deviation from a smooth Gaussian profile and is perpendicular to the directions encoding the sample position and Λ_{s}.

The optimal recurrent weight can be derived by combining Eq. (29) and Eq. (17). We note that the recurrent input, **u**^{r}, and the neuronal responses, **r**_{t}, have the same tuning width, *a*, in a network containing only E neurons. This can only be achieved if E neurons are exclusively self-connected (Eq. 10), since lateral connections broaden tuning. The optimal recurrent weight generating a recurrent input of the appropriate strength is then,
which yields Eq. (6) in the main text. Note that the self-connection is a result of the simplifying assumption that the network consists solely of E neurons (Eq. 10); this assumption can be relaxed in a full network consisting of both E and I neurons, as we show below.

The sampling of the context variable (Eq. 4c) can be implemented through variability in the recurrent input. To do this, we include a diffusive term in the recurrent interactions and equate the variance of the fluctuations with the mean to mimic a Poisson distribution:
where [·]_{+} rectifies negative values. Here *ξ*_{t} is an *N*_{E}-dimensional Gaussian white noise with ⟨*ξ*_{t}(*i*)*ξ*_{t′}(*j*)⟩ = *δ*_{ij}*δ*(*t* − *t*′), where *δ*_{ij} and *δ*(*t* − *t*′) are the Kronecker and Dirac delta functions, respectively; the mean recurrent input represents the conditional distribution of the context, and its position encodes a context sample (Eq. 29). The multiplicative variability in the recurrent interactions may originate from synaptic noise [66, 67].
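The noise model above — fluctuations whose variance equals the rectified mean of the recurrent input — can be sketched as follows (a Python illustration; leaving the mean unrectified and rectifying only the noise amplitude is our reading of the text, not the paper's code):

```python
import numpy as np

def noisy_recurrent_input(mean_input, rng):
    """Add fluctuations to the mean recurrent input whose variance
    equals the rectified mean, mimicking Poisson statistics (Eq. 31)."""
    amp = np.sqrt(np.maximum(mean_input, 0.0))  # [.]_+ rectification
    xi = rng.normal(size=np.shape(mean_input))  # Gaussian white noise
    return mean_input + amp * xi
```

For positive mean inputs the sample variance matches the mean, as for a Poisson count; negative mean inputs pass through noise-free.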

### Coupled circuits sample a multi-dimensional posterior

We consider a generative model which has multiple latent stimuli, **s** = (*s*_{1}, *s*_{2}, · · ·, *s*_{m}), which are organized in parallel (Fig. 6A). Without loss of generality, we consider the simplest case where *m* = 2, and the same mechanism can be straightforwardly extended to any *m >* 2. We assume the joint prior of **s** is a multivariate normal distribution,
and each stimulus *s*_{m} is uniformly distributed over (−180°, 180°] with periodic boundaries imposed. Defining a Gaussian distribution on this circular space works well as long as the variance of the distribution is much smaller than the range of the stimulus space. Here **Λ**_{s} is the precision matrix, while the scalar Λ_{s} (Λ_{s} ≥ 0) characterizes the correlation between *s*_{1} and *s*_{2}. Note that the covariance matrix is not defined, and the prior (Eq. 32) is improper. The mean, *μ*_{s}, is a free parameter because it does not appear in the explicit expression of the prior (Eq. 32), a consequence of the zero determinant of the precision matrix, i.e., |**Λ**_{s}| = 0. A further consequence is that the prior is not centered at *μ*_{s} but instead has a band structure along the diagonal, and the marginal prior of each stimulus feature, *p*(*s*_{m}) (*m* = 1, 2), is uniform (Fig. 6B). The uniform marginal prior simplifies our theoretical derivation, as it implies spatial homogeneity of the network model, but it does not affect the proposed neural coding mechanism.
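The band structure and the irrelevance of *μ*_{s} can be verified directly. Below we hypothesize the explicit form Λ_{s}[[1, −1], [−1, 1]] for the precision matrix — consistent with the stated properties (zero determinant, coupling strength Λ_{s}, band along the diagonal) though not written out in the text above:

```python
import numpy as np

def prior_precision(lam_s):
    """Hypothesized precision matrix of the improper band prior over
    (s1, s2): positive-semidefinite with determinant zero, so the
    direction (1, 1) (the diagonal band) is unconstrained."""
    return lam_s * np.array([[1.0, -1.0], [-1.0, 1.0]])
```

Because (1, 1) lies in the null space, shifting *μ*_{s} along the diagonal leaves the quadratic form, and hence the prior density, unchanged — which is why *μ*_{s} is a free parameter.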

Each stimulus *s*_{m} (*m* = 1, 2) individually generates a feedforward spiking input whose likelihood is exactly the same as in Eq. (2). Combined, the generative model is
where *μ*_{f} = (*μ*_{f1}, *μ*_{f2})^{⊤}, and the likelihood precision matrix **Λ**_{f} = diag(Λ_{f1}, Λ_{f2}) is a diagonal matrix.

#### Gibbs sampling of the multi-dimensional posterior in a coupled neural circuit

Given the generative model (Eq. 33), the joint posterior of *s*_{1} and *s*_{2} is a bivariate normal distribution whose precision matrix **K**_{p} and mean *μ*_{p} are,
The precision matrix of the posterior is the sum of the precisions of the likelihood and the prior, reflecting the increased reliability of the distribution after combining with the prior. Meanwhile, the posterior mean is the weighted average of the means of the two likelihoods, with weights proportional to the precision of each likelihood. We use this expression for the posterior to evaluate the performance of the proposed sampling-based algorithm.
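These two facts — precisions add, and the posterior mean is a precision-weighted average of the likelihood means — can be checked numerically. This is a sketch; the [[1, −1], [−1, 1]] form of the prior precision is our assumption, consistent with the properties of Eq. (32):

```python
import numpy as np

def posterior_params(mu_f, lam_f_diag, lam_s):
    """Precision matrix K_p and mean mu_p of the joint posterior over
    (s1, s2) (cf. Eq. 34), for likelihood precisions lam_f_diag and a
    band prior of strength lam_s (improper, so it contributes no mean)."""
    Lam_f = np.diag(lam_f_diag)
    Lam_s = lam_s * np.array([[1.0, -1.0], [-1.0, 1.0]])
    K_p = Lam_f + Lam_s                                    # precisions add
    mu_p = np.linalg.solve(K_p, Lam_f @ np.asarray(mu_f))  # weighted mean
    return K_p, mu_p
```

With the prior switched off (Λ_{s} = 0), the posterior mean reduces to the likelihood means; with Λ_{s} > 0, the two estimates are pulled toward each other along the diagonal.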

Using Gibbs sampling to approximate the posterior (Eq. 34) involves the following steps:
We note that we describe only the sampling from the posterior distribution of *s*_{1}, as samples from the posterior of *s*_{2} can be obtained similarly after exchanging indices. This sampling can be implemented in a neural circuit model consisting of several coupled networks, in which each network generates samples from the posterior distribution of the corresponding stimulus. The number of networks in the coupled circuit therefore equals the dimension of the latent stimuli. The dynamics of the coupled neural circuit is defined by:
We again note that the dynamics of network 2 can be obtained similarly by exchanging indices. To implement Gibbs sampling (Eqs. 35a-35b) in the coupled circuit (Eqs. 36-37), spike generation in network 1 (Eq. 37) can be used to produce stimulus samples when the conditional distribution determined by *λ*_{1t} matches the conditional distribution required in the definition of Gibbs sampling (Eq. 35a). Taking the logarithm of Eq. (35a) yields,
Comparing this expression with Eq. (36), we see that the feedforward input matches the corresponding conditional distribution (Eq. 33). We therefore require the recurrent input from network 2 to network 1 to encode the conditional distribution of *s*_{1} given the sample of *s*_{2}. This implies that the recurrent input should approximately have a Gaussian profile,
where the residual term quantifies the deviation from a perfect Gaussian profile and affects neither the decoded value nor Λ_{s}.

The recurrent input (Eq. 39) has the same width, *a*, as the neuronal response, **r**_{1}. In a circuit containing only E neurons, if the two networks have the same number of neurons, then across networks only neurons with the same preferred stimulus should be connected. The optimal recurrent weight between the two networks is then
Since each network individually generates a stimulus sample, the sample of stimulus *m* can be read out locally from network *m*'s responses even if the activities of the two networks are correlated (Fig. 6A), which greatly simplifies readout. Furthermore, because the population firing rate of each network has a Gaussian profile, the stimulus sample can be linearly read out from **r**_{mt} as
We note that the circuit implementation of Gibbs sampling from a multi-dimensional posterior (Eq. 8a) does not require recurrent connections between E neurons within a network. This is due to the assumption that the marginal prior of each stimulus feature, *p*(*s*_{m}), is uniform. For a non-uniform marginal prior, *p*(*s*_{m}), recurrent connections between E neurons within a network would be required to generate samples from a distribution that matches the true posterior.
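The linear readout of a stimulus sample from a Gaussian activity profile amounts to a population vector. A minimal sketch (Python; the circular-mean form below is a standard choice and may differ in detail from the paper's exact readout in Eq. 41):

```python
import numpy as np

def population_vector_readout(rates, preferred):
    """Read a stimulus sample out of a Gaussian population activity
    profile as the circular mean of the neurons' preferred stimuli
    weighted by their firing rates (cf. Eq. 41)."""
    angles = np.deg2rad(np.asarray(preferred, dtype=float))
    vec = np.sum(np.asarray(rates) * np.exp(1j * angles))  # resultant vector
    return np.rad2deg(np.angle(vec))                       # position in degrees
```

Because the readout uses only the activity of one network, the sample of each stimulus can be decoded locally, as stated above.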

### Inference from an information-theoretic point of view

The goal of the sampling algorithm is to approximate the posterior distribution of the latent variables, Θ, given a feedforward input, **u**^{f}. Specifically, Θ = {*s, z*} in the hierarchical generative model (Eq. 23), or Θ = **s** = {*s*_{1}, *s*_{2}} in the generative model with breadth (Eq. 33). When the sampling algorithm uses an internal model that does not match the structure of the generative model, the sampling distribution, *q*(Θ|**u**^{f}), will differ from the true posterior, *p*(Θ|**u**^{f}) (Eq. 24). In this case the mutual information between the sampled latent variables, Θ, and **u**^{f} will be smaller than when samples come from the true posterior, *p*(Θ|**u**^{f}),
It is straightforward to show that the difference between *I*(Θ, **u**^{f}) and *I*_{q}(Θ, **u**^{f}) is the Kullback-Leibler (KL) divergence between *p* and *q*, i.e., *D*_{KL}[*p*||*q*] = *I*(Θ, **u**^{f})−*I*_{q}(Θ, **u**^{f}) = 𝔼_{p}(ln *p*−ln *q*) ≥ 0. Equality in Eq. (42) holds only if the distribution *q* matches the true posterior *p*.
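The information gap can be evaluated with the standard closed form for the KL divergence between multivariate normals; the text states only the general relation *D*_{KL}[*p*||*q*] = *I* − *I*_{q}, so the Gaussian formula below is textbook material rather than the paper's derivation:

```python
import numpy as np

def kl_gauss(mu_p, K_p, mu_q, K_q):
    """D_KL[p||q] for multivariate normals parameterized by their means
    and *precision* matrices; equals the information loss I - I_q when
    q is a mismatched sampling distribution."""
    Sig_p = np.linalg.inv(K_p)                  # covariance of p
    diff = np.asarray(mu_q) - np.asarray(mu_p)
    k = len(diff)
    return 0.5 * (np.trace(K_q @ Sig_p) + diff @ K_q @ diff - k
                  + np.log(np.linalg.det(K_p) / np.linalg.det(K_q)))
```

The divergence vanishes exactly when *q* matches *p*, recovering the equality condition in Eq. (42).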

The mutual information *I*_{q}(Θ; **u**^{f}) can be computed analytically when the approximating distribution is a bivariate normal (substituting Eqs. 23 and 24 into Eq. 42),
Here *L* = 360° is the length of the stimulus feature subspace, while *μ*_{p} and **K**_{p} are the mean and the precision matrix of the posterior distribution (Eqs. 24 or 34). When *q* matches the posterior distribution, *p*, the two information quantities coincide, *I*_{q}(Θ; **u**^{f}) = *I*(Θ; **u**^{f}).

### The neuronal response distribution conditioned on external stimulus

We compute the distribution of neuronal responses **r** over time/trial in response to an external stimulus *s*, i.e., *p*(**r**|*s*), in order to find a neural signature of network sampling and compare it with experimental data. For a fixed external stimulus *s*, the neuronal response **r** fluctuates due to both sensory transmission noise described by *p*(**u**^{f}|*s*) (Eq. 18), as well as the internally generated variability described by *p*(**r**|**u**^{f}) (Fig. 4A). Therefore, the distribution of **r** in response to an external stimulus *s* has the form
For simplicity, we only compute the covariability of *p*(**r**|**u**^{f}) along the stimulus subspace (Fig. 1B, x-axis), because the covariability along other directions is not related to stimulus sampling. By approximating the Poissonian spiking variability *p*(**r**|**λ**) with a multivariate normal distribution (Eq. 11), and considering the limit of weak fluctuations in **λ**_{t} along the stimulus subspace over time, *p*(**r**|**u**^{f}) can be computed approximately as (see math details in Supplemental Information),

*p*(**r**|**u**^{f}) ≈ 𝒩(**r**; **f**(*s*), diag(**f**(*s*)) + *σ*²_{s}**f**′(*s*)**f**′(*s*)^{⊤}),

where **f**(*s*) = ⟨*λ*_{t}⟩ denotes the temporally averaged population response. The covariance structure of the neuronal response includes two terms: diag(**f**(*s*)), a diagonal matrix whose entries equal those of the vector **f**(*s*), denoting the (independent) Poisson spiking variability (Eq. 23), and *σ*²_{s}**f**′(*s*)**f**′(*s*)^{⊤}, a term that captures the covariability due to firing rate fluctuations along the stimulus subspace (Fig. 8A), where **f**′(*s*) is the derivative of **f**(*s*) with respect to the stimulus feature *s*. The covariance **f**′(*s*)**f**′(*s*)^{⊤} is often termed differential (noise) correlations [4, 17]. With the Gaussian profile of **f**(*s*) (Eqs. 18 and 29), **f**′(*s*)**f**′(*s*)^{⊤} exhibits an anti-symmetric structure (Fig. 8B) [17, 22, 50, 68, 69].
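As an illustration of this covariance structure, the following sketch (in Python; the parameter names and the Gaussian tuning profile are our assumptions, not taken from the paper's code) builds the two-term covariance — independent Poisson variance plus differential correlations:

```python
import numpy as np

def response_covariance(pref, s, peak_rate, a, sigma2_s):
    """Covariance of r given u^f: diag(f(s)) plus internally generated
    differential correlations proportional to f'(s) f'(s)^T.

    pref: preferred stimuli (deg); s: stimulus position; a: tuning
    width (deg); sigma2_s: variance of the sampled stimulus over time.
    """
    d = (pref - s + 180.0) % 360.0 - 180.0      # circular distance to s
    f = peak_rate * np.exp(-d**2 / (2 * a**2))  # Gaussian profile f(s)
    fprime = f * d / a**2                       # derivative df/ds
    return np.diag(f) + sigma2_s * np.outer(fprime, fprime)
```

Neurons on opposite flanks of the activity bump are negatively correlated by the differential term, producing the anti-symmetric structure described above.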

In Eq. (44), the variance term multiplying the differential-correlation structure is the variance over time of the mean of the conditional distribution (Eq. 4a), and it characterizes the amplitude of internally generated differential correlations. In the network implementation, this conditional mean and *μ*_{f} are represented as the positions of *λ*_{t} and **u**^{f} on the stimulus subspace, respectively (Eqs. 14 and 20). The dynamics of Gibbs sampling (Eq. S20 in Supplemental Information) and the network structure (Eq. 6) imply that
Note that this variance is constrained by the network connections, in that it is internally generated and shared within the network.

An expression for *p*(**r**|*s*) can be derived similarly, and includes an additional term contributing to differential correlations compared with *p*(**r**|**u**^{f}) (Eq. 44) due to fluctuations in the feedforward inputs,
Here the variance in the stimulus feature subspace is a mixture of the internal variability and the sensory noise, *V*(*μ*_{f}|*s*) (Eq. 23). The neuronal response distribution in coupled networks (Fig. 6A) can be obtained similarly (see the Supplemental Information).

### A spiking network model with excitatory and inhibitory Poisson neurons

To test the proposed inference mechanisms in a network consisting of E neurons (Eqs. 10-37), we simulated a well-studied recurrently coupled cortical model [21, 22]. The network consisted of *N*_{E} excitatory (*E*) and *N*_{I} inhibitory (*I*) spiking neurons, with the activity of each neuron modeled as a Hawkes process [65]. At time *t*, we represent the response of neuron *j* in population *a* ∈ {*E, I*} as a spike count drawn from a Poisson distribution with instantaneous firing rate,
Each neuron has a refractory period of 2ms after emitting a spike. The firing rate is the sum of the feedforward and recurrent inputs. The feedforward inputs are filtered spikes from upstream neurons, where the *n*^{th} feedforward spike received by neuron *j* of population *a* enters through the synaptic input profile *η*(*t*), modeled as *η*(*t*) = exp(−*t/τ*_{d})*/τ*_{d} for *t* > 0. Throughout, we set the synaptic time constant *τ*_{d} = 2ms. To mimic the Poisson-like variability needed to sample a context in the hierarchical generative model (Eqs. 23 and 31), the recurrent input received by neuron *j* in population *a* is defined by
where the mean recurrent input at time *t* is determined by the activities of the presynaptic neurons. The recurrent input in the network is corrupted by noise whose variance equals the mean of the recurrent input. In a physiological network, recurrent noise may be generated by chaotic network dynamics [70] or by synaptic noise [66, 67]. In Eq. (48) the function [·]_{+} rectifies negative inputs, and *ξ*_{t} is a random variable following a standard Gaussian distribution. The synaptic weight coefficient couples neuron *j* in population *b* to neuron *i* in population *a*, and the associated time is that of the *n*^{th} spike fired by neuron *k* in population *b*. The parameter *N* = *N*_{E} + *N*_{I} is the total number of neurons in the network. The scaling of the synaptic weights by 1*/*√*N* is standard in networks where excitation is balanced by recurrent inhibition [70]. Finally, the synaptic input profile of the recurrent input, *η*(*t*), is the same as the one we chose for the feedforward input, for convenience. Note that the rectification of recurrent inputs in Eq. (48) introduces errors that cause the sampling distribution to deviate from the true posterior; hence we chose the recurrent weights to be small (Fig. 5). The rectification arises only when using (continuous) recurrent inputs to sample the context variable, and it does not impact the generality of sampling via (discrete) Poisson spiking variability.
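A minimal time-stepping sketch of Eqs. (47-48) follows (Python rather than the paper's MATLAB; synaptic filtering by *η*(*t*) and the refractory period are omitted, and the diffusion scaling of the noise with the time step is our assumption):

```python
import numpy as np

def simulate_step(rate_ff, rec_mean, dt, rng):
    """One Euler step: spike counts drawn from a Poisson distribution
    whose rate is the feedforward input plus a noisy recurrent input
    with variance equal to its rectified mean (Eqs. 47-48)."""
    rec_plus = np.maximum(rec_mean, 0.0)                # [.]_+ rectification
    xi = rng.normal(size=np.shape(rec_mean))            # Gaussian white noise
    rec_noisy = rec_mean + np.sqrt(rec_plus / dt) * xi  # Eq. (48), diffusion scaling
    rate = np.maximum(rate_ff + rec_noisy, 0.0)         # rates cannot be negative
    return rng.poisson(rate * dt)                       # Eq. (47), spike counts in dt
```

The `sqrt(rec_plus / dt)` factor makes the variance of the integrated recurrent noise over a bin of width `dt` equal to its mean, matching Poisson count statistics.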

To model the coding of a circular stimulus such as orientation, the excitatory neurons are arranged on a ring [22, 68]. The preferred stimuli, *θ*_{j}, of the excitatory neurons are equally spaced on the interval (−180°, 180°], consistent with the range of latent features (Eq. 21). Inhibitory neurons are not tuned to the stimulus, and their role is to stabilize network responses. Note that the recurrent connections between *E* neurons are modeled using a Gaussian function decaying with the distance between the stimuli preferred by the two cells, rather than the self-connections alone used in the simplified network containing only E neurons (Eq. 30),
We imposed periodic boundaries on the Gaussian function to avoid boundary effects in simulations. Although in the generative model we assumed non-periodic feature variables (Eq. 3), as long as the variances of the associated distributions are smaller than the width of the feature space, the network model with periodic boundaries on the recurrent connections (Eq. 49) provides a good approximation of the non-periodic Gaussian posterior (Eq. 24). The weight *w*_{EE} denotes the average strength of all *E* to *E* connections. The parameter *a* = 40° defines the footprint of connectivity in feature space (i.e., the ring), and *L* = 360° is the length of the ring manifold (Eq. 21). Multiplication by *L* in Eq. (49) sets the sum of all *E* to *E* connection strengths equal to *N*_{E}*w*_{EE}. Moreover, the excitatory and inhibitory neurons are all-to-all connected with each other (and similarly for *I* to *I* connections). For simplicity, we consider the *E* to *I*, *I* to *I*, and *I* to *E* connections all to be unstructured (in feature space) and assume that connections of the same type have equal weight. To simplify the network further, we consider the connections from the same population of neurons to have the same average weight, i.e., *w*_{EE} = *w*_{IE} ≡ *w*_{E} and *w*_{II} = *w*_{EI} ≡ *w*_{I}. For the feedforward network model shown in Fig. 2, we remove only the recurrent connections between E neurons, i.e., we set *w*_{EE} = 0, while keeping the other connections, including *w*_{EI}, *w*_{II}, and *w*_{IE}, the same as in the recurrent network.
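The ring connectivity of Eq. (49) can be written down directly. The normalization below — a periodic Gaussian density scaled by *L/N*_{E} — is our reading of the stated property that the summed *E*-to-*E* weights equal *N*_{E}*w*_{EE}:

```python
import numpy as np

def ring_weights(n_e, w_ee, a=40.0, L=360.0):
    """E-to-E weight matrix on the ring (cf. Eq. 49): a Gaussian in the
    periodic distance between preferred stimuli, scaled so that all
    weights sum to n_e * w_ee."""
    pref = np.arange(n_e) * L / n_e - L / 2  # preferred stimuli (deg)
    d = pref[:, None] - pref[None, :]
    d = (d + L / 2) % L - L / 2              # periodic (wrapped) distance
    g = np.exp(-d**2 / (2 * a**2)) / (np.sqrt(2 * np.pi) * a)
    return w_ee * (L / n_e) * g
```

Each row then sums to approximately *w*_{EE}, so the total connection strength scales linearly with network size, consistent with the normalization described above.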

The feedforward inputs applied to E neurons consist of independent Poisson spike counts, as described by Eq. (18). The inhibitory neurons also receive independent feedforward Poisson inputs. The firing rate of the input received by every *I* neuron is proportional to the overall rate of the feedforward input to *E* neurons, in order to maintain the balance of excitatory and inhibitory activity in the network,
In the simulations, we started with a network of *N*_{E} = 180 excitatory and *N*_{I} = 45 inhibitory neurons, and increased the number of neurons by a fixed factor in Fig. 1D. The ratio between the average connection strength from *I* neurons and that from *E* neurons was kept fixed at *w*_{I}*/w*_{E} = 5. We set the feedforward weight of the input to I neurons to *w*_{If} = 0.8. We simulated the dynamics of the network using the Euler method with a time step of 0.1ms. Typical parameter values used in the simulations can be found in Table 1 in the Supplemental Information, where further details about the simulations and the numerical estimates of mutual information and linear Fisher information are also presented. The network simulation code was written in MATLAB 2018b and is available on GitHub (https://github.com/wenhao-z/Sampling_PoissSpk_Neuron).

#### A spiking network model of coupled neural circuits

In the coupled neural circuits used to infer latent variables organized in parallel (Fig. 6A), the two networks are copies of each other, i.e., they have the same intrinsic parameters. Each network is equivalent to the one described in the previous section, except that there are no recurrent connections between E neurons within a network, and no variability in the recurrent interactions (no noise in Eq. 48). The absence of recurrent connections between E neurons in the same network is due to the uniform marginal prior of the stimulus. Nevertheless, within each network the E and I neurons are connected using the same connection profile as above to keep network activity stable. Between the two networks there are only E connections, which target both E and I neurons. The connections between E neurons across networks follow the same pattern as that described by Eq. (49), with a peak connection strength from network *n* to network *m*. The connections from E neurons in one network to I neurons in the other are set equal to this peak strength for simplicity. To simplify the network model further, we set the inter-network connections to be symmetric. In the simulations, the inter-network connection strength was adjusted to determine how the sampling distribution is affected (Fig. 7A).

#### Comparing the sampling distribution with posterior in coupled neural circuits

We read out samples from the posterior distribution of each stimulus individually from the spiking activities of the E neurons, **r**_{mt}, in network *m* in every 20ms time window by using a population vector. We used this collection of samples to estimate the mean and covariance matrix, **Σ**_{s}, of the sampling distribution. Meanwhile, the mean, *μ*_{f}, and precision matrix, **Λ**_{f}, of the likelihood are linearly read out from the feedforward inputs fed into the network model (Eq. 33).

If the sampling distribution matches the posterior, the sampling mean and covariance, **Σ**_{s}, should satisfy Eq. (34). We use the actual sampling covariance and the likelihood parameters to predict the sampling mean and compare it with the actual one (Fig. 7D-F). To obtain the posterior precision matrix given the sampling mean and the likelihood parameters, we vary the single prior-precision parameter, Λ_{s}, to minimize the KL divergence between the posterior predicted using that value of Λ_{s} and the actual sampling distribution. Given this value of Λ_{s}, the predicted posterior precision is computed as **K**_{pred} = **Λ**_{s} + **Λ**_{f} (Eq. 34), which is then compared with the actual sampling precision matrix (Fig. 7C-G). The prior precision, Λ_{s}, is a *subjective* prior which reflects the prior stored in the recurrent network and may change with the input (see Discussion). More details of the network simulations and parameters can be found in the Supplemental Information.
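The comparison between the sampling distribution and the predicted posterior can be sketched as follows (Python; the explicit form Λ_{s}[[1, −1], [−1, 1]] for the prior precision matrix is an assumption consistent with the band-structured prior of Eq. 32):

```python
import numpy as np

def predict_posterior(samples, mu_f, Lam_f, lam_s):
    """Predict posterior mean and precision from the likelihood
    parameters and a prior precision lam_s (Eq. 34), and return them
    with the empirical precision of the sampling distribution."""
    Lam_s = lam_s * np.array([[1.0, -1.0], [-1.0, 1.0]])  # assumed prior precision
    K_pred = Lam_s + Lam_f                                # K_pred = Lambda_s + Lambda_f
    mu_pred = np.linalg.solve(K_pred, Lam_f @ np.asarray(mu_f))
    K_emp = np.linalg.inv(np.cov(np.asarray(samples).T))  # empirical sampling precision
    return mu_pred, K_pred, K_emp
```

In practice one would scan `lam_s` and keep the value minimizing the KL divergence between the predicted posterior and the empirical sampling distribution, as described above.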

## Acknowledgements

National Institutes of Health grants 1R01MH115557 (K.J.), 1U19NS107613-01 (B.D.), R01EB026953 (B.D.); National Science Foundation grant NSF-DBI-1707400 (K.J.). Vannevar Bush faculty fellowship N00014-18-1-2002 (B.D); Simons Foundation Collaboration on the Global Brain (B.D.).

## References

- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].
- [58].
- [59].
- [60].
- [61].
- [62].
- [63].
- [64].
- [65].
- [66].
- [67].
- [68].
- [69].
- [70].