## Abstract

For animals to navigate an uncertain world, their brains need to estimate uncertainty at the timescales of sensations and actions. Sampling-based algorithms afford a theoretically-grounded framework for probabilistic inference in neural circuits, but it remains unknown how one can implement fast sampling algorithms in biologically-plausible spiking networks. Here, we propose to leverage the population geometry, controlled by the neural code and the neural dynamics, to implement fast samplers in spiking neural networks. We first show that two classes of spiking samplers (efficient balanced spiking networks that simulate Langevin sampling, and networks with probabilistic spike rules that implement Metropolis-Hastings sampling) can be unified within a common framework. We then show that careful choice of population geometry enables rapid inference of parameters drawn from strongly-correlated high-dimensional distributions in both networks. Our results suggest design principles for algorithms for sampling-based probabilistic inference in spiking neural networks, yielding potential inspiration for neuromorphic computing and testable predictions for neurobiology.

## 1 Introduction

Neural circuits perform probabilistic computations at the sensory, motor and cognitive levels [1–3]. From abstract representations of decision confidence [4] to estimates of sensory uncertainty in visual cortex [5, 6], evidence of probabilistic representations can be found at all levels of the cortical processing hierarchy [7]. To be behaviorally useful, these probabilistic computations must occur at the speed of perception [8]. However, how neuronal dynamics allow brain circuits to represent uncertainty in high-dimensional spaces at perceptual timescales remains unknown [9–12].

Several neural architectures for probabilistic computation have been proposed, including: probabilistic population codes [13], which allow a direct readout of uncertainty under some assumptions; direct encoding of metacognitive variables, such as the confidence (the probability of being correct) in a decision [3, 4, 7]; doubly distributional codes [14, 15], which distinguish uncertainty from multiplicity; and sampling-based codes [8, 16–23], in which the variability in neural dynamics is a signature of exploration of the posterior probability. Of these approaches, sampling-based codes are rooted in the strongest theoretical framework in the statistics and machine learning literature [24–29], and have been used to perform inference at scale [30–32].

However, it remains unclear how spiking neural networks could implement sampling-based algorithms for fast inference in high-dimensional feature spaces. Most experiments quantifying uncertainty representations in single biological neurons have only varied parameters along one or two dimensions, such as in Bayesian cue combination [2, 10]. In these conditions, many algorithms can perform adequately, but it is well-known that once the dimensionality of the problem increases, probabilistic inference becomes slower [33]. Several approaches have been proposed in biologically inspired sampling algorithms to speed up the convergence of sampling, such as adding non-reversible dynamics [8] or using a distributed code to parallelize the inference [19]. Yet, these proposals have not shown how accurate sampling from high-dimensional distributions at behaviorally-relevant timescales can be achieved using spiking networks.

In this paper, we show how the choice of the geometry of neural representations at the population level [34, 35], set by the neural code and the neural dynamics, can accelerate sampling-based inference in spiking neural networks. Concretely, we leverage recently-proposed methods for accelerating sampling from the machine learning literature to implement efficient samplers in spiking neural networks [33, 36, 37]. The structure and major contributions of this paper are divided as follows:

In §2, we construct from first principles a novel spiking neural network model for sampling from multivariate Gaussian distributions. This model is based on a probabilistic spike rule that implements approximate Metropolis-Hastings sampling. We show that efficient balanced networks (EBNs) [19, 38–41] emerge as a limit of this model in which spiking becomes deterministic.

In §3, we show that population geometry enables rapid sampling in spiking networks. Leveraging the “complete recipe” for stochastic gradient MCMC [36], we establish principles for the design of efficient samplers in spiking neural networks. Then, we show how neural population geometry enables fast sampling—on the timescale of tens of milliseconds—in two limits of the model introduced in §2: EBNs in which sampling is driven by stochastic Langevin dynamics [19], and networks in which sampling is driven purely by a Metropolis-Hastings probabilistic spiking rule.

Finally, in §4 we conclude by discussing the implications of our results in the context of prior works, and highlight their limitations as well as remaining open questions. In particular, we comment on possibilities for future experimental studies of sampling in biological spiking networks, and applications to neuromorphic computing.

In total, our results show how spiking networks can perform sampling-based inference at behaviorally relevant timescales.

## 2 Spiking networks for sampling-based probabilistic inference

We begin by proposing a framework for probabilistic inference in spiking neural networks in which the spiking rule implements a Metropolis-Hastings step. We show that EBNs [38–40] can be recovered as a limiting case of this more general framework. Though our general framework is based on Metropolis-Hastings sampling, we will later describe how it can implement accelerated Langevin sampling as a special limiting case.

In this section, we will keep our discussion quite general, but the problem we aim to solve can be given a concrete interpretation in a neuroscience context. The goal of a spiking neural network performing probabilistic inference is to estimate the posterior distribution of *n*_{p} parameters ***θ*** given an input **x**, *P*(***θ***|**x**) ∝ *P*(**x**|***θ***)*P*(***θ***) (Figure 1). The input could correspond to the activity of sensory neurons **x** in early sensory processing (e.g. input onto ganglion cells in the retina or onto mitral cells in the olfactory bulb) or inputs into a cortical column that sense features in the environment through an affinity matrix **A** (see Appendix C.2). In the rest of the paper, we will usually write *P*(***θ***) for the sampled distribution, which in a neuroscience setting would stand for *P*(***θ***|**x**).

### 2.1 Deriving an approximate Metropolis-Hastings spiking sampler

We will build a network of *n*_{n} spiking neurons to sample an *n*_{p}-dimensional Gaussian distribution of time-varying mean ***θ***(*t*) and fixed covariance **Ψ**. We first consider the case in which ***θ*** is constant, and then generalize the resulting algorithm to the case in which it is slowly time-varying.

As in prior work on probabilistic inference using spiking networks, we take the parameter estimates to be linearly decoded from the filtered spike trains **r** of *n*_{n} neurons [38, 39, 41]. Working in discrete time for convenience and clarity (as in [41]), we let

$$\hat{\boldsymbol{\theta}}_t = \boldsymbol{\Gamma}\mathbf{r}_t, \qquad \mathbf{r}_t = (1-\eta)\,\mathbf{r}_{t-1} + \mathbf{o}_t, \tag{1}$$

where **Γ** is the readout matrix, **o**_{t} is the vector of spikes emitted at timestep *t*, and **r**_{t} is the low-pass filtered history of spikes for some decay constant 0 ≤ *η* ≤ 1.
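The decoding scheme can be sketched in a few lines of Python. This is a minimal illustration, assuming the filtered spike history obeys **r**_{t} = (1 − *η*)**r**_{t−1} + **o**_{t} and that the estimate is the linear readout **Γr**_{t}; the function names, toy readout weights, decay constant, and spike train are hypothetical values chosen for illustration.

```python
def filter_step(r_prev, spikes, eta):
    """One leaky-integration step: r_t = (1 - eta) * r_{t-1} + o_t."""
    return [(1.0 - eta) * r + o for r, o in zip(r_prev, spikes)]


def readout(gamma, r):
    """Linear decoding of the parameter estimate: Gamma @ r."""
    return [sum(g * x for g, x in zip(row, r)) for row in gamma]


# Hypothetical toy network: one parameter, two neurons with opposite weights.
gamma = [[0.5, -0.5]]
r = [0.0, 0.0]
spike_train = [[1, 0], [0, 0], [0, 1]]  # at most one spike per timestep

estimates = []
for o_t in spike_train:
    r = filter_step(r, o_t, eta=0.5)
    estimates.append(readout(gamma, r)[0])
```

Note that the estimate moves at the second timestep even though no spike is emitted: this is the decay-induced drift of the readout that the construction below has to account for.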

Metropolis-Hastings sampling constructs a Markov chain by drawing a proposed next state from some distribution, and then deciding whether to accept or reject that proposal based on a probabilistic rule [42, 43]. The acceptance ratio is given in terms of the relative posterior probability of the proposed and current states. Here, our proposal distribution will be a distribution over which neuron spikes at a given timestep. In general, the proposal distribution could be computed using a stochastic gradient step followed by an accept/reject step, yielding Metropolis-adjusted Langevin dynamics [44, 45]. In this case, the accept/reject step allows the algorithm to compensate for some of the sampling error introduced by discretization at the expense of needing to compute a likelihood ratio, which can be expensive for high-dimensional distributions.

In trying to build a Metropolis-Hastings sampler [42] using a probabilistic spiking rule, we are immediately faced with two problems. First, the spikes are sign-constrained and discrete. Second, the dynamics of the filtered spike history incorporate a decay term, hence the readout will change even if no spikes are emitted. These conditions mean that the proposal density over the readout will not be symmetric, and that the Markov process will not satisfy the condition of detailed balance [42, 43]. The first problem can be solved by imposing a fine-tuned balancing condition on the readout weights. We assume that the network is divided into two equally-sized populations with equal and opposite readout weights, i.e. that the readout matrix is of the form **Γ** = [+**Z**, −**Z**] for some matrix **Z** with *n*_{p} rows and *n*_{n}/2 columns. This could be accomplished by dividing the total population of neurons into excitatory and inhibitory populations of equal size, with equal and opposite sign-constrained weights. The second problem can be solved by assuming that *η* = 0, i.e., that we have access to a perfect integrator of the spike trains.

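At the *t*-th timestep, we choose one neuron *j* uniformly at random, and let the spike proposal be **o**′ = **e**_{j}, where **e**_{j} is the *j*-th standard Euclidean basis vector (i.e., (**e**_{j})_{k} = *δ*_{jk}). More explicitly, the index *j* of the proposed spike is sampled uniformly from {1, …, *n*_{n}}. This yields a candidate readout

$$\hat{\boldsymbol{\theta}}' = (1-\eta)\,\boldsymbol{\Gamma}\mathbf{r}_{t-1} + \boldsymbol{\Gamma}\mathbf{e}_j. \tag{2}$$

If *η* = 0 and the balance assumption on **Γ** is satisfied, then the proposal distribution is exactly symmetric, in the sense that the probabilities of reaching the candidate readout from the current readout **Γr**_{t−1} and of reaching the current readout from the candidate readout are equal. Then, if we accept the proposed spike with probability

$$A = \min\left\{1, \frac{P(\hat{\boldsymbol{\theta}}')}{P(\boldsymbol{\Gamma}\mathbf{r}_{t-1})}\right\}, \tag{3}$$

we obtain a Metropolis-Hastings sampling algorithm [42, 43].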
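The perfect-integrator sampler can be illustrated with a minimal sketch. It assumes a one-dimensional Gaussian target, two neurons with balanced readout weights **Γ** = [+*z*, −*z*], *η* = 0, and the standard Metropolis-Hastings acceptance probability for a symmetric proposal. The function names, target parameters, weight scale, and number of steps are hypothetical; this is not the code used for the simulations in this paper.

```python
import math
import random


def log_density(x, mean, var):
    """Unnormalized log-density of a one-dimensional Gaussian."""
    return -0.5 * (x - mean) ** 2 / var


def spiking_mh(mean, var, z, n_steps, seed=0):
    """Spiking Metropolis-Hastings sampler with eta = 0 and Gamma = [+z, -z]."""
    rng = random.Random(seed)
    gamma = [z, -z]    # balanced readout: one "up" neuron, one "down" neuron
    counts = [0, 0]    # perfectly integrated spike counts r
    samples = []
    for _ in range(n_steps):
        est = z * (counts[0] - counts[1])   # current readout Gamma r
        j = rng.randrange(2)                # propose a spike in neuron j
        cand = est + gamma[j]               # candidate readout
        log_ratio = log_density(cand, mean, var) - log_density(est, mean, var)
        # Accept with probability min(1, P(cand) / P(est)).
        if math.log(1.0 - rng.random()) <= log_ratio:
            counts[j] += 1
            est = cand
        samples.append(est)
    return samples


# Discard a short burn-in, then compare sample moments with the target.
samples = spiking_mh(mean=1.0, var=0.25, z=0.1, n_steps=200_000)[1000:]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

Because each accepted spike moves the readout by ±*z*, the chain is a random walk on a lattice of spacing *z*; when *z* is small compared with the target standard deviation, the sample mean and variance should approach those of the target.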

We now relax the assumption of perfect integration, and assume only that *η* ≪ 1. With the same proposal distribution as before, we take the acceptance ratio of the accept-reject step to be

$$A = \min\left\{1, \frac{P\left[(1-\eta)\,\boldsymbol{\Gamma}\mathbf{r}_{t-1} + \boldsymbol{\Gamma}\mathbf{e}_j\right]}{P\left[(1-\eta)\,\boldsymbol{\Gamma}\mathbf{r}_{t-1}\right]}\right\}. \tag{4}$$

This choice has two important features. First, the decay means that the proposal distribution will be asymmetric, and the Markov chain will no longer satisfy the condition of detailed balance [42, 43]. However, the resulting error will be small if *η* ≪ 1. Moreover, by comparing the likelihood of the proposal, *P*[(1 − *η*)**Γr**_{t−1} + **Γe**_{j}], to the likelihood of the next state without the proposed spike but with the decay, *P*[(1 − *η*)**Γr**_{t−1}], this choice implements a sort of look-ahead step that should allow the algorithm to partially compensate for the decay in the rate.

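With the choices above, we show in Appendix B.2 that one can write the acceptance ratio as

$$A = \min\left\{1, \exp\left(V_{t,j} - T_j\right)\right\}, \tag{5}$$

where

$$\mathbf{V}_t = \boldsymbol{\Gamma}^{T}\boldsymbol{\Psi}^{-1}\left[\boldsymbol{\theta} - (1-\eta)\,\boldsymbol{\Gamma}\mathbf{r}_{t-1}\right] \tag{6}$$

has the interpretation of a membrane potential,

$$T_j = \tfrac{1}{2}\,\Omega_{jj} \tag{7}$$

has the interpretation of a spiking threshold, and the recurrent weight matrix is defined as

$$\boldsymbol{\Omega} = \boldsymbol{\Gamma}^{T}\boldsymbol{\Psi}^{-1}\boldsymbol{\Gamma}. \tag{8}$$

Thus far, we have assumed that the mean signal is constant. The natural generalization of this algorithm to a time-varying mean signal ***θ***_{t} is to take the membrane potential to be

$$\mathbf{V}_t = \boldsymbol{\Gamma}^{T}\boldsymbol{\Psi}^{-1}\left[\boldsymbol{\theta}_t - (1-\eta)\,\boldsymbol{\Gamma}\mathbf{r}_{t-1}\right]. \tag{9}$$

Using the rate update **r**_{t} = (1 − *η*)**r**_{t−1} + **o**_{t} for the spike vector **o**_{t}, this leads to the voltage dynamics

$$\mathbf{V}_{t+1} = (1-\eta)\,\mathbf{V}_t + \boldsymbol{\Gamma}^{T}\boldsymbol{\Psi}^{-1}\left[\boldsymbol{\theta}_{t+1} - (1-\eta)\,\boldsymbol{\theta}_t\right] - (1-\eta)\,\boldsymbol{\Omega}\mathbf{o}_t, \tag{10}$$

which, when combined with the probabilistic spiking rule with uniform proposals and acceptance ratio (5), yields our final algorithm. This will not be an exact Metropolis-Hastings sampler unless the mean is constant, the decay term vanishes (*η* = 0), and the readout matrix **Γ** satisfies an exact balancing condition. In particular, if these conditions are violated, the resulting Markov chain will not satisfy detailed balance. However, if they are violated only weakly, one heuristically expects this algorithm will be a reasonable approximation to a true sampler. We will provide empirical evidence for this intuition in §3.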
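The equivalence between the likelihood-ratio form of the acceptance ratio and its voltage-threshold form can be checked numerically. The sketch below assumes a Gaussian target and takes the membrane potential to be **V** = **Γ**^{T}**Ψ**^{−1}[***θ*** − (1 − *η*)**Γr**_{t−1}] and the threshold to be *T*_{j} = (**Γ**^{T}**Ψ**^{−1}**Γ**)_{jj}/2; these explicit forms follow from expanding the Gaussian log-likelihood ratio and should be read as assumptions of this example. All numerical values and helper names are hypothetical.

```python
import math


def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def log_p(x, mean, prec):
    """Unnormalized Gaussian log-density with precision matrix prec."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    return -0.5 * sum(di * qi for di, qi in zip(d, matvec(prec, d)))


# Hypothetical toy problem: n_p = 2 parameters, n_n = 4 neurons, Gamma = [+Z, -Z].
theta = [1.0, -0.5]
prec = inv2([[1.0, 0.6], [0.6, 1.0]])            # Psi^{-1}
gamma = [[0.3, 0.1, -0.3, -0.1],
         [0.0, 0.2, 0.0, -0.2]]
columns = [list(col) for col in zip(*gamma)]     # column j is Gamma e_j
eta = 0.05
r_prev = [2.0, 1.0, 0.5, 3.0]

decayed = [(1.0 - eta) * x for x in matvec(gamma, r_prev)]  # (1 - eta) Gamma r_{t-1}
residual = matvec(prec, [t - d for t, d in zip(theta, decayed)])

direct, via_voltage = [], []
for col in columns:
    cand = [d + g for d, g in zip(decayed, col)]
    # Likelihood ratio of the proposal versus the decayed state.
    direct.append(math.exp(log_p(cand, theta, prec) - log_p(decayed, theta, prec)))
    v_j = sum(g * q for g, q in zip(col, residual))                 # potential V_j
    t_j = 0.5 * sum(g * q for g, q in zip(col, matvec(prec, col)))  # threshold T_j
    via_voltage.append(math.exp(v_j - t_j))
```

The two lists agree up to floating-point error, which is why each neuron only needs its own voltage and threshold to evaluate the accept/reject step.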

### 2.2 The continuous-time limit

We now consider the continuous-time limit of the model introduced above. This limit corresponds to taking the limit in which spike proposals are made infinitely often, and regarding the dynamics written down previously as a forward Euler discretization of an underlying continuous-time system. For a timestep Δ between spike proposals, we let the discrete-time decay rate be *η* = Δ/*τ*_{m} for a time constant *τ*_{m}, so named because it has the interpretation of a membrane time constant. Then, we show in Appendix B.4 that the Δ ↓ 0 limit of the discretized rate dynamics (1) yields the familiar continuous-time dynamics

$$\frac{d\mathbf{r}}{dt} = -\frac{1}{\tau_m}\,\mathbf{r} + \mathbf{o}(t). \tag{11}$$

In continuous time, the spike train **o**(*t*) is now composed of Dirac delta functions, as the discretized spikes are rectangular pulses of width Δ in time and height 1/Δ. We next consider the voltage dynamics of the leaky integrator for a varying mean signal (10), which have a similar continuum limit (with recurrent weights **Ω** = **Γ**^{T}**Ψ**^{−1}**Γ**):

$$\frac{d\mathbf{V}}{dt} = -\frac{1}{\tau_m}\,\mathbf{V} + \boldsymbol{\Gamma}^{T}\boldsymbol{\Psi}^{-1}\left(\frac{d\boldsymbol{\theta}}{dt} + \frac{1}{\tau_m}\,\boldsymbol{\theta}\right) - \boldsymbol{\Omega}\,\mathbf{o}(t). \tag{12}$$

In this limit, the rate will decay by only an infinitesimal amount between a rejected spike proposal and the next proposal, meaning that the error incurred by neglecting the asymmetry in the proposal distribution due to the decay should be negligible. We also note that, though the probabilistic spike rule (5) does not explicitly include a reset step, the dynamics (12) prescribe that the *j*-th neuron’s membrane voltage should be decremented by 2*T*_{j} after it spikes.

### 2.3 Efficient balanced networks as a limit of the spiking Metropolis-Hastings sampler

This spiking network samples the posterior distribution by emitting spikes probabilistically, but we can use the same architecture to re-derive EBNs, which approximate continuous dynamical systems using spiking networks [38–40]. If we take **Ψ** = **I** (for **I** the *n*_{p} × *n*_{p} identity matrix), the voltage dynamics (12) are identical to those of the EBN.^{1} The greedy spiking rule of the EBN can be recovered in this framework by taking a limit in which the variance of the Gaussian target distribution vanishes. Concretely, we let **Ψ** = *ψ***I** and define re-scaled variables **Ṽ** = *ψ***V**, *T̃*_{j} = *ψT*_{j} and **Ω̃** = *ψ***Ω** that will remain 𝒪(1) even as we take *ψ* ↓ 0. In terms of these new variables, the acceptance ratio of our probabilistic spiking network is min{1, exp[(*Ṽ*_{j} − *T̃*_{j})/*ψ*]}, which tends to the Heaviside step function Θ(*Ṽ*_{j} − *T̃*_{j}) as *ψ* ↓ 0. This explicitly recovers the greedy spike rule used in EBNs.
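The approach to the greedy rule can be made concrete with a short numerical sketch. It assumes that, in the re-scaled variables, the acceptance probability takes the form min{1, exp[(*Ṽ*_{j} − *T̃*_{j})/*ψ*]}; the function name and the voltage and threshold values are hypothetical.

```python
import math


def accept_prob(v, thresh, psi):
    """Acceptance probability min(1, exp((v - thresh) / psi)) in re-scaled variables."""
    x = (v - thresh) / psi
    return 1.0 if x >= 0.0 else math.exp(x)


voltages = [0.3, 0.45, 0.5, 0.7]   # hypothetical re-scaled membrane potentials
threshold = 0.5                    # hypothetical re-scaled threshold
for psi in (1.0, 0.1, 0.001):
    probs = [accept_prob(v, threshold, psi) for v in voltages]
    print(f"psi={psi}: " + ", ".join(f"{p:.3g}" for p in probs))
```

As *ψ* shrinks, sub-threshold proposals are accepted with vanishing probability while proposals at or above threshold are always accepted, which is the deterministic threshold-crossing rule.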

The network with voltage dynamics (12) samples a distribution with mean ***θ*** and covariance **Ψ**. Instead of sampling using the structured proposal distribution on spikes, this network can implement sampling through slowly varying Langevin dynamics on ***θ***. In the limit where the spike rule becomes greedy, this recovers the spiking sampler studied by Savin and Denève [19]. We will discuss this model further in §3.

## 3 Population geometry for fast sampling

We now demonstrate how population geometry enables sampling on behaviorally-relevant timescales of tens of milliseconds in two limits of the model introduced in §2.

### 3.1 Leveraging the geometry of inference to accelerate sampling

We first review recent work from the machine learning literature on how geometry can be chosen to accelerate simple Langevin sampling, which establishes principles for the design of fast samplers. For probability distributions belonging to the exponential family, including the Gaussian distributions on which we focus in this work, one can write the density *P*(***θ***) in terms of an energy function *U*(***θ***) such that *P*(***θ***) ∝ exp[−*U*(***θ***)]. The classic algorithm to sample such a distribution is the discretization of the naïve Langevin dynamics

$$d\boldsymbol{\theta} = -\nabla U(\boldsymbol{\theta})\,dt + \sqrt{2}\,d\mathbf{W}(t), \tag{13}$$

where **W**(*t*) is a standard Brownian motion [24–26, 29]. By simply following these dynamics, one can obtain samples from the target distribution and therefore an estimate of the uncertainty at the timescale taken by the network to sufficiently explore the target distribution.

The Langevin dynamics (13) can be directly implemented in a rate network [8, 18] or approximately implemented in a spiking network [19], but their convergence properties for high-dimensional distributions have not been investigated in a neuroscience context. It is well known in statistics that, as the dimensionality of the target distribution increases, convergence of Langevin sampling to the target distribution slows dramatically. Furthermore, the discretization step can induce errors that cause the variance estimated from sampling to exceed the target variance [24–26, 29].
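The excess variance caused by discretization can be seen in a minimal one-dimensional sketch. It assumes a forward Euler (Euler-Maruyama) discretization of Langevin dynamics for a Gaussian target with energy *U*(*θ*) = (*θ* − *μ*)²/(2*s*²). For this linear case the discretized chain is a first-order autoregressive process, so its stationary variance can be computed in closed form as *s*²/(1 − *ε*/(2*s*²)) for step size *ε*. The function names, step size, and chain length are hypothetical choices for illustration.

```python
import math
import random


def langevin_chain(mu, var, step, n_steps, seed=0):
    """Euler-Maruyama discretization of 1-D Langevin dynamics for N(mu, var)."""
    rng = random.Random(seed)
    theta = mu
    chain = []
    for _ in range(n_steps):
        grad_u = (theta - mu) / var   # gradient of U = (theta - mu)^2 / (2 var)
        theta += -step * grad_u + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        chain.append(theta)
    return chain


def discretized_variance(var, step):
    """Stationary variance of the discretized chain (AR(1) algebra)."""
    return var / (1.0 - step / (2.0 * var))


chain = langevin_chain(mu=0.0, var=1.0, step=0.5, n_steps=200_000)
mean = sum(chain) / len(chain)
empirical = sum((c - mean) ** 2 for c in chain) / len(chain)
predicted = discretized_variance(1.0, 0.5)   # larger than the target variance of 1
```

Shrinking the step size removes the bias but slows exploration, which is the trade-off that motivates the accelerated samplers discussed next.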

To overcome these issues, prior work in statistics and machine learning has proposed algorithms that can accelerate sampling [27, 33, 36, 37, 46–49], which we leverage here to propose our fast samplers in spiking neural networks. These ideas were unified into a common framework by Ma et al. [36] (see also [37, 50]), who proposed a “complete recipe” for stochastic gradient MCMC:

$$d\boldsymbol{\theta} = \left\{-\left[\mathbf{D}(\boldsymbol{\theta}) + \mathbf{S}(\boldsymbol{\theta})\right]\nabla U(\boldsymbol{\theta}) + \boldsymbol{\Phi}(\boldsymbol{\theta})\right\}dt + \sqrt{2}\,\mathbf{B}(\boldsymbol{\theta})\,d\mathbf{W}(t), \tag{14}$$

with **BB**^{T} = **D**. The matrix fields **D**(***θ***) and **S**(***θ***) modify the dynamics but keep the target distribution unchanged. **D** is positive semi-definite and defines the local geometry of the space in which the inference is occurring, while **S** is skew-symmetric and adds non-reversible dynamics. When **D** and **S** are state dependent, the correction term **Φ** = div(**D** + **S**) must be included [36].
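The effect of the geometry matrix **D** on convergence can be illustrated with a deterministic sketch. It assumes a two-dimensional Gaussian target with unit variances and correlation *ρ*, constant **D** and **S** = **0** (so the correction term vanishes), and a forward Euler discretization. Because the noise has zero mean, the expected offset from the target mean evolves under the drift alone, so only the noise-free iteration is needed. The naïve choice **D** = **I** gives drift matrix **Σ**^{−1}, whereas **D** = **Σ** gives drift matrix **DΣ**^{−1} = **I**. The function names, step size, and correlation are hypothetical.

```python
def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def mean_error_after(drift_matrix, err0, step, n_steps):
    """Iterate err <- err - step * M err, the noise-free part of the discretized dynamics."""
    err = list(err0)
    for _ in range(n_steps):
        drift = matvec(drift_matrix, err)
        err = [e - step * d for e, d in zip(err, drift)]
    return err


rho = 0.9   # hypothetical correlation of the target
det = 1.0 - rho ** 2
sigma_inv = [[1.0 / det, -rho / det], [-rho / det, 1.0 / det]]  # inverse of [[1, rho], [rho, 1]]
identity = [[1.0, 0.0], [0.0, 1.0]]

err0 = [1.0, 1.0]   # initial offset from the mean, along the high-variance direction
naive = mean_error_after(sigma_inv, err0, step=0.1, n_steps=50)    # D = I
natural = mean_error_after(identity, err0, step=0.1, n_steps=50)   # D = Sigma
```

With the same step size, the offset along the strongly correlated direction decays more slowly for the naïve choice, because its relaxation rate is set by the smallest eigenvalue of **Σ**^{−1}, while the geometry-aware choice relaxes at the same rate in every direction.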

The “complete recipe” provides a general framework to design samplers based on Langevin dynamics. Samplers based on Riemannian geometry can be designed by choosing **D**(***θ***) to be the inverse Fisher information matrix (or an approximation thereof), yielding a preconditioned gradient **∇**_{nat}*U* = **G**^{−1}**∇***U* (Figure 1) [49, 51–55]. Samplers incorporating dummy variables can be designed by expanding the parameter space and using the matrices **D** and **S** to obtain the desired dynamics [27, 36, 37, 46–48]. Similarly, prior works have proposed methods to accelerate sampling in biologically inspired neural networks by parallelizing the inference [19], using Hamiltonian dynamics [20] or by adding non-reversible dynamics [8]. The “complete recipe” provides a general framework that encompasses all these examples, allowing for the principled design of biologically-plausible samplers and emphasising the importance of population geometry for rapid convergence.

### 3.2 Fast sampling through population geometry in efficient balanced networks

We first consider a sampler based on efficient balanced networks [19, 38–40] that leverages the geometry of the inference to implement efficient sampling at the level of the population. In previous work, Savin and Denève [19] derived a sampler implementing naïve Langevin dynamics (we provide a full derivation using the notation from the present paper in Appendix C). Although they proposed to accelerate sampling by implementing parallel inference loops, they did not leverage the geometry of the inference, nor did they test convergence in high (*n*_{p} > 2) dimensions. Here, we propose to use the “complete recipe” for stochastic gradient MCMC [36] to implement a more efficient sampler and show that performing inference on a favorable geometry helps with accuracy in high dimensions.

Following [8, 19], we use a Gaussian linear model in which the network estimates the posterior probability of hidden sources ***θ*** given sensory inputs **x** corrupted by Gaussian noise, *p*(**x**|***θ***) = 𝒩(**x**; **A*θ***, **I**) (which we take to have unit variance without loss of generality), and prior expectations on the values of the hidden sources *p*(***θ***) = 𝒩(***θ***; **0**, **C**). The mean ***μ*** and covariance **Σ** of the posterior probability of the features given the input, *p*(***θ***|**x**) ∝ *p*(**x**|***θ***)*p*(***θ***), are ***μ*** = **ΣA**^{T}**x** and **Σ** = (**C**^{−1} + **A**^{T}**A**)^{−1} respectively. Up to an irrelevant constant offset, the corresponding energy function is *U*(***θ***) = (***θ*** − ***μ***)^{T}**Σ**^{−1}(***θ*** − ***μ***)/2, with gradient **∇***U*(***θ***) = **Σ**^{−1}(***θ*** − ***μ***). Defining *τ*_{s} as the timescale of the Langevin process, this yields the general complete recipe dynamics

$$d\boldsymbol{\theta} = -\frac{1}{\tau_s}\left(\mathbf{D} + \mathbf{S}\right)\boldsymbol{\Sigma}^{-1}\left(\boldsymbol{\theta} - \boldsymbol{\mu}\right)dt + \sqrt{\frac{2}{\tau_s}}\,\mathbf{B}\,d\mathbf{W}(t). \tag{15}$$

As we principally study the effect of the geometry, we henceforth set **S** = **0**. Note that even though the underlying dynamics implemented by (15) are reversible, the non-linearities introduced by the spiking lead to a non-reversible sampler. As in [19], we approximate ***θ*** by its estimate in the dynamics from (12). Then, as shown in Appendix C, this yields an EBN whose voltage dynamics have recurrent weight matrix **Ω** = **Γ**^{T}**Γ** and whose spiking rule is greedy, with thresholds *T*_{j} = (**Γ**^{T}**Γ**)_{jj}/2. Though we consider a Gaussian target distribution, one could build a spiking sampler for any exponential-family distribution by appropriate choice of *U*(***θ***). However, this could introduce nonlinear and non-local interactions which would spoil biological interpretability.
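The posterior of the Gaussian linear model can be computed directly in a small worked example. The sketch assumes unit-variance observation noise with mean **A*θ*** and takes the prior's precision matrix (the inverse of the prior covariance) as input; it is restricted to two hidden sources so that a closed-form 2×2 inverse suffices. The function names and numerical values are hypothetical.

```python
def transpose(m):
    """Transpose of a nested-list matrix."""
    return [list(col) for col in zip(*m)]


def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Matrix-matrix product for nested lists."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def gaussian_posterior(a, x, prior_precision):
    """Posterior mean and covariance for x = A theta + unit-variance Gaussian noise."""
    at = transpose(a)
    ata = matmul(at, a)
    precision = [[p + q for p, q in zip(rp, rq)] for rp, rq in zip(prior_precision, ata)]
    sigma = inv2(precision)            # posterior covariance
    mu = matvec(sigma, matvec(at, x))  # posterior mean Sigma A^T x
    return mu, sigma


# Hypothetical example: three inputs, two hidden sources, isotropic prior.
a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x = [1.0, 2.0, 2.5]
prior_precision = [[1.0, 0.0], [0.0, 1.0]]
mu, sigma = gaussian_posterior(a, x, prior_precision)

# The energy gradient, prior_precision @ mu + A^T (A mu - x), should vanish at mu.
residual = [p - q for p, q in zip(matvec(a, mu), x)]
grad = [p + q for p, q in zip(matvec(prior_precision, mu), matvec(transpose(a), residual))]
```

The vanishing gradient at the posterior mean confirms that the energy function is minimized there, so Langevin dynamics driven by this gradient fluctuate around ***μ***.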

To illustrate how correlations between parameters affect sampling in high dimensions, we will focus on equicorrelated multivariate distributions. The covariance matrix **Σ** of such distributions is parameterized by an overall variance *σ* > 0 and a correlation coefficient −1 < *ρ* < 1 (positive-definiteness additionally requires *ρ* > −1/(*n*_{p} − 1)) such that:

$$\Sigma_{ij} = \sigma\left[\delta_{ij} + \rho\,(1 - \delta_{ij})\right]. \tag{17}$$

We will explore the performance of the sampling algorithms across values of *ρ* for different dimensions of the parameter (*n*_{p}) and neuron (*n*_{n}) spaces. For a multivariate Gaussian distribution 𝒩(***μ***, **Σ**), the Fisher information matrix is **G** = **Σ**^{−1}, and we will therefore use **D** = **G**^{−1} = **Σ** (with **B** chosen such that **BB**^{T} = **Σ**) as the geometry in which we perform inference in our geometry-aware implementation (see Appendix C.3). In our simulations, we compare the accuracy of sampling of the naïve implementation (**D** = **B** = **I**) with the geometry-aware version over a 50 millisecond window, which is two and a half times the membrane time constant *τ*_{m} = 20 ms, as well as at steady-state. In Figures 2 and D.2, we show that the geometry-aware sampler is more robust to increasing the correlation of the parameters and the dimensionality, allowing inference at behavioral timescales. As is known in Langevin sampling, the discretization introduces excess variance in the inference, but this effect, although still present, is strongly reduced in the geometry-aware implementation. Note that the geometry is imposed through the dynamics of the membrane potentials via the recurrent connectivity in the network.
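The difficulty posed by equicorrelated targets can be quantified by their spectrum. The sketch below assumes a covariance with a common variance on the diagonal and that variance times *ρ* off the diagonal. Such a matrix has one eigenvalue equal to the variance times 1 + (*n* − 1)*ρ*, with the all-ones eigenvector, and *n* − 1 eigenvalues equal to the variance times 1 − *ρ*, with contrast eigenvectors; the code verifies both by direct multiplication. The function names and parameter values are hypothetical.

```python
def equicorrelated(n, variance, rho):
    """Covariance with equal variances and equal pairwise correlations."""
    return [[variance * (1.0 if i == j else rho) for j in range(n)] for i in range(n)]


def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def condition_number(n, rho):
    """Ratio of largest to smallest eigenvalue, valid for 0 <= rho < 1."""
    return (1.0 + (n - 1) * rho) / (1.0 - rho)


n, variance, rho = 5, 2.0, 0.9
sigma = equicorrelated(n, variance, rho)

ones = [1.0] * n
contrast = [1.0, -1.0] + [0.0] * (n - 2)
top = matvec(sigma, ones)       # variance * (1 + (n - 1) * rho) times the all-ones vector
low = matvec(sigma, contrast)   # variance * (1 - rho) times the contrast vector
kappa = condition_number(n, rho)
```

The condition number grows with both *ρ* and *n*, so a naïve Langevin sampler with a single step size must either take very small steps to remain stable along the low-variance directions or converge slowly along the high-variance direction.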

### 3.3 Fast sampling through population geometry using probabilistic spike rules

In the preceding subsection, we have shown how neural population geometry, implemented through neural dynamics, can accelerate the speed of approximate sampling in an efficient balanced network. However, this approach suffers from a fundamental conceptual gap: the firing rates of the spiking network are being used to simulate non-spiking Langevin dynamics; the spiking network itself has not been designed to sample. Specifically, the discretization introduced by the spiking exacerbates the errors introduced by the discrete time implementation of the sampling dynamics, leading to overestimation of the stimulus variance. In this subsection, we take a first step towards bridging this gap by considering an alternative limit of the general model introduced in §2: the case in which sampling is performed leveraging only the stochasticity in the spike rule. Our objective here is not to demonstrate a fully biologically-plausible or practically useful sampling algorithm; rather, it is to illustrate the importance of population geometry in a minimal model for probabilistic spiking.

As in §3.2, we focus on sampling from equicorrelated Gaussian distributions 𝒩(***μ***, **Σ**). In this case, we set **Ψ** = **Σ** and ***θ*** = ***μ*** in (12). Naïve sampling here corresponds to choosing **Γ** in some sense generically (see Appendix B.5), while geometry-aware sampling corresponds to choosing **Γ** such that **ΓΓ**^{T} ≃ **Σ** up to overall constants of proportionality. In Figures 3, 4, and D.3, we show that naïve choices of **Γ** lead to vanishing spike rates at strong correlations *ρ* and large parameter-space dimensionalities *n*_{p}. This results in dramatic underestimation of the mean and variance of the target distribution, which is resolved by choosing the geometry appropriately, again allowing inference at behavioral time-scales. Moreover, these networks show Poisson-like variability in spiking statistics (Figure 3), consistent with cortical dynamics [56, 57].
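One ingredient of this effect can be illustrated numerically. The sketch assumes that spiking thresholds take the form *T*_{j} = (**Γ**^{T}**Σ**^{−1}**Γ**)_{jj}/2 with **Γ** = [+**Z**, −**Z**], and compares two hypothetical choices of **Z** for a two-dimensional equicorrelated target with unit variances: **Z** proportional to the identity, as one example of a naïve choice, and **Z** proportional to a Cholesky factor of **Σ**, as one way to obtain **ΓΓ**^{T} ∝ **Σ**. It is not the construction of Appendix B.5, and all names and values are assumptions made for illustration.

```python
import math


def transpose(m):
    """Transpose of a nested-list matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Matrix-matrix product for nested lists."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def cholesky2(sigma):
    """Lower-triangular L with L L^T = sigma, for a 2x2 covariance."""
    l11 = math.sqrt(sigma[0][0])
    l21 = sigma[1][0] / l11
    l22 = math.sqrt(sigma[1][1] - l21 ** 2)
    return [[l11, 0.0], [l21, l22]]


def thresholds(z, sigma):
    """T_j = (Gamma^T Sigma^{-1} Gamma)_jj / 2 for Gamma = [+Z, -Z]."""
    gamma = [row + [-v for v in row] for row in z]
    omega = matmul(transpose(gamma), matmul(inv2(sigma), gamma))
    return [0.5 * omega[j][j] for j in range(len(omega))]


scale = 0.1   # hypothetical overall weight scale
results = {}
for rho in (0.0, 0.9, 0.99):
    sigma = [[1.0, rho], [rho, 1.0]]
    naive_z = [[scale, 0.0], [0.0, scale]]                            # Z proportional to I
    aware_z = [[scale * v for v in row] for row in cholesky2(sigma)]  # Z Z^T proportional to Sigma
    results[rho] = (thresholds(naive_z, sigma)[0], thresholds(aware_z, sigma)[0])
```

For the identity-like choice, the threshold grows as 1/(1 − *ρ*²) and diverges as *ρ* ↑ 1, which makes spike proposals increasingly hard to accept; for the Cholesky-based choice, the threshold is independent of *ρ*.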

In Appendix B.5, we provide a more careful analysis of the strongly-correlated limit *ρ* ↑ 1. Informally, we show that the probability of spiking should vanish if **Γ** is chosen sufficiently naïvely and the mean of the target distribution is uniform across parameter dimensions, i.e., all components of ***μ*** are equal. This analysis is not specialized to a particular case, and holds generally for the model of §2. Therefore, careful choice of population geometry, as implemented by the neural code, is required for fast sampling in this model.

## 4 Discussion

In this paper, we have shown how careful choice of neural population geometry enables fast sampling in spiking neural networks. We presented a common framework that encompasses both EBN samplers, which approximate Langevin dynamics with greedy spiking, and approximate Metropolis-Hastings samplers, which combine deterministic voltage dynamics with probabilistic spiking. We then leveraged population geometry to perform rapid sampling at behaviorally-relevant timescales in these two disparate limits of our general model. We now discuss some of the limitations of our work, and highlight possible directions for future inquiry.

Like the original EBN model, the probabilistic spiking model introduced in §2 suffers from the limitations that it requires instantaneous propagation of spike information and that only one neuron is allowed to spike at a time [19, 38–41]. Moreover, the discretization timestep enforces a hard cutoff on the maximum spike rate. Some of these limitations could be partially circumvented by generalizing the spike proposal distribution to allow multiple spikes. However, such a model would still suffer from the issue that an accept/reject step that accounts for the effect of spikes from multiple neurons will require instantaneous communication across the network. This limitation could possibly be overcome within the framework of asynchronous Gibbs sampling [58, 59], which ignores the requirement that updates should be coordinated across the network.

The analysis of §2 shows that the models of §3 can be viewed as limiting cases of a single framework. It is likely that the parallels between these limiting models could be further strengthened by viewing the Gaussian noise in the voltage dynamics of the EBN sampler as an approximation to the effect of the stochastic spiking of other neurons in the Metropolis-Hastings sampler. Such an approximation would follow from a central limit theorem-type argument in the limit of large network size, provided that one could neglect the coupling between spikes in different neurons induced by the proposal distribution and treat them as approximately independent [57, 60–62]. Careful analysis of the relationship between these two sources of stochasticity will be an interesting subject for future investigation. Moreover, it will be interesting to investigate how they might be integrated in a single network, which would result in a spiking sampler somewhat reminiscent of the Metropolis-adjusted Langevin algorithms used in machine learning [44, 45].

In this work, we did not constrain neurons to follow Dale’s law, and single neurons therefore have both excitatory and inhibitory effects on their neighbors. Many frameworks have been proposed to map the connectivity of unconstrained network algorithms onto distinct excitatory and inhibitory neuron types [8, 63, 64]. These refinements of the biological plausibility will not affect our key argument of accelerating inference through a favorable population geometry. However, different possible implementations that comply with Dale’s law will make different predictions for experimentally-measurable biophysical properties. For example, although less numerous, fast spiking inhibitory neurons have higher firing rates [63, 64], which could allow the approximate symmetry of the readout, as required by the construction of §2, to be maintained.

In biological spiking networks, probabilistic spike emission and probabilistic synaptic release are natural sources of stochasticity [65–68]. These two layers of probabilistic computation provide additional flexibility in processing beyond the simple accept/reject step considered here. As a result, it is likely that one could construct sampling algorithms that are at once more biophysically detailed and more computationally efficient than the simple network constructed in §2. The algorithms proposed in this work could also enable fast sampling in neuromorphic circuits [69, 70]. As they require only the local membrane voltage to compute accept/reject steps, these algorithms would potentially be only limited by the timescale of individual computing units (which are much faster than biological neurons) rather than the dimensionality of the inference problem [71].

The sampling processes considered in this work focus on short-timescale perceptual inference. However, similar probabilistic inference can occur at longer timescales of learning [72–74]. These long-timescale sampling processes would allow networks to flexibly infer their synaptic weights, and thus adapt the geometry of the population code to match changing stimulus geometry through meta-learning [75]. Recent works have analytically studied the population geometry that results from this sampling procedure in rate-based networks [76], and proposed algorithms for efficient learning in EBNs [40, 77–79] and other classes of spiking networks [80–82]. However, the interactions between fast activity sampling and slow network parameter sampling in neural circuits—particularly spiking networks—remain poorly understood [74]. Characterizing how adaptive population geometry accelerates perceptual inference in dynamic environments will be an important step towards a more complete understanding of probabilistic inference in neural circuits.

## Supplemental Information

### A Table of variables and parameters used in the models and simulations

We provide a table stating the dimensionality and values taken by the variables and parameters used throughout the paper.

### B Metropolis-Hastings sampling using probabilistic spiking rules

In this Appendix, we provide a step-by-step construction of the spiking sampler introduced in §2 of the main text. Our goal is to use the readout

$$\hat{\boldsymbol{\theta}}_t = \boldsymbol{\Gamma}\mathbf{r}_t$$

to sample a Gaussian distribution with mean ***θ*** and covariance **Ψ**, where **r**_{t} = (1 − *η*)**r**_{t−1} + **o**_{t} is the filtered spike history for a decay constant 0 ≤ *η* ≤ 1. In §B.1, we construct a circuit that samples a Gaussian distribution using a discrete-time perfect integrator of the spike train (i.e., *η* = 0). In §B.2, we relax this assumption, yielding an approximate sampler using leaky integration in discrete-time, and discuss the behavior of this model in the continuum limit.

#### B.1 A simple sampler assuming perfect integration and balance

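We first construct a sampling circuit under the assumptions that we have access to a perfect integrator of the spike train, i.e. that *η* = 0, and that the readout matrix is of the form

$$\boldsymbol{\Gamma} = [+\mathbf{Z}, -\mathbf{Z}]$$

for some matrix **Z**.

At the *t*-th timestep, we choose one neuron *j* uniformly at random, and let the spike proposal be **o**′ = **e**_{j}, where **e**_{j} is the *j*-th standard Euclidean basis vector (i.e., (**e**_{j})_{k} = *δ*_{jk}). This yields a candidate readout

$$\hat{\boldsymbol{\theta}}' = \boldsymbol{\Gamma}\left(\mathbf{r}_{t-1} + \mathbf{e}_j\right).$$

Under the symmetry assumption on **Γ**, the acceptance ratio is given by

$$A = \min\left\{1, \frac{P(\hat{\boldsymbol{\theta}}')}{P(\hat{\boldsymbol{\theta}}_{t-1})}\right\},$$

where ***θ̂***_{t−1} = **Γr**_{t−1} is the current readout, as the proposal distribution is exactly symmetric in ***θ̂***′ and ***θ̂***_{t−1}. Then, we accept the proposed spike with probability *A*. Concretely, for *u* ∼ 𝒰[0, 1], we take

$$\mathbf{o}_t = \begin{cases} \mathbf{e}_j & \text{if } u \le A, \\ \mathbf{0} & \text{otherwise.} \end{cases}$$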
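For a Gaussian distribution with mean ***θ*** and covariance **Ψ**, we have

$$\log\frac{P(\hat{\boldsymbol{\theta}}')}{P(\hat{\boldsymbol{\theta}}_{t-1})} = \mathbf{e}_j^{T}\boldsymbol{\Gamma}^{T}\boldsymbol{\Psi}^{-1}\left(\boldsymbol{\theta} - \boldsymbol{\Gamma}\mathbf{r}_{t-1}\right) - \frac{1}{2}\,\mathbf{e}_j^{T}\boldsymbol{\Gamma}^{T}\boldsymbol{\Psi}^{-1}\boldsymbol{\Gamma}\mathbf{e}_j.$$

Defining the matrix

$$\boldsymbol{\Omega} = \boldsymbol{\Gamma}^{T}\boldsymbol{\Psi}^{-1}\boldsymbol{\Gamma},$$

we interpret the first term in the log-odds ratio as a membrane potential, **V**_{t} = **Γ**^{T}**Ψ**^{−1}(***θ*** − **Γr**_{t−1}), and the second as a threshold, *T*_{j} = Ω_{jj}/2, such that the acceptance ratio is

$$A = \min\left\{1, \exp\left(V_{t,j} - T_j\right)\right\}.$$

With this definition, the membrane voltage simply integrates the spike train:

$$\mathbf{V}_t = \mathbf{V}_{t-1} - \boldsymbol{\Omega}\mathbf{o}_{t-1}.$$

We now observe that, if *n*_{n} > *n*_{p}, it is possible that a spike may not contribute to the parameter estimate. Concretely, we say that a spike, or the corresponding neuron, is *irrelevant* if it is annihilated by **Γ**, i.e., if **Γe**_{j} = **0**. Irrelevant spike proposals are always accepted with probability one, as we have *V*_{t,j} = *T*_{j} = 0 and hence *A* = 1. Moreover, the membrane voltage is not changed by the emission of irrelevant spikes. For these reasons, we are free to re-define the population rate **r**_{t} to exclude such spikes. Therefore, a timestep with an irrelevant spike is equivalent to not updating the network at all, and we could choose to re-define the network such that only relevant neurons are included. However, though there will exist some non-trivial set of vectors that are annihilated by **Γ** if *n*_{n} > *n*_{p}, the situation in which this null space is axis-aligned (thus implying the existence of irrelevant neurons) is not generic.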

We note that appropriate initialization of the membrane potential (for the desired mean) is important, as otherwise some bias will be introduced. Thus, as written, this model cannot easily accommodate a time-varying mean signal *θ*_{t}. This shortcoming could obviously be addressed by taking
which yields voltage dynamics
Alternatively, one could also take
and re-define the threshold for the *j*-th neuron to be time-varying:
The former of these approaches is reasonable from a biological perspective, as the new term **Γ**^{T}**Ψ**^{−1} (*θ*_{t} − *θ*_{t−1}) in the voltage dynamics has the interpretation of a signal *θ*_{t} − *θ*_{t−1} fed through an input weight matrix **Γ**^{T}**Ψ**^{−1}. For this sampling procedure to work, the mean should be slowly-varying.
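
For a constant mean, the perfect-integrator sampler of this subsection can be sketched in a few lines of NumPy. This is a minimal illustration rather than the Matlab code used for the simulations. It assumes the standard Metropolis rule for a symmetric proposal, so that for a Gaussian target the log-odds of accepting a spike in neuron *j* is *V*_{j} − *T*_{j} with **V** = **Γ**^{T}**Ψ**^{−1}(***θ*** − **Γr**) and *T*_{j} = (**Γ**^{T}**Ψ**^{−1}**Γ**)_{jj}/2. The function name `mh_spiking_sampler`, the variable `W`, and the sign convention chosen for the recurrent weights are hypothetical choices made for illustration.

```python
import numpy as np

def mh_spiking_sampler(theta, Psi, Gamma, n_steps, rng):
    """Perfect-integrator (eta = 0) Metropolis-Hastings spiking sampler (sketch)."""
    n_p, n_n = Gamma.shape
    Psi_inv = np.linalg.inv(Psi)
    W = Gamma.T @ Psi_inv @ Gamma      # recurrent weights (assumed sign convention)
    T = 0.5 * np.diag(W)               # per-neuron thresholds
    r = np.zeros(n_n)                  # integrated spike counts
    V = Gamma.T @ Psi_inv @ theta      # voltage initialised for the desired mean
    readouts = np.empty((n_steps, n_p))
    for t in range(n_steps):
        j = rng.integers(n_n)          # propose a spike in a uniformly chosen neuron
        # Accept with probability min(1, exp(V_j - T_j)).
        if rng.uniform() < np.exp(min(0.0, V[j] - T[j])):
            r[j] += 1.0
            V -= W[:, j]               # the voltage integrates the accepted spike
        readouts[t] = Gamma @ r
    return readouts, r, V
```

Calling `mh_spiking_sampler(theta, Psi, np.hstack([-Z, Z]), n_steps, np.random.default_rng(0))` with a random matrix `Z` gives a trajectory of readouts. The incrementally updated voltage should agree with **Γ**^{T}**Ψ**^{−1}(***θ*** − **Γr**) evaluated directly from the final spike counts.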

#### B.2 Relaxing the assumption of perfect integration

We now relax the assumption of perfect integration, and assume only that *η* ≪ 1. With the same proposal distribution as before, we take the acceptance ratio of the accept-reject step to be
As noted in the main text, this choice implements a sort of look-ahead step. Moreover, if we took the acceptance ratio to depend on the likelihood of the proposed state with decay, *P*[(1 − *η*)**Γr**_{t−1} + **Γe**_{j}], relative to the likelihood *P*[**Γr**_{t−1}] of the current state (rather than the next state with decay but without the proposed spike), the resulting log-odds ratio would include terms of order *η* that are quadratic in the rate **r**_{t}.

With the choices above, the log-odds ratio is
which, like in the perfect integrator model, is linear in the rate. As in the perfect integrator model, we define the recurrent weight matrix
and interpret the first term in the log-odds ratio as a membrane potential,
and the second as a threshold
such that the acceptance ratio is
With this definition, the membrane voltage evolves as
Therefore, this model differs from the perfect integrator model of §B.1 only in the voltage dynamics; the perfect integrator is recovered exactly if we set *η* = 0. As in the perfect integrator model, the natural generalization of these leaky dynamics to time-varying mean signal *θ*_{t} is to take
which leads to the dynamics
This will, of course, not be an exact sampler unless the mean is constant. If the mean is slowly-varying, however, it should be a good approximate sampler.
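
The claim that the look-ahead comparison gives a log-odds ratio that is linear in the rate can be checked numerically. The sketch below assumes a constant mean and an unnormalised Gaussian log-density. It evaluates the log-odds once as a voltage minus a threshold and once directly from the two candidate readouts, (1 − *η*)**Γr**_{t−1} + **Γe**_{j} and (1 − *η*)**Γr**_{t−1}. The function names are hypothetical, and the exact definition of the voltage used in the paper may differ from this one by convention.

```python
import numpy as np

def log_density(x, theta, Psi_inv):
    """Unnormalised Gaussian log-density at readout x."""
    d = x - theta
    return -0.5 * d @ Psi_inv @ d

def lookahead_log_odds(r_prev, j, theta, Psi_inv, Gamma, eta):
    """Log-odds of a spike in neuron j, written as voltage minus threshold."""
    decayed = (1.0 - eta) * (Gamma @ r_prev)          # next state without the spike
    V = Gamma.T @ Psi_inv @ (theta - decayed)         # linear in the rate r_prev
    T_j = 0.5 * Gamma[:, j] @ Psi_inv @ Gamma[:, j]   # rate-independent threshold
    return V[j] - T_j

def direct_log_odds(r_prev, j, theta, Psi_inv, Gamma, eta):
    """Same quantity evaluated directly from the two candidate readouts."""
    decayed = (1.0 - eta) * (Gamma @ r_prev)
    with_spike = decayed + Gamma[:, j]
    return log_density(with_spike, theta, Psi_inv) - log_density(decayed, theta, Psi_inv)
```

Setting `eta = 0.0` reduces both functions to the perfect-integrator case of §B.1.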

#### B.3 Adding an elastic net prior on the rates

We now consider the effect of adding an elastic net prior on the rates, as was considered in the original work of Boerlin et al. [38]:
Without loss of generality, we consider the case in which a decay term is included, as we can then recover the perfect integrator by setting *η* = 0. Again defining the acceptance ratio in terms of a comparison against the rate with decay but without the proposed spike, the addition of the prior adds
to the log-odds ratio. This yields a modified recurrent weight matrix
and a modified threshold
but the expression for the membrane voltage in terms of these parameters is identical in functional form:
Therefore, adding the elastic net prior changes the definitions of the weight matrix that maps rates to voltages and of the threshold, but not the overall form of the result, hence it does not introduce any new conceptual difficulties. We remark that we could include the constant factor *α* in the membrane voltage as we do in our implementation of EBNs (see Appendix C) rather than in the threshold, which would give it a somewhat different biological interpretation but would have no algorithmic effect.

#### B.4 The continuous-time limit

In this subsection, we consider the continuous-time limit of the models introduced above. This limit corresponds to taking the limit in which spike proposals are made infinitely often, and regarding the dynamics written down previously as a forward Euler discretization of an underlying continuous-time system. For clarity, we write the discrete timesteps, denoted in previous sections simply by *t*, as *t*_{d} here, and reserve the unsubscripted symbol *t* for the continuum variable. For a timestep Δ, we let *t* = Δ*t*_{d}, and let the discrete-time decay rate be *η* = Δ /*τ*_{m} for a ‘membrane’ time constant *τ*_{m}. We may then write the discretized rate dynamics (B.2) as
which, taking Δ ↓ 0, of course yields the familiar continuous-time dynamics
In continuous time, the spike train **o**(*t*) is now composed of Dirac delta functions, as the discretized spikes are rectangular pulses of width Δ in time and height 1/Δ. We next consider the voltage dynamics of the leaky integrator for a varying mean signal (B.26), which may similarly be re-written as
In the continuum limit, we retain only the contribution of the first of the two terms involving the discrete-time spike train, as the other yields pulses of width Δ and height unity, which yield a negligible contribution to the integral. Thus, we have the dynamics
From these dynamics, one could then recover the continuum limit of the perfect integrator dynamics by taking *τ*_{m} → ∞. In this limit, the rate will decay by only an infinitesimal amount between a rejected spike proposal and the next proposal, meaning that the error incurred by neglecting the asymmetry in the proposal distribution due to the decay should be negligible.
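
As a small numerical illustration of this limit, the sketch below applies the discrete-time leaky filter with *η* = Δ/*τ*_{m} to a single spike and compares the result with the exponential kernel exp(−*t*/*τ*_{m}) of the continuous-time dynamics. The function name `filtered_rate` is hypothetical, and the values of Δ and *τ*_{m} are taken from Appendix D purely as an example.

```python
import numpy as np

def filtered_rate(spikes, dt, tau_m):
    """Forward-Euler leaky integration r[k] = (1 - dt/tau_m) * r[k-1] + spikes[k]."""
    eta = dt / tau_m
    r = np.zeros(len(spikes))
    prev = 0.0
    for k, o in enumerate(spikes):
        prev = (1.0 - eta) * prev + o
        r[k] = prev
    return r

tau_m = 20e-3                  # membrane time constant (s)
dt = 1e-4                      # discretisation timestep (s)
spikes = np.zeros(400)
spikes[0] = 1.0                # a single spike at t = 0
r = filtered_rate(spikes, dt, tau_m)
t = dt * np.arange(len(spikes))
max_err = np.max(np.abs(r - np.exp(-t / tau_m)))   # shrinks as dt decreases
```

Repeating the comparison with a smaller `dt` should reduce `max_err`, as expected for a forward Euler discretization.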

#### B.5 Analyzing the strongly-correlated limit of sampling from an equicorrelated Gaussian

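Here, we consider how these models behave when sampling from an equicorrelated Gaussian distribution, i.e., a distribution with covariance matrix
$$\mathbf{\Psi} = (1-\rho)\,\mathbf{I}_{n_p} + \rho\,\mathbf{1}\mathbf{1}^{\top}$$
for correlation coefficient *ρ* ∈ (−1, +1), where **1** denotes the *n*_{p}-dimensional vector of ones. We are particularly interested in the strongly-correlated limit *ρ* ↑ 1. As we will consider the case in which the marginal variance does not scale with the correlation, our choice of unit marginal variance is made without loss of generality. For this covariance matrix, the Sherman-Morrison formula yields [83]
$$\mathbf{\Psi}^{-1} = \frac{1}{1-\rho}\left[\mathbf{I}_{n_p} - \frac{\rho}{1+(n_p-1)\rho}\,\mathbf{1}\mathbf{1}^{\top}\right].$$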
We first consider the case in which the mean signal ***θ*** is identically equal to zero. We now argue that, for choices of **Γ** that are in some sense sufficiently naïve, the probability that relevant spikes are emitted should tend to zero as *ρ* ↑ 1. This corresponds to showing that *V*_{t,j} − *T*_{j} → −∞ as *ρ* ↑ 1 for all indices *j* corresponding to relevant neurons. For ***θ*** = **0**, we have
Consider the first timestep, with **r**_{0} = **0**, for which we have
Near *ρ* = 1, we then have the expansion
under the assumption that **Γ** is an 𝒪(1) function of *ρ* in this region, and thus cannot introduce additional possible divergences. For relevant spikes, we have the strict inequality (**Γ**^{T}**Γ**)_{jj} > 0. Moreover, by the Cauchy-Schwarz inequality, we have , with equality if and only if . If , then spikes in neuron *j* affect the readout precisely only along the common mode, and *V*_{0,j} − *T*_{j} does not diverge as *ρ* ↑ 1. However, this case is quite fine-tuned. For generic **Γ** not satisfying this alignment condition, we have the strict inequality for all relevant neurons. Then, if (**Γ**^{T}**Γ**)_{jj} vanishes no more rapidly than 1 − *ρ* as *ρ* ↑ 1 (i.e., if (**Γ**^{T}**Γ**)_{jj}/(1 − *ρ*) → ∞ as *ρ* ↑ 1), we have that *V*_{0,j} − *T*_{j} → −∞ as *ρ* ↑ 1. This holds, for instance, if (**Γ**^{T}**Γ**)_{jj} is a constant function of *ρ*. Thus, under these conditions, the probability that a relevant spike is emitted at the first timestep vanishes as *ρ* ↑ 1. Heuristically, this in turn implies that the (relevant component of the) rate at the second timestep will be zero with probability one. Therefore, we may iterate this argument forward in time, showing that the probability of emission of relevant spikes should vanish in the limit *ρ* ↑ 1. This argument will not be affected by the addition of an elastic net penalty unless the coefficients *α* and *λ* are taken to diverge as *ρ* is taken to unity, as the coefficients are strictly non-positive.

The situation is somewhat more complicated if the mean signal is not identically zero, in which case we have
Following our previous analysis, at the first timestep we have
near *ρ* = 1. There are now two possible divergent terms, which can compete to change the sign of *V*_{0,j} − *T*_{j} as *ρ* ↑ 1. One case of interest is when . Then,
and we have
as in the case when ***θ*** was strictly zero, implying that *V*_{0,j} − *T*_{j} → −∞ under the abovementioned conditions on **Γ**. Another illustrative case is *θ*_{t} = **Γe**_{j}. With this fine-tuning, hence we expect *V*_{0,j} − *T*_{j} → +∞ as *ρ* ↑ 1 under the abovementioned constraints on **Γ**. Thus, in this case, the spike probability should tend to one, showing that complications can arise in the case of a non-uniform mean signal.
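
The zero-mean argument of this subsection can be illustrated numerically. The sketch below assumes the equicorrelated covariance (1 − *ρ*)**I** + *ρ***11**^{T} with unit marginal variance, uses its closed-form Sherman-Morrison inverse, and evaluates the first-timestep log-odds for ***θ*** = **0** and **r**_{0} = **0** under the Metropolis rule, which reduces to −(**Γe**_{j})^{T}**Ψ**^{−1}(**Γe**_{j})/2. The function names and the two example columns of **Γ** are hypothetical.

```python
import numpy as np

def equicorrelated_precision(n_p, rho):
    """Closed-form inverse of (1 - rho) * I + rho * 1 1^T (Sherman-Morrison)."""
    ones = np.ones((n_p, n_p))
    return (np.eye(n_p) - rho / (1.0 + (n_p - 1) * rho) * ones) / (1.0 - rho)

def first_step_log_odds(g, rho):
    """V_{0,j} - T_j for zero mean and r_0 = 0, where g = Gamma e_j."""
    P = equicorrelated_precision(len(g), rho)
    return -0.5 * g @ P @ g

generic = np.array([1.0, -0.5, 0.2])   # not aligned with the common mode
aligned = np.ones(3)                   # proportional to the vector of ones
for rho in (0.0, 0.9, 0.999):
    print(rho, first_step_log_odds(generic, rho), first_step_log_odds(aligned, rho))
```

For the generic column the log-odds becomes increasingly negative as *ρ* approaches one, whereas for the aligned column it stays bounded, in line with the alignment condition discussed above.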

### C Sampling in efficient balanced networks

#### C.1 Encoding a dynamical system in efficient balanced networks

In this section, we provide a pedagogical derivation of the efficient balanced network [19, 38] using the notation we use throughout the paper. The goal of the network is to encode an estimate of a signal ***θ*** (a vector of size *n*_{p} × 1) in a population of *n*_{n} spiking neurons. The estimate is obtained by reading out a low-pass filtered version of the population spiking activity:
where **Γ** is the *n*_{p} × *n*_{n} readout matrix and *r*(*t*) is the low-pass filtered spike history:
where *τ*_{m} is the time constant of the readout neurons and *o*_{i} is the spike train of neuron *i*, with *o*_{i} = 1 if the neuron spiked at *t* and *o*_{i} = 0 otherwise.

The goal of the network is to minimize the squared error between the signal and the estimate with an elastic net prior on the firing rate in order to find a good solution while keeping the population spiking activity relatively low:
In the standard implementation of efficient balanced networks, neurons use a greedy spiking rule. A neuron should fire if emitting a spike will lower the loss function, i.e., if
Using the loss function and the definition of the estimate , we can rewrite the spiking rule for neuron *i* as:
Removing terms that appear on both sides, we get
which we can further simplify to
The term on the left-hand side can be interpreted as the voltage potential of neuron *i*:
while the term on the right-hand side can be interpreted as the firing threshold^{2}:
If the voltage potential exceeds this threshold, then the neuron will fire and lower its voltage potential back below threshold. Using (C.8) we can express the dynamics of the voltage potential as a function of the dynamics of ***θ*** and its estimate :
We can rewrite this equation to obtain an expression for the membrane dynamics as a function of the signal, the firing rates, and the spike trains of the neurons. By adding and subtracting and noting that , we further simplify the expression to obtain membrane dynamics as a function of the membrane potential, the effect of a new spike on the circuit and the encoded dynamical system:
We therefore obtain the membrane dynamics for the efficient balanced spiking network proposed in [19, 38]. We can rewrite the dynamics in vector form:
Using this scheme one can encode a variety of dynamical systems, including Langevin dynamics [19, 39, 40].
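
The greedy spiking rule derived in this section can be checked against a direct comparison of the loss. The sketch below assumes the loss ‖***θ*** − **Γr**‖² + *α*‖**r**‖₁ + *λ*‖**r**‖² for non-negative rates. The normalization of the penalty terms is an assumption and may differ from the one used in the paper. Under this assumption a spike in neuron *i* lowers the loss exactly when the voltage **Γ**_{i}^{T}(***θ*** − **Γr**) − *λr*_{i} − *α*/2 exceeds the threshold (‖**Γ**_{i}‖² + *λ*)/2, with *α* placed in the voltage as in footnote 2. All function names are hypothetical.

```python
import numpy as np

def loss(r, theta, Gamma, alpha, lam):
    """Assumed loss: squared readout error plus elastic net penalty on the rates."""
    err = theta - Gamma @ r
    return err @ err + alpha * np.sum(np.abs(r)) + lam * r @ r

def voltages(r, theta, Gamma, alpha, lam):
    """Projected coding error, with the regularisers included as offsets."""
    return Gamma.T @ (theta - Gamma @ r) - lam * r - 0.5 * alpha

def thresholds(Gamma, lam):
    """Half the squared norm of each readout column, plus lam / 2."""
    return 0.5 * (np.sum(Gamma ** 2, axis=0) + lam)

def greedy_spikes(r, theta, Gamma, alpha, lam):
    """Boolean mask of neurons whose individual spike would lower the loss."""
    return voltages(r, theta, Gamma, alpha, lam) > thresholds(Gamma, lam)
```

The mask reports each neuron separately. In a simulation one would typically let at most one of the neurons above threshold spike per timestep and then recompute the voltages.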

#### C.2 Sampling in efficient balanced networks using naïve Langevin dynamics

Using the scheme presented in §C.1 above, Savin and Denève [19] proposed to implement a dynamical system corresponding to the naïve Langevin dynamics of a multivariate normal distribution. We use a linear Gaussian model, similarly to several studies of neuroscience-inspired sampling-based networks [8, 19].
These dynamics can be approximated by an efficient balanced network by replacing ***θ*** in (C.14) by . As discussed in [19], this approximation will introduce an acceptable error in most practical situations:
and we therefore obtain the membrane dynamics for the efficient balanced network proposed by Savin and Denève [19]:
Note that there are two timescales here: the timescale of neuronal representations, *τ*_{m} ∼ 20 ms, which is controlled by the biophysical properties of the neurons, and the timescale *τ*_{s} of the Langevin diffusion encoded by the network.

#### C.3 Sampling in efficient balanced networks using the complete recipe for stochastic gradient MCMC

The model proposed by Savin and Denève [19] implements naïve Langevin dynamics, which are known to be slow in high dimensions. Instead, we can use the “complete recipe” for stochastic gradient MCMC [36] to write another sampler with the same equilibrium distribution but more favorable convergence properties. For any positive semi-definite matrix **D** and skew-symmetric matrix **S**, the following dynamics will converge to 𝒩(***μ***, **Σ**) as their stationary distribution:
with **BB**^{T} = **D**.

We can use (C.14) to encode this more general formulation into an efficient balanced network, yielding the following membrane dynamics:
Any choice of positive semi-definite matrix **D** leads to a valid sampler, but extensive work inspired by Amari’s seminal work on natural gradient descent [51, 52] has shown that a principled choice is to take **D** to be the inverse of the Fisher information matrix **G** [36, 49]. For the multivariate Gaussian distribution, the Fisher information matrix is given by:
We should therefore choose **D** = **G**^{−1} = **Σ**. Note that for the multivariate Gaussian distribution, the Fisher information matrix is identical to the Hessian and is location-independent: it does not depend on the value of ***θ***. For more complex distributions, the Fisher information matrix might be difficult to compute, and an approximation can be used as long as it is valid (positive semi-definite) within the complete recipe framework [36, 55, 84]; state-dependent matrices can be corrected for using the term **Φ** from the complete recipe in (14) [36].

In this work, we have considered only hand-tuned or random choices for the matrices controlling the geometry. Previous work by Hennequin et al. [8], when framed within the complete recipe, proposes methods to find a skew-symmetric matrix **S** which accelerates the dynamics. Non-reversibility is indeed known to accelerate learning, but analysing networks with such dynamics is notoriously difficult [33, 85]. In contrast, approximations for the inverse Fisher information matrix are readily computable, even in non-Gaussian settings [55, 84].
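
The effect of the choice of **D** can be illustrated without simulating the stochastic dynamics. The sketch below assumes the constant-matrix form of the complete recipe for a Gaussian target, d***θ*** = −(**D** + **S**)**Σ**^{−1}(***θ*** − ***μ***)d*t* + √2 **B** d**W** with **BB**^{T} = **D**, written in units where the sampling timescale is one. It checks that **Σ** solves the stationary Lyapunov equation **AC** + **CA**^{T} = 2**D** for the drift matrix **A** = (**D** + **S**)**Σ**^{−1}, and compares the relaxation rates of the mean for the naïve choice **D** = **I** and the natural choice **D** = **Σ**. The function names and the equicorrelated example covariance are hypothetical.

```python
import numpy as np

def drift_matrix(D, S, Sigma):
    """Linear drift A such that d(theta) = -A (theta - mu) dt + noise."""
    return (D + S) @ np.linalg.inv(Sigma)

def lyapunov_residual(D, S, Sigma):
    """Residual of A C + C A^T = 2 D at C = Sigma; zero means Sigma is stationary."""
    A = drift_matrix(D, S, Sigma)
    return A @ Sigma + Sigma @ A.T - 2.0 * D

def relaxation_rates(D, S, Sigma):
    """Real parts of the eigenvalues of A: decay rates of the mean towards mu."""
    return np.sort(np.linalg.eigvals(drift_matrix(D, S, Sigma)).real)

n_p, rho = 8, 0.9
Sigma = (1 - rho) * np.eye(n_p) + rho * np.ones((n_p, n_p))
zero = np.zeros((n_p, n_p))
naive = relaxation_rates(np.eye(n_p), zero, Sigma)   # slowest rate is 1/(1 + (n_p - 1) rho)
natural = relaxation_rates(Sigma, zero, Sigma)       # all rates equal to one
```

With **D** = **Σ** every direction relaxes at the same rate, whereas the naïve choice slows down along the common mode as the correlation or the dimension grows.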

### D Numerical methods and supplementary figures

In this appendix, we describe our numerical methods and include supplementary figures. All simulations were run in Matlab 9.10 (R2021a) or 9.12 (R2022a) (The MathWorks, Natick, MA, USA) on desktop workstations (CPU: Intel i9-9900K or Xeon W-2145, 64GB RAM). They were not computationally intensive, and required less than 24 hours of compute time in total. The code used to generate all figures will be deposited in a public repository upon acceptance.

For the sweeps in *ρ*, we tested 100 values of *ρ* ∈ [0, 0.99]. For the dimension sweeps, we varied *n*_{p} ∈ {2, 4, 8, 16, 32, 64}, where *k* = *n*_{n}/*n*_{p} is the number of neurons per parameter. Here, we showed results for *k* = 10. As expected, sampling failed in the case of *k* = 1, as the sign constraint introduced by spiking restricts the network to efficiently sampling only one half of the values for each parameter. Results were qualitatively similar for *k* ∈ {5, 10, 20}.

In Figures 2, D.1, D.2, 4, and D.3, convergence statistics are computed based on distributions of samples over time, that is, the values visited by the sampler over the course of a single trial. Note that this is different than many machine learning studies, which instead consider distributions across realizations at a single timepoint. However, for an organism, probabilistic inference must be performed within a single trial. Because estimation of the full 2-Wasserstein distance between high-dimensional distributions is computationally expensive [86], we instead computed the mean across dimensions of the 2-Wasserstein distances between the marginals of the sampling and target distributions.
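
One simple way to compute such a statistic is sketched below. It assumes that the target distribution is represented by a reference sample set of the same size as the set of values visited by the sampler. In one dimension, the 2-Wasserstein distance between two equally sized empirical distributions is the root-mean-square difference of their sorted values. The estimator actually used for the figures is not specified here and may differ. The function name `mean_marginal_w2` is hypothetical.

```python
import numpy as np

def mean_marginal_w2(samples, reference):
    """Mean over dimensions of the 1-D 2-Wasserstein distance between two
    equally sized sample sets (rows are samples, columns are dimensions)."""
    a = np.sort(samples, axis=0)       # empirical quantiles of the sampler output
    b = np.sort(reference, axis=0)     # empirical quantiles of the target
    per_dim = np.sqrt(np.mean((a - b) ** 2, axis=0))
    return per_dim.mean()
```

Here `samples` would hold the readouts visited over a single trial, and `reference` could be drawn from the target, for instance with `rng.multivariate_normal(mu, Sigma, size=len(samples))`.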

For Figures 2, D.1 and D.2, we sampled with a discretization timestep of Δ = 10^{−4} s and a membrane time constant of *τ*_{m} = 20 ms. We generated **Γ** by drawing independent and identically distributed Gaussian elements. We used *τ*_{s} = 0.01*τ*_{m} and scaled the elastic net regularization parameters using .

For Figures 3, 4, and D.3, we simulated a Metropolis-Hastings sampling network with a discretization timestep of Δ = 10^{−5} s and a membrane time constant of *τ*_{m} = 20 ms. In the naïve case, we generated the readout matrices as **Γ** = [−**Z**, **Z**] for random matrices **Z** with independent and identically distributed Gaussian elements. To perform sampling in approximately the natural space, we chose **Γ** = **Σ**^{1/2}[−**Z**, **Z**] for **Z** a random matrix with i.i.d. Gaussian elements. For the dimension sweeps in Figures 4c and D.3b, we scaled the variance of the elements of **Z** to be 1/*n*_{p}. The distributions in Figure 3b were generated using 1000 realizations of the randomness in the proposal and accept/reject steps for a single realization of the random matrix **Z**. Statistics in Figures 4 and D.3 were computed across 100 realizations of the random matrix **Z** and of the randomness in the proposal and accept/reject steps.

## Acknowledgements

PM was supported by the Harvard Mind Brain Behavior Interfaculty Initiative. JAZ-V and CP were supported by a Google Faculty Research Award and NSF Award #2134157. VNM and this work were also supported by NIH grant R01DC016289 and NTT Research A47994.

## Footnotes

^{6}VNM and CP jointly supervised this work.

^{1}In Appendix B.3, we show how the elastic net prior on firing rates used by Boerlin et al. [38] can be incorporated into this model. As it only modifies the definitions of the recurrent weights and spiking threshold, this extension does not add new conceptual difficulties, hence we do not discuss it further in the main text. Additionally, we provide a pedagogical introduction to the dynamics of the EBN in Appendix C.

^{2}Here, we chose to include the regularizing term *α* as a fixed offset in the voltage potential but it can equivalently be included as an offset in the spiking threshold *T*_{i}, as discussed in Appendix B.3.
