## Abstract

Correlations in sensory neural networks have both extrinsic and intrinsic origins. Extrinsic or stimulus correlations arise from shared inputs to the network, and thus depend strongly on the stimulus ensemble. Intrinsic or noise correlations reflect biophysical mechanisms of interactions between neurons, which are expected to be robust to changes of the stimulus ensemble. Despite the importance of this distinction for understanding how sensory networks encode information collectively, no method exists to reliably separate intrinsic interactions from extrinsic correlations in neural activity data, limiting our ability to build predictive models of the network response. In this paper we introduce a general strategy to infer population models of interacting neurons that collectively encode stimulus information. The key to disentangling intrinsic from extrinsic correlations is to infer the couplings between neurons separately from the encoding model, and to combine the two using corrections calculated in a mean-field approximation. We demonstrate the effectiveness of this approach on retinal recordings. The same coupling network is inferred from responses to radically different stimulus ensembles, showing that these couplings indeed reflect stimulus-independent interactions between neurons. The inferred model predicts accurately the collective response of retinal ganglion cell populations as a function of the stimulus.

## I. INTRODUCTION

An important goal in sensory neuroscience is to build network models to understand how sensory stimuli are encoded by the collective activity of neuronal populations. Pioneering work initiated in the retina [1–4] proposed to use disordered Ising models to characterize the joint activity of neurons in early sensory systems [5–9], sensory cortices [10] and beyond [11–14]. Motivated by the principle of Maximum Entropy [15], these models represent neurons as binary spins (spike or silence), biased by local fields and interacting through a network of pairwise couplings. The model parameters are fitted to reproduce the empirical mean neuron activities and pairwise correlations between them.

A major limitation of this approach in the context of sensory systems is that the inferred couplings are only effective and do not directly follow network circuitry. Instead, they reflect two different sources of correlations. On the one hand, two neurons can be correlated because they receive correlated or common inputs. For example, in the retina, if the stimulus is correlated over space, nearby neurons will receive similar inputs and consequently will respond synchronously. This type of correlation has been termed ‘signal correlation’ and strongly depends on the actual stimulus and its statistics. On the other hand, neurons might be correlated because of actual interactions in the neural network, either because they are connected directly through gap junctions [16, 17] or indirect pathways [18], or because they receive the same noise sources from photoreceptors [19]. These types of correlations have been termed ‘noise correlations’ and result from the physiological wiring of the network [18]. Similar network effects are present in all sensory systems [20, 21]. When fitting Ising models, signal and noise correlations are mixed together and difficult to disentangle. Consequently, the inferred couplings do not only reflect properties of the network circuitry, but also incidental properties of the actual stimulus.

The inability to separate extrinsic from intrinsic correlations limits the interpretability of these models and their capacity to generalize across different conditions. For example, if two Ising models are trained on the neural responses to two different types of stimuli, their interaction terms will be different [5]. The model thus cannot generalize and fails to predict the collective behaviour in response to a different type of stimulus. Interpreting a change in the interaction terms will also be difficult: it could trivially reflect changes in the stimulus statistics, or could correspond to changes in the network of couplings, and the way the network processes stimuli (adaptation). Modeling the influence of the stimulus is therefore crucial to understand the collective processing performed by sensory networks. Models with stimulus-dependent fields and couplings have been proposed to describe stimulus correlations and network effects [7, 22, 23]. However, here we argue that the strategy proposed to fit these models from neural recordings does not ensure a proper disentanglement between these two sources of correlated activity. Thus, a general strategy to model and accurately separate stimulus and noise correlations in neural networks is still lacking.

Here we propose a general method to achieve this task. We define a population encoding model where each neuron’s spiking probability is governed by its couplings to other neurons, and by an external, time-dependent field encoding the effect of the stimulus. We describe a working strategy for learning the parameters of this model from neural recordings. First, we infer the coupling matrix from population responses to repetitions of short films. Second, we model how each neuron’s firing rate depends on the stimulus, with no regard for noise correlations. Third, we use a mean-field (Thouless-Anderson-Palmer) approximation to calculate the value of the fluctuating field as a function of the stimulus from those predicted rates, corrected by the influence of the network.

We apply our model to describe the responses of retinal neurons to a visual stimulus. We quantify the importance of noise correlations in this system, and we fit the corresponding coupling matrix from recordings of responses to repeated films. We then combine this coupling matrix with a previously proposed model of stimulus encoding fitted on single cells [24] to obtain a complete model that reproduces the population response. We show that this strategy can be used to obtain accurate predictions across different stimulus statistics. We have therefore found a way to design and train models of population responses that can generalize across different stimulus ensembles. The method can be applied to any system where a single-cell encoding model is available.

## II. A GENERAL POPULATION MODEL

### A. Model definition

We start by introducing a general model of the activity of a population of correlated neurons, labeled by *i* = 1,…,*N*, in response to a stimulus. This probabilistic model accounts for both arbitrary single-cell dependences on the stimulus, and direct interactions between cells. Let us denote by *n*_{i} the number of spikes emitted by cell *i* during a short time bin. The probability distribution of spiking patterns **n** = (*n*_{1},…,*n*_{N}) in response to a time-dependent stimulus *S* is given by Eq. (1), where *Z* is a normalization constant and *H*_{1} and *H*_{2} encode extrinsic and intrinsic sources of correlations in the population, respectively. The first term *H*_{1} accounts for the behaviour of single cells in response to the stimulus. *h*_{i}(*t*) = *ĥ*_{i}[*S*_{t}] corresponds to a time-dependent external field applied to neuron *i*, which reflects the influence of the past stimulus *S*_{t} at time *t*. The functional form and parametrization of *ĥ*_{i} as a function of the stimulus will be prescribed later, and depend on the particular sensory system and stimulus of interest. The quadratic, cubic, and factorial terms in *n*_{i}(*t*) in *H*_{1} correspond to a correction to the Poisson distribution of spikes allowing for general dependencies between the mean and variance of *n*_{i}. These corrections have been shown to be essential for describing single neurons [25]. The second interaction term *H*_{2} is parametrized by a matrix of couplings between neurons, **J** = (*J*_{ij}).

Given recordings of the activity of a neural population presented with a known sensory stimulation, the goal is to infer the parameters of the model to best predict the collective response to arbitrary stimuli.

### B. Why direct likelihood maximization cannot be used

Our goal is to infer couplings that solely reflect noise correlations, for two reasons. First, this makes the values of the different parameters easier to interpret. Second, while the stimulus correlation will systematically change with the stimulus statistics, noise correlations may reflect some intrinsic network properties, and the corresponding couplings should be robust to the stimulus statistics. By separating the two types of correlations, we want to develop models that can generalize and predict responses to stimulus ensembles that are radically different from the ones they were trained on.

One strategy to infer the population model (1) could be to estimate all the parameters by maximizing the likelihood on the complete dataset. However, this approach does not explicitly separate noise and signal correlations in the data, and is expected to misestimate the coupling matrix in the inference procedure. This effect comes from the fact that the stimulus-encoding model is never perfect. As a result, when the stimulus-dependent fields *h*_{i}(*t*) fail to perfectly reproduce single neuron activities, the interaction field may try to compensate for this error by drawing additional information about the stimulus from the activity of other neurons, instead of reserving these couplings for accounting for noise correlations. Previous work [22] combining a simple (i.e. linear) stimulus-encoding model with neuron-neuron couplings, and using maximum likelihood as an inference method, shows a clear example of this effect: the addition of couplings improves the prediction of single-cell firing rates (Fig. 2b of [22]). An extreme version of this phenomenon is also demonstrated in [9], Fig. 15, where the response of one neuron to a natural scene is predicted from the responses of the other neurons without any use of the stimulus.

It should also be noted that the full maximum-likelihood task can be computationally hard. The inference of precise single-cell encoding models such as described in [24] and the inference of complete interaction networks are each computationally costly, and combining both in a single likelihood maximization would require developing new methods and algorithms.

To demonstrate the effect described above in a computationally tractable case, we fitted a Generalized Linear Model (GLM) to our retinal data (see Sec. IIIA for a description of the data). The GLM is arguably the most popular model to account for some dependence on the stimulus and for noise correlations using linear filters [23]. Note that the GLM is not expected to accurately describe retinal activity because of its linear assumptions [24], but it can still be used to make our point that its inference mixes signal and noise correlations. As shown in App. A, straightforward likelihood maximization does not guarantee that coupling terms only reflect noise correlations; rather, they reflect an uncontrolled mixture of signal and noise correlations, which have very different biological interpretations.

In summary, from previous observations and the analysis on our own data with the GLM, we conclude that direct maximum-likelihood optimization of the model’s parameters is not appropriate and will give biased estimates of the model parameters, and in particular of the coupling network.

### C. Inference of neuronal couplings

In order to obtain the parameter values *J*_{ij} that account solely for noise correlations, we need a model that reproduces perfectly the time course of the firing rate of each neuron, thereby setting aside the question of its dependence on the stimulus. To construct such a model, we use repetitions of the same stimulation to estimate the empirical firing rate of each cell, λ_{i}(*t*) (the average number of spikes in each time bin across repetitions). We then let the time-dependent fields *h*_{i}(*t*) in the population model (1) be inferred from the responses to the repeated stimulus along with the couplings *J*_{ij}, by maximizing the likelihood following [22]. In the maximum-likelihood fit, the fields *h*_{i}(*t*) act as Lagrange multipliers (or chemical potentials if the spike count is viewed as a particle number) enforcing the value of the firing rate of each neuron in each time bin, 〈*n*_{i}(*t*)〉 = λ_{i}(*t*), while the couplings *J*_{ij} enforce the noise correlations averaged over the repeated stimulus. Thanks to these constraints, the time courses of the firing rates are exactly reproduced by construction, and so are the resulting stimulus correlations. The inference of the couplings thus only reflects noise correlations.

Note that the model inferred in this way is not a stimulus encoding model. It cannot predict the spiking activity in response to a different stimulus sequence than the one used in the repetition. To do this, we need to learn how *h*_{i}(*t*) depends on the stimulus, which is the object of the next step of the inference.
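The role of the fields and couplings as Lagrange multipliers can be made explicit with a short derivation. The sketch below assumes that (1) is an exponential-family model in which *h*_{i}(*t*) multiplies *n*_{i}(*t*) and *J*_{ij} multiplies *n*_{i}(*t*)*n*_{j}(*t*) in the log-probability. Prefactor conventions and regularization terms are ignored, and $\mathcal{L}$, the log-likelihood per repetition, is notation introduced here for illustration:

$$
\frac{\partial \mathcal{L}}{\partial h_i(t)} = \langle n_i(t) \rangle_{\rm rep} - \langle n_i(t) \rangle_{\rm model},
\qquad
\frac{\partial \mathcal{L}}{\partial J_{ij}} = \sum_{t} \Big[ \langle n_i(t)\, n_j(t) \rangle_{\rm rep} - \langle n_i(t)\, n_j(t) \rangle_{\rm model} \Big].
$$

At the maximum both gradients vanish, so the model reproduces the empirical firing rate in every time bin and the time-summed pairwise moments. Because the rates, and hence the stimulus covariances, are already fixed by the fields, the couplings are left to account for the noise correlations only.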

### D. Conditionally independent model

Let us now assume that a model can be built to predict the firing rate λ_{i}(*t*) of each neuron as a function of the presented stimulus. This step depends on the specific sensory system studied, as well as on the stimulus ensemble. We will see an explicit example of such a model in the case of the retina in Sec. IIID.

From these firing rate predictions, we derive a noninteracting model of neurons, Eq. (4), in which the interaction term *H*_{2} has been removed and the fields are set as a function of the stimulus to enforce the predicted firing rates. This model is conditionally independent, meaning that neurons respond independently of each other, when conditioned on a given stimulus.
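Setting a field to enforce a prescribed rate can be done numerically, one neuron and one time bin at a time. The sketch below assumes a single-cell spike-count distribution of the form P(n) ∝ exp(h n − γ n² − δ n³ − ln n!). This functional form, its sign conventions, the truncation at `n_max` and all function names are assumptions made for illustration and are not taken from the text.

```python
import math

def spike_count_probs(h, gamma, delta, n_max=20):
    """Assumed single-cell distribution, truncated at n_max:
    P(n) proportional to exp(h*n - gamma*n**2 - delta*n**3 - ln n!)."""
    log_w = [h * n - gamma * n**2 - delta * n**3 - math.lgamma(n + 1)
             for n in range(n_max + 1)]
    m = max(log_w)                       # subtract the max for numerical stability
    w = [math.exp(lw - m) for lw in log_w]
    z = sum(w)
    return [x / z for x in w]

def mean_count(h, gamma, delta, n_max=20):
    """Mean spike count <n> under the assumed distribution."""
    probs = spike_count_probs(h, gamma, delta, n_max)
    return sum(n * p for n, p in enumerate(probs))

def field_for_rate(rate, gamma, delta, n_max=20, lo=-50.0, hi=50.0, tol=1e-10):
    """Find the field h such that <n> = rate, by bisection.
    The mean is increasing in h, since its derivative is the variance."""
    if not 0.0 < rate < n_max:
        raise ValueError("rate must lie strictly between 0 and n_max")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_count(mid, gamma, delta, n_max) < rate:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example: field that yields 0.3 spikes per bin on average
h = field_for_rate(rate=0.3, gamma=0.1, delta=0.0)
```

With γ = δ = 0 the assumed distribution reduces to a (truncated) Poisson law, for which the field is simply the logarithm of the rate; this provides a convenient sanity check.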

### E. Putting it together: mean-field correction to network effects

Now that we have inferred the parameters of the interaction network (*J*_{ij}) and of the conditionally independent model (the stimulus-dependent fields), the last step is to combine them to obtain the complete population encoding model (1). In doing so, we must be careful to correct for the effect of the network on the activity of each neuron.

Because of the interaction term *H*_{2}, each cell receives an additional field from the rest of the network. This field is stochastic and explains noise correlations. However, it also generates a mean contribution that affects the cell firing rate. This contribution, denoted Δ*ĥ*_{i}, thus needs to be removed from the stimulus-dependent field.

To estimate this correction, we use the Thouless-Anderson-Palmer (TAP) free energy formalism applied to the model (1) [26] to derive an approximation for Δ*ĥ*_{i}. Following [27] and [28], we apply a second-order Plefka expansion [29] (see App. B for more details). In the result, Eq. (5), the first two terms are the mean-field contributions of network and self couplings, whereas the last two are their Onsager [30] reaction terms. The functions introduced in (5) are defined through averages according to a single-cell distribution in which the field *h*(λ) is set to enforce 〈*n*〉 = λ. Note that the correction (5) only depends on the stimulus through the predicted firing rates, and not on the spike counts *n*_{i}(*t*), as desired.

This computation completes the procedure. The population model (1), now endowed with stimulus-dependent field functions, spike count parameters *γ* and *δ*, and network couplings *J*_{ij}, can be used to predict the population response to any stimulus *S*. Next we apply the procedure to the retinal network.

## III. APPLICATION TO THE RETINA

### A. Description of the data

We reanalyzed a dataset of *ex-vivo* recordings of the activity of retinal ganglion cells (RGCs) in Long-Evans adult rats using a multi-electrode array [24] (see Fig. 1A, top for an example response). The stimulus was a movie of two parallel bars, whose trajectories followed the statistics of two independent overdamped stochastic oscillators. Additionally, a white noise stimulus (random binary checkerboard) was projected for one hour to allow for receptive field estimation and RGC type identification. Raw voltage traces were stored and spike-sorted off-line through a custom spike sorting algorithm [31]. We applied a standard clustering analysis based on the cell response to various stimuli to isolate a population of OFF ganglion cells of the same type. Their receptive fields tiled the visual field to form a mosaic (Fig. 1B).

The ganglion cell spike times were binned in (1/60) s time windows (locked to the stimulus frame rate) to estimate the empirical spike counts, *n*_{i}(*t*), for cell *i* in time bin *t*. The stimulus alternated between non-repeated sequences of random bar trajectories, and a repeated sequence of randomly moving bars, displayed 54 times. Non-repeated sequences were divided into training (two thirds of the total) and testing (one third) sets. Repeated sequences were equally divided into training and testing sets by splitting the repeated trajectory into two halves.
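The binning step can be sketched as follows for a single cell. The function name, its arguments and the toy spike times are hypothetical and only illustrate the counting of spikes in (1/60) s windows; alignment to the actual stimulus frames is not handled here.

```python
import numpy as np

def bin_spike_times(spike_times, duration, bin_width=1.0 / 60.0):
    """Count spikes in consecutive bins of width `bin_width` (seconds).

    spike_times : 1-D sequence of spike times in seconds for one cell
    duration    : length of the recording segment in seconds
    Returns an integer array n[t] of spike counts per time bin.
    """
    n_bins = int(round(duration / bin_width))
    edges = np.arange(n_bins + 1) * bin_width
    counts, _ = np.histogram(spike_times, bins=edges)
    return counts

# Toy example: four spikes in a 0.1 s segment (6 bins of 1/60 s)
counts = bin_spike_times([0.001, 0.005, 0.020, 0.090], duration=0.1)
```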

### B. Stimulus and noise correlations

Before describing the inference of the model, we first briefly characterize the amount and properties of the stimulus and noise correlations. To this end, we estimate the mean firing rate λ_{i}(*t*) as a function of time in response to the repeated stimulus sequence as the empirical mean of *n*_{i}(*t*) across repetitions (Fig. 1A, bottom), and its temporal average over *t* = 1,…,*T*, where *T* is the duration of the repeated stimulus in time bins. We measure the covariance between pairs of cells from the repeated dataset (example in Fig. 1C). The total pairwise covariance, represented in green in Fig. 1C, is the sum of stimulus and noise covariances, and is computed using averages 〈⋅〉_{rep} over stimulus repetitions. The stimulus covariance, shown in black in Fig. 1C, can be calculated from the empirical firing rates λ_{i}(*t*). The noise covariance corresponds to the difference between the total and stimulus covariance. Fig. 1C shows that this difference is significantly different from zero only at zero lag (τ = 0), meaning noise correlations happen at a short time scale. This suggests they may be due to gap junctions [18, 32]. Only cells that are physically close (as measured by the distance between the neurons’ receptive fields) have large noise correlations (Fig. 1D), and their values strongly decrease with distance.
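As an illustration, a zero-lag version of this decomposition can be sketched as follows. The array layout `(R, T, N)` and the function name are assumptions made for this example. The sketch ignores non-zero time lags and finite-sampling corrections, and may differ in normalization from the analysis used for Fig. 1.

```python
import numpy as np

def covariance_decomposition(counts):
    """counts: array of shape (R, T, N) = (repetitions, time bins, cells).
    Returns zero-lag (total, stimulus, noise) covariance matrices, each (N, N)."""
    R, T, N = counts.shape
    counts = counts.astype(float)
    psth = counts.mean(axis=0)            # empirical rate lambda_i(t), shape (T, N)
    mean_rate = psth.mean(axis=0)         # temporal average of the rate, shape (N,)

    # Total covariance: pool all repetitions and time bins
    second_moment = np.einsum("rti,rtj->ij", counts, counts) / (R * T)
    total = second_moment - np.outer(mean_rate, mean_rate)

    # Stimulus covariance: covariance over time of the trial-averaged rates
    stimulus = psth.T @ psth / T - np.outer(mean_rate, mean_rate)

    # Noise covariance: what remains once the stimulus part is removed
    noise = total - stimulus
    return total, stimulus, noise

# Toy data: 10 repetitions, 50 time bins, 3 cells
rng = np.random.default_rng(0)
toy_counts = rng.poisson(1.0, size=(10, 50, 3))
total, stimulus, noise = covariance_decomposition(toy_counts)
```

If every repetition were identical, the total and stimulus covariances would coincide and the noise covariance would vanish, which is a quick consistency check for this kind of routine.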

### C. Coupling network

We applied the procedure described in Sec. IIC to the responses to the repeated stimulus to learn the coupling matrix **J**. The inference was performed by maximizing the log-likelihood (see for example [33]). We added a small L2 penalty (with coefficient *η*_{L2} ~ 2 ⋅ 10^{-6}) on the fields *h*_{i}(*t*) to avoid divergences to -∞ when λ_{i}(*t*) = 0. In order to avoid spurious non-zero values, we also added an L1 penalty (with coefficient *η*_{L1} = 0.04). We further imposed that *J*_{ii} be independent of *i*, for consistency with the single-cell model (see below).

Fig. 2 shows the results of the inference. To evaluate the robustness of the inference with respect to a change of stimulus realization, in Fig. 2A we plot the interactions inferred from the training set against those inferred from another training set of the same size, where the bars followed a different trajectory. The comparison shows that inferred networks are robust against a change of stimulus.

To check the validity of this approach, in Fig. 2B we compare empirical noise correlations obtained with the test dataset with those predicted by the model. To obtain this prediction, we freeze the **J** matrix obtained from the inference on the training set and we re-infer the *h*_{i}(*t*) to match the firing rates of the testing set. The inferred coupling matrix predicts well the noise correlations on a part of the recording that had not been used to learn it. Fig. 2C shows the behavior of the interaction parameters as a function of the distance between the two neurons’ receptive fields. *J*_{ij} decreases with distance slightly faster than the noise correlation (see Fig. 1 for comparison).

### D. A feed-forward single-cell model

To apply the procedure of Sec. IID, we need a model for the encoding of the stimulus by single neurons. We use a previously proposed feed-forward model [24] that was specifically developed to predict responses to two-bar stimuli. The stimulus *S*(*x,t*), representing the time behavior of each pixel, is first convolved with a Gaussian and biphasic factorized kernel *K*_{BP}(*x,t*) and then passed through rectified quadratic units with the two possible polarities, where [*y*]_{+} = max(*y*, 0) and [*y*]_{-} = min(*y*, 0). The resulting intermediate variable Θ(*x, t*) is then fed into a second nonlinear stage: for each cell *i*, it is first convolved with a receptive field and then passed through a nonlinear function *f*_{i} to give the predicted firing rate, with *f*_{i}(*y*) = *a*_{i} ln[1 + exp(*b*_{i}(*y* + *c*_{i}))].

To infer the parameters of the model we follow a simplified version of [34], where only the second non-linear stage is learned. We keep the first stage fixed at a setting that has been shown to work efficiently [24]. For the second stage we apply an iterative procedure where we maximize the log-likelihood of the data under the model given by Eq. 4, penalized by a weighted sum of the L1 and L2 norms of the parameter vector. We used the non-repeated training set to compute the log-likelihood gradient. To avoid overfitting we early-stopped the iterative procedure when the log-likelihood computed on the non-repeated testing set stopped increasing. L1 and L2 penalties were optimized by maximizing the performance on the repeated dataset that we later use for training the population model.
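The two rectifications and the output nonlinearity *f*_{i} defined above are simple to evaluate; the only subtlety is avoiding overflow in the exponential for large arguments. The sketch below is a scalar illustration with hypothetical function names. It covers only these elementary functions, not the convolutions or the quadratic units of the full model of [24].

```python
import math

def rectify_pos(y):
    """[y]_+ = max(y, 0)"""
    return max(y, 0.0)

def rectify_neg(y):
    """[y]_- = min(y, 0)"""
    return min(y, 0.0)

def softplus_rate(y, a, b, c):
    """f(y) = a * ln(1 + exp(b * (y + c))), evaluated without overflow."""
    z = b * (y + c)
    # Identity used: ln(1 + e^z) = max(z, 0) + ln(1 + e^{-|z|})
    return a * (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
```

For large positive arguments the function grows linearly with slope *a*_{i}*b*_{i}, and for large negative arguments it decays to zero, so the predicted rate is always non-negative when *a*_{i} > 0.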

In Fig. 3A we compare the time course of the empirical firing rate, λ_{i}(*t*), with the prediction of the inferred model for an example cell. By computing the Pearson correlation between these two temporal traces (*ρ* = 0.87), we can estimate the performance of the model for each cell. In Fig. 3B we compare this performance with the reliability of the retinal spike activity, estimated as the correlation between two disjoint subsets of responses to the repeated stimulus, and find that they are comparable. In Fig. 3C we show how the model predicts the empirical stimulus covariances. Even if a small underestimation is present, the model accounts for more than 84% of the empirical value (slope of a linear fit).
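Both performance measures reduce to Pearson correlations between time courses. The sketch below is a minimal illustration with hypothetical function names; in particular, splitting the repetitions into even and odd trials is an assumption, since the text only specifies two disjoint subsets.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two 1-D time courses."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

def split_half_reliability(counts):
    """counts: array (R, T) of one cell's spike counts over R repetitions.
    Correlates the firing rates estimated from two disjoint halves of the trials."""
    counts = np.asarray(counts, dtype=float)
    rate_a = counts[0::2].mean(axis=0)    # even-numbered repetitions
    rate_b = counts[1::2].mean(axis=0)    # odd-numbered repetitions
    return pearson(rate_a, rate_b)
```

Model performance for a cell is then `pearson(empirical_rate, predicted_rate)`, which can be compared with `split_half_reliability(counts)` for the same cell.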

### E. Complete population model

The final step is to combine the single cell model and the interaction network, as explained in Sec. IIE. In Fig. 4 we compare the performance of the population (1) and conditionally-independent (4) models on a testing set that was not used for learning. In panels A and B, we check that inferring the population model still preserves the quality of the prediction of single cell activity obtained with the independent model. We compare the performance of the two models in reproducing the firing rate of the recorded cells (same quantity as Fig. 3B). Firing rate is a stimulus dependent quantity and accordingly the two models have similar performance. The fact that the population model’s performance is not degraded compared to the single-cell model validates the approximations made to calculate the corrections within the TAP approach. In addition, the fact that this performance is not improved is also a positive sign: it implies that the couplings do not try to compensate for failings of the encoding model, and only reflect noise correlations. Panel B shows that the two models have similar performance also for spike count variance.

However, the population model largely outperforms the conditionally independent model in predicting the population joint activity. Fig. 4C shows how the population model accounts well for noise covariances on a testing set (blue points). By construction, the conditionally independent model predicts vanishing noise covariance (red points).

### F. Robustness of the inference of the couplings to the choice of stimulus

A major challenge with fitting complex encoding models to neural responses is that they rarely generalize well. Here we ask if the interaction network can be inferred from a qualitatively different stimulus and then applied to our two-bar stimulation. To test for this, we infer the couplings as in Sec. IIIC, but on the response to repeated white-noise (random checkerboard) stimuli. In Fig. 5A we compare the couplings inferred from the white-noise stimulus to the couplings inferred from the two-bar stimulus. We then use the coupling matrix **J** learned on the white-noise stimulus in a complete population model for predicting the response to two-bar stimuli (fitted following Secs. IIID and IIIE). In Fig. 5B we show that this model predicts noise covariances when applied to the two-bar testing set.

This demonstrates that our inference method allows us to generalize from one stimulus type to the other, and that the inferred couplings between neurons are invariant to the stimulus.

## IV. DISCUSSION

We have introduced a model for the spiking activity of a population of sensory neurons responding to a stimulus, in which extrinsic and intrinsic correlations are clearly separated. Our method is general and could be applied to structures other than the retina. It could also be extended to models where the influence of the stimulus on single cell activity is described by different non-linear models than illustrated here, e.g. a deep network with more layers [35].

Our inference strategy allows us to infer couplings between neurons that only reflect noise correlations between neurons, without the interference of stimulus effects due to artifacts of the learning procedure. Such effects can arise when the inference procedure tries to use the activity of other cells as a surrogate of the stimulus itself to improve the predictability of a given cell, compensating for non-linearities in the stimulus dependence that are unaccounted for by the model. The inferred couplings thus show a weak dependence on the stimulus ensemble driving the neuronal response, in contrast to previous attempts where the stimulus ensemble had a major influence on the couplings [5]. Note that checkerboard and two-bar stimulations drive very different responses of the retina. It is thus a remarkable result that noise correlations in the response to a complex film of moving objects can be predicted by couplings inferred from responses to white-noise stimuli. This result can thus be seen as a first step toward the construction of a full model that accounts for large and heterogeneous stimulus ensembles.

Having models that account for noise correlations is a crucial first step to study their impact on coding complex stimuli. This impact can be quantified by comparing the population model to the conditionally independent model (*J*_{ij} = 0), or to alternative population models with different coupling matrices. By comparing the computations performed by these models, future work will be able to understand how these fast noise correlations affect coding in the retina. The same strategy can be used in other sensory structures or any other noisy input-output systems.

## Appendix A Generalized linear model analysis

We fitted a generalized linear model (GLM) to the responses of *N*_{r} = 25 OFF-cells to the moving bar stimuli. We discretized the response and stimulus using temporal bins of length 1.667 ms (i.e. 600 Hz; 10 times smaller than the temporal bins used for the model described in the main paper). The response of the *i*^{th} neuron at time *t* is described by an integer *n*_{i}(*t*) denoting the number of spikes fired by the *i*^{th} neuron in the *t*^{th} time bin. The spatial locations of the presented bar stimuli were discretized into *N*_{x} = 100 equi-sized bins. The stimulus at time *t* was denoted by a binary vector *x*_{t}, of which the component *x*_{i}(*t*) was set to 1 if one of the bars was centred on the corresponding spatial location at time *t*, and 0 otherwise.
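The binary stimulus representation described above can be sketched as follows. The function name and the convention that bar centres are expressed in a normalized interval [`x_min`, `x_max`) are assumptions made for this example.

```python
import numpy as np

def encode_bars(positions, n_x=100, x_min=0.0, x_max=1.0):
    """positions: array (T, n_bars) of bar centres in [x_min, x_max).
    Returns a binary array (T, n_x): 1 where a bar is centred in that spatial bin."""
    positions = np.asarray(positions, dtype=float)
    idx = np.floor((positions - x_min) / (x_max - x_min) * n_x).astype(int)
    idx = np.clip(idx, 0, n_x - 1)        # keep edge values inside the grid
    x = np.zeros((positions.shape[0], n_x), dtype=int)
    rows = np.arange(positions.shape[0])[:, None]
    x[rows, idx] = 1                      # one entry per bar and time bin
    return x

# Two time bins, two bars; in the second bin both bars share a spatial bin
x = encode_bars([[0.105, 0.995], [0.5, 0.5]])
```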

In a GLM, the spiking response of each neuron is assumed to be Poisson distributed, with the mean number of spikes of the *i*^{th} neuron at time *t* given by exp(*r*_{i}(*t*)), where *r*_{i}(*t*) is given by equation A1. In this equation, *b*_{i} denotes a constant bias term, *w*_{ijk} is an element of the temporal stimulus filter (of size [*N*_{r}, *N*_{x}, *N*_{w}]), and *v*_{ijk} is an element of the recurrent filter (of size [*N*_{r}, *N*_{r}, *N*_{v}]). We used a stimulus filter of length *N*_{w} = 200 (i.e. ~300 ms), and recurrent filters of length *N*_{v} = 15 (i.e. ~25 ms).

Model parameters were fitted by maximizing an objective function consisting of the log-likelihood (which, for a Poisson distribution, is given up to a constant by Σ_{i,t} [*n*_{i}(*t*) *r*_{i}(*t*) − exp(*r*_{i}(*t*))]) minus an L2 norm regularization term that promoted smooth filters and thus prevented overfitting. The regularization parameters were chosen to maximize the log-likelihood on test data, held out during model fitting. Further, to reduce the number of parameters, and thus further reduce over-fitting, we assumed that the stimulus filter could be described as the sum of *N*_{rank} = 3 spatio-temporally separable filters, each given by a spatial filter *u* multiplied by a temporal filter *a*. Relaxing this assumption (by increasing *N*_{rank}) did not improve the quality of the model fit.

We evaluated the performance of the GLM in fitting the covariances between the responses of pairs of neurons (Fig. 6A). Interestingly, despite giving a reasonable fit of the total response covariances (Pearson correlation = 0.87), the model gave a poor fit of the noise covariances (compare Fig. 6B with Fig. 4C). We wondered whether this could be because the recurrent filters, *v*, did not just capture interactions between neurons, but also compensated for the inability of the feed-forward filters, *w*, to fully capture the effects of the stimulus on neural firing rates. To see if this was the case, we compared the coupled ‘population model’ described above, with firing rates given by equation A1, with an uncoupled ‘conditionally-independent’ model, where the recurrent filters, *v*, were set to zero. We found that the coupled GLM resulted in improved predictions of the recorded PSTH for nearly all recorded OFF-cells, compared to the uncoupled model (Fig. 6C-D). This suggests that, rather than just fitting the interactions between different neurons, the coupled GLM used the recurrent filters, *v*, to improve the prediction of how each neuron responded to the stimulus. However, it remains to be seen whether further differences between the GLM and the Ising model (e.g. time-dependent recurrent filters, Poisson distributed firing rates) could also contribute to their different performances in predicting the noise covariances.
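Two ingredients of the GLM fit described in this appendix lend themselves to a compact illustration: the Poisson log-likelihood and the low-rank, space-time separable parametrization of a stimulus filter. The function names and array shapes below are assumptions made for this sketch; regularization and the optimization itself are not shown.

```python
import numpy as np

def poisson_log_likelihood(counts, log_rate):
    """Poisson log-likelihood up to a constant: sum(n * r - exp(r)).
    The ln n! term is dropped because it does not depend on the parameters."""
    counts = np.asarray(counts, dtype=float)
    log_rate = np.asarray(log_rate, dtype=float)
    return float(np.sum(counts * log_rate - np.exp(log_rate)))

def low_rank_filter(u, a):
    """u: (rank, N_x) spatial filters, a: (rank, N_w) temporal filters.
    Returns the (N_x, N_w) filter obtained as sum_k outer(u[k], a[k])."""
    return np.einsum("kx,kw->xw", u, a)

# Toy example with rank 3, 100 spatial bins and 200 temporal bins
rng = np.random.default_rng(0)
u = rng.normal(size=(3, 100))
a = rng.normal(size=(3, 200))
w = low_rank_filter(u, a)
```

For one cell, the separable parametrization replaces the 100 × 200 free entries of a full filter by 3 × (100 + 200) parameters, which is the reduction in model complexity motivating this choice.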

## Appendix B Construction of the mean-field theory and Thouless-Anderson-Palmer correction

We are interested in computing the TAP correction to the fields *h*_{i}(*t*) due to the addition of the coupling term **J**, see Eq. (1). Because we are not interested in the TAP expression for couplings nor in that for covariances, we can construct the mean-field theory for a single time-bin. Otherwise, because the couplings are constant in time, we would have had to consider the whole model. To apply the Plefka expansion we introduce an auxiliary free energy, Eq. (B1), which depends on an expansion parameter *α* and in which we neglect the *t* dependence of *H*_{1} because here we focus on a single time-bin. Its Legendre transform, *G*[*α*, **λ**, *J*], involves fields that are defined implicitly through Eq. (B3), in terms of averages with respect to the distribution related to the free energy (B1). Our goal is to expand *G*[*α*, **λ**, *J*] in powers of *α* up to the second order. We first evaluate its derivatives with respect to *α*, using the implicit function theorem applied to Eq. (B3). The Plefka approximation consists in estimating *G*[**λ**, *J*] = *G*[*α* = 1, **λ**, *J*] from the expansion around *α* = 0 evaluated at *α* = 1. We therefore need *G*[*α*, **λ**, *J*] and its derivatives at *α* = 0. For *α* = 0 the system units become independent, so that the field of each unit depends only on its own λ_{i}: the distribution over **n** factorizes over a set of single-variable distributions. This allows us to compute model expectations at *α* = 0. For future convenience, we define the moments of such distributions, in terms of which the terms in the expansion are given by Eq. (B7), where *W* has been defined in Eq. (7).

The mean-field equation for the fields **h** can be easily obtained by a reverse Legendre transform of Eq. (B7), which provides the expression (5) for the TAP correction.
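For orientation, a second-order expansion of the kind described in this appendix has the generic Taylor form below. This is only the schematic structure, written with the notation *G*[*α*, **λ**, *J*] used above; the explicit expressions of the three terms for the model (1) are not reproduced here.

$$
G[\alpha = 1, \boldsymbol{\lambda}, J] \;\simeq\; G[0, \boldsymbol{\lambda}, J]
+ \left. \frac{\partial G}{\partial \alpha} \right|_{\alpha = 0}
+ \frac{1}{2} \left. \frac{\partial^{2} G}{\partial \alpha^{2}} \right|_{\alpha = 0}
$$

In Plefka-type expansions, the zeroth-order term describes independent units, the first-order term yields the mean-field contribution, and the second-order term yields the Onsager reaction term, in line with the structure of the correction (5) described in Sec. IIE.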

## V. ACKNOWLEDGEMENTS

We would like to thank C. Gardella for useful discussion. This work was supported by ANR TRAJECTORY, the French State program Investissements d’Avenir managed by the Agence Nationale de la Recherche [LIFESENSES: ANR-10-LABX-65], an EC grant from the Human Brain Project (FP7-604102), NIH grant U01NS090501 and AVIESAN-UNADEV grant to OM.