## Abstract

Understanding the mechanisms of neural computation and learning will require knowledge of the underlying circuitry. Because it is slow, expensive, or often infeasible to directly measure the wiring diagrams of neural microcircuits, there has long been an interest in estimating them from neural recordings. We show that even sophisticated inference algorithms, applied to large volumes of data from every node in the circuit, are biased toward inferring connections between unconnected but strongly correlated neurons, a situation that is common in strongly recurrent circuits. This effect, representing a failure to fully “explain away” non-existent connections when correlations are strong, occurs when there is a mismatch between the true network dynamics and the generative model assumed for inference, an inevitable situation when we model the real world. Thus, effective connectivity estimates should be treated with special caution in strongly connected networks when attempting to infer the mechanistic basis of circuit activity. Finally, we show that activity states of networks injected with strong noise or grossly perturbed away from equilibrium may be a promising way to alleviate the problems of bias error.

## Introduction

Fully understanding the mechanisms of computation and plasticity in neural circuits requires knowledge of how neurons with specific functional properties are connected. Despite groundbreaking recent developments in direct circuit tracing^{1–6}, obtaining connectivity data is difficult, expensive and slow. Hence the interest in developing and applying statistical methods to estimate connectivity directly^{7–10} from simultaneous activity recordings of many neurons within a circuit^{11–13}.

A crude estimate of coupling strength between neurons can be made from correlations but, as is widely appreciated, correlations can arise from direct synaptic connections or a common input. Statistically sophisticated inference techniques – such as maximum entropy-based inverse Ising inference^{8;14–17}, *l*_{1}-regularized logistic regression^{18}, and generalized linear models^{7;19–22} – can ‘explain away’ correlations that arise from a common observed input, at least when all recurrently connected neurons are observed and the inference model exactly matches the model from which the data are drawn. This explaining away allows for deletion of a direct coupling that might have been drawn between a pair of neurons based on correlation alone (Fig. 1a), yielding connectivity estimates that are substantially sparser than the raw correlation graph.

These methods have been applied to data from low-level sensory circuits in the brain with excellent success in producing improved predictions of neural responses to stimuli, since they account not only for the influence of the stimulus but also for the collective influence of other cells in the network on neural activity^{7;8;17;23;24}. Unfortunately, with the success of these models in activity prediction, it has become a common temptation – one to which many have succumbed – to also interpret the inferred connectivity as biological connections^{7;25–29}.

Yet the match between actual and inferred connections remains untested, even as the inferred connectivity continues to be referred to (by optimistic researchers) as a reasonable proxy for actual connectivity. There are two key requirements for the success of these methods in inferring connectivity, neither of which tends to hold in reality: all nodes must be observed, and the statistical model used to estimate connectivity must match very closely the model from which the data are generated. The problem of unobserved nodes is widely recognized (even if frequently forgotten when interpreting inferred connections); we therefore focus here on the second problem, which arises from model mismatch. We show that the problem of model mismatch (and also of partial observability of nodes) is large for recurrent circuits with strong weights.

We consider a natural question: When do statistically inferred weights in even fully observed circuits reflect true connections between neurons, and under what conditions can we expect the algorithms to (not) perform well in deducing biological connectivity? We hypothesize that in strongly recurrent systems, emergent phenomena – related to pattern formation and the presence of strong long-range correlations – present a fundamental challenge to the problem of explaining away in real data.

To study the question, we construct a simple structured neural network model with local recurrent connectivity whose strength can be dialed between weak (“sensory”) and strong (“memory”) regimes, and which in the latter regime produces emergent activity patterns with strong correlations between unconnected cells. The structured form of the connectivity is chosen for convenience: it makes it easy to visually distinguish between activity correlations, inferred connectivity, and explaining-away errors relative to the generative ground-truth circuit. As we show later, results in the structured model generalize to unstructured (random) networks.

## Results

### Generative network architecture and dynamics

Our generative model is a one-dimensional neural network of threshold-crossing spiking units interacting through rotation-invariant inhibitory recurrent connections **W** in a Mexican-hat profile on a ring (see Fig. 1b-c). Coupling is local: each cell inhibits nearby cells, but cells that are further away are unconnected. In addition to the recurrent input, all cells receive a uniform excitatory drive with a multiplicative noise component (see Methods). Cells generate spikes, which are used to drive the dynamics in the next time-step. The relative contribution of the recurrent inputs (relative to a background noise) can be dialed using a scalar weight strength *r* that multiplies the connectivity matrix **W**. When *r* is weak, network activity is noise-driven and uncorrelated, but increasing *r* transitions the network into a strongly interacting regime where it exhibits self-sustained periodic activity patterns. The pattern consists of multiple bumps, and unconnected nodes in different bumps exhibit strong long-range correlations (Fig. 1d).
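As a concrete illustration, the generative dynamics can be sketched in a few lines. This is a minimal sketch, not the exact Methods: the function names `ring_weights` and `simulate` and all shape, noise, and threshold parameters below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring_weights(n=100, sigma=6.0, cutoff=15):
    """Rotation-invariant local inhibitory kernel on a ring: each cell
    inhibits its neighbours; cells beyond `cutoff` are unconnected.
    Shape parameters are illustrative, not the paper's values."""
    d = np.arange(n)
    d = np.minimum(d, n - d)                 # circular distance
    w = -np.exp(-d ** 2 / (2 * sigma ** 2))  # local inhibition
    w[d > cutoff] = 0.0                      # strictly local coupling
    w[0] = 0.0                               # no self-coupling
    return np.stack([np.roll(w, i) for i in range(n)])

def simulate(W, r, steps=2000, drive=1.0, noise=0.5, theta=0.0):
    """Threshold-crossing units: uniform excitatory drive with
    multiplicative noise, plus recurrent input scaled by r; each step's
    spikes drive the next step (a sketch of the generative model)."""
    n = W.shape[0]
    s = (rng.random(n) < 0.5).astype(float)  # random initial spikes
    out = np.zeros((steps, n))
    for t in range(steps):
        inp = drive * (1 + noise * rng.standard_normal(n)) + r * (W @ s)
        s = (inp > theta).astype(float)      # spike on threshold crossing
        out[t] = s
    return out

W = ring_weights()
spk = simulate(W, r=0.0125)
```

Dialing `r` up moves this toy network from noise-driven activity toward the patterned regime described in the text.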

In all that follows, inferred weights are compared with the true weight matrix after they are rescaled to obtain the best match (see Methods), thus changes in the size of inference error cannot be attributed to a simple mis-scaling of the inferred weights. Similarly, inference across weight strengths is performed on the same number of total spikes in networks with weak and strong weights, and inference differences cannot be attributed to different amounts of data.
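The “best match” rescaling of the inferred weights has a closed form as a one-parameter least-squares fit. A sketch, assuming the normalized *l*_{2} error used later; the exact Methods normalization may differ in detail:

```python
import numpy as np

def rescaled_error(W_true, W_hat):
    """Normalized l2 inference error after rescaling the inferred
    weights by the least-squares optimal scalar, so that mis-scaling
    alone cannot contribute to the reported error."""
    w, wh = W_true.ravel(), W_hat.ravel()
    alpha = (w @ wh) / (wh @ wh)          # argmin_a ||w - a*wh||
    return np.linalg.norm(w - alpha * wh) / np.linalg.norm(w)
```

Under this measure, an estimate that is a perfect scalar multiple of the truth scores an error of zero.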

To first quantitatively and qualitatively relate the weight strengths (*r*) to operating regimes of networks in the brain, we consider the effect of *r* on noise correlations, activity pattern coherence, and the stability of internal states over time.

Increasing the strength *r* of recurrent connections leads to stronger noise correlations (see Methods, Fig. 1e). These correlation values can be compared directly with measured values of the mean absolute pairwise noise correlation in various sensory and non-sensory brain areas including V1^{30–32}, V4^{33;34}, MT^{35–38}, and hippocampus ^{39} (Fig. 1e, colored bands). As seen, medium to medium-large values in our range of *r* produce noise correlations consistent with primary and non-primary sensory processing stages in cortex; large values correspond to noise correlations from CA1 and CA3, areas associated with strong recurrent processing and memory. Medium-low and low values of *r* might correspond to weights that are weaker than found in even primary sensory areas because the correlations are smaller than seen in primary visual cortex. (But note that this comparison does not take into account the possibility of correlated noise inputs into primary sensory cortex^{40–42}; thus we also use more qualitative measures, below, to relate values of *r* to different operating regimes in the brain.)

The structure of the noise correlation matrix, not just the size of the noise correlations, itself evolves with *r*: At weak *r*, where neurons are largely noise-driven, noise correlations are small and the noise matrix exhibits high variance with weak signatures of the true connectivity matrix (Fig. 1f, first panel). At intermediate *r*, the influence of noise decreases and activity correlations better reflect connectivity (1f, second panel). As *r* continues to increase, however, pattern formation sets in and noise correlations no longer reflect the true weights, capturing instead the correlations induced by the emergent periodic activity pattern (1f, last two panels).

Qualitatively, different values of *r* move the network along the sensory-memory continuum, defined by the coherence and diffusivity (see Methods) of activity states. Coherence is a measure of the fidelity of the shape of the activity profile over time (regardless of where the activity is centered); as *r* is increased, the network moves from a regime with zero pattern coherence to one with maximal coherence (Fig. 1g). The activity pattern itself drifts over time due to noise; this drift takes the form of a non-restorative random walk (Ornstein-Uhlenbeck process), which we quantify by its diffusivity (see Methods). As weights increase in strength, the patterns become increasingly stationary and resistant to noise-induced drift, thus the diffusivity drops sharply (Fig. 1g). When diffusivity is low, the initial pattern phase is not rapidly lost and the phase can be used as a memory^{43}. In Fig. 1g, the horizontal line marks the diffusivity value at which the RMS spread of the pattern phase equals half the pattern wavelength after 0.5 seconds; the starting phase is on average completely forgotten on this time-scale. The weight strength at this point is *r* = 0.0125; because both coherence and diffusivity change sharply around this value, we take it as our working boundary between the ‘sensory’ and the ‘memory’ regimes of the circuit. The sensory regime close to the memory boundary is highly amplifying, with slow responses.
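One way to measure the diffusivity is to track the phase of the dominant spatial Fourier mode of the population activity and read off the mean squared displacement of its random walk. The estimators `pattern_phase` and `diffusivity` below are our illustrative choices, not necessarily the ones in Methods:

```python
import numpy as np

def pattern_phase(spikes, wavelength):
    """Phase of the dominant spatial Fourier mode of the population
    activity -- one common way to track where the bumps sit."""
    n = spikes.shape[1]
    k = n // wavelength                          # number of bumps on the ring
    mode = spikes @ np.exp(2j * np.pi * k * np.arange(n) / n)
    return np.unwrap(np.angle(mode))

def diffusivity(phase, dt):
    """Diffusivity of the phase assuming a pure random walk:
    <(phi(t+tau) - phi(t))^2> = 2 * D * tau."""
    dphi = np.diff(phase)
    return np.mean(dphi ** 2) / (2 * dt)
```

For a perfectly stationary pattern the phase is constant and the estimated diffusivity is zero; for a pure random walk the estimator recovers the walk's diffusion constant.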

### Inference at different weight strengths

We can now quantify the quality of circuit inference along the sensory-memory continuum defined above, beginning with the best-case experimental scenario, in which every neuron in the circuit is observable. We consider fully-observable data to highlight inference problems that arise in addition to the already well-known problems of inference in partially observed neural circuits.

Consider a dataset of spikes generated from the dynamical neural network model, with 10^{8} total spikes from the network at each value of *r* (see Methods). Because of its good performance in circuit activity prediction, at least at the sensory periphery ^{7}, we apply a generalized linear model (GLM) to the data to extract an estimated weighted connectivity matrix, then measure inference error as the normalized *l*_{2}-distance between the ground-truth and the properly-rescaled inferred weight vectors (see Methods).
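A minimal version of such a GLM fit, with an exponential nonlinearity and a single one-step coupling filter per neuron, can be written with a small Newton (IRLS) solver. This is a sketch: the paper's GLM likely includes self-history and other filter terms that we omit, and `fit_poisson_glm`/`glm_couplings` are our names.

```python
import numpy as np

def fit_poisson_glm(X, y, n_iter=50, ridge=1e-6):
    """Newton/IRLS fit of a Poisson GLM with exponential link:
    y_t ~ Poisson(exp(b0 + X_t . b)). Returns (intercept, coefficients)."""
    Xa = np.hstack([np.ones((len(X), 1)), X])      # prepend intercept column
    beta = np.zeros(Xa.shape[1])
    for _ in range(n_iter):
        mu = np.exp(np.clip(Xa @ beta, -30.0, 30.0))
        grad = Xa.T @ (y - mu) - ridge * beta
        H = Xa.T @ (Xa * mu[:, None]) + ridge * np.eye(beta.size)
        beta = beta + np.linalg.solve(H, grad)
    return beta[0], beta[1:]

def glm_couplings(spikes):
    """Estimated coupling matrix from per-neuron Poisson GLMs: regress
    each neuron's count at time t on the whole population's counts at
    t-1; row i holds the inferred inputs onto neuron i."""
    X, n = spikes[:-1], spikes.shape[1]
    W_hat = np.zeros((n, n))
    for i in range(n):
        _, W_hat[i] = fit_poisson_glm(X, spikes[1:, i])
    return W_hat
```

Applying `rescaled_error`-style comparisons to `W_hat` at different *r* values reproduces, in miniature, the inference-quality sweep described in the text.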

When weights are weak, the recurrent connections have a small effect on neural activity relative to the ongoing noise, thus the signal-to-noise ratio (SNR) is low and the inferred connectivity exhibits uncorrelated errors (Fig. 2a, first panel), similar to the effects seen in the noise correlation matrix. The SNR improves, and inference error decreases with increasing weights (Fig. 2a, second panel), but only up to a point. Beyond this point, the inferred connectivity matrix begins to exhibit a new kind of error, visible as side-bands in the inferred matrix, last two panels of Fig. 2a (the distribution of inference errors, Supp. Info. §S.9 and Supp. Fig. S3, also shows a change in the pattern of errors). These errors are systematic (biased) overestimates of the existence of connections and magnitudes of weights between unconnected neurons. They result from a partial failure to explain away strong correlations, which are induced not by direct connections but by the emergence of stable activity patterns at the population level. Quantitatively, the GLM estimate is closer to the true connectivity matrix than is the raw noise correlation matrix, but the GLM estimate nevertheless exhibits a similar qualitative pattern of errors (compare Figs. 1f and 2a).

At any weight strength, the total inference error Δ can be decomposed into two orthogonal vector components: a variance Δ_{v} and a bias Δ_{b} (see Methods). As the weights increase in strength, the inference error vector rotates from being variance-dominated to bias-dominated (Fig. 2b). The total error is smallest for intermediate weights, when variance and bias contributions are both relatively low (Fig. 2c). The point of smallest error is far to the left of the sensory-memory boundary, in the very low weight regime (cf. regimes defined in Fig. 1e,g). When the inference model is not exactly matched to the data-generating model, as is typical, the minimal total error point is not an invariant quantity that reflects a critical point of the network dynamics or the inference process. Rather, the optimal inference point depends on the volume of data used for inference, as we show next.
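Given repeated estimates from independent data subsamples, the decomposition follows the standard bias-variance identity. A sketch (the Methods may normalize the components differently):

```python
import numpy as np

def bias_variance(W_true, W_hats):
    """Decompose the mean squared inference error over repeated
    estimates (e.g. from independent data subsamples) into orthogonal
    parts: mean ||What - W||^2 = ||bias||^2 + variance, where the bias
    vector is the mean estimate minus the truth."""
    E = np.asarray([w.ravel() for w in W_hats])
    w = W_true.ravel()
    m = E.mean(axis=0)
    bias_sq = np.sum((m - w) ** 2)                     # squared bias
    var = np.mean(np.sum((E - m) ** 2, axis=1))        # variance
    total = np.mean(np.sum((E - w) ** 2, axis=1))      # total squared error
    return bias_sq, var, total
```

The identity `total = bias_sq + var` holds exactly for the sample mean, which is what makes the two components orthogonal.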

### Variance but not bias errors decline with added data

The variance and bias errors behave differently as the volume of data available for inference is increased: examples at opposite ends of weight strength, where variance versus bias errors predominate, respectively, are shown in Fig. 3a, left versus right (the total squared inference error is the sum of the squares of the variance and bias terms, Δ^{2} = Δ_{v}^{2} + Δ_{b}^{2}). Data volume is defined as the total number of spikes used for circuit inference; because the average firing rate is held fixed, data volume is proportional to the time needed to collect the data. At the weakest weights, where error is nearly purely due to noise and thus variance-dominated, the squared error decreases inversely with data volume, as expected (Δ^{2} ∝ 1/*M* for data volume *M*): Fig. 3a inset at bottom-left, with error well-fit by a line of slope −1 on a log-log plot (see Supp. Fig. S4a with fits and confidence intervals). Substantial bias errors arise for stronger weights, and these do not shrink with data volume, Fig. 3a, right (green area). Bias errors persist with data volume because they arise from highly-correlated neural activity states that are themselves highly structured and persistent.

It is clear from the full surface plots of total error (Fig. 3b), the relative fraction of bias errors (quantified by the normalized angle *θ*_{b} between the total error vector and the bias axis, Fig. 3c), and the bias and variance errors (Fig. 3d-e) that at all weight strengths the variance error decreases readily with added data while the bias term remains impervious. The bias error is therefore the asymptotic inference error in the limit of infinite data. Given that added data erodes the variance component of errors while bias errors are maintained, the point of minimum inference error should steadily move leftward with increasing data volume, as observed in 3f (dashed black line with arrow).

When using matched models for the data generation process and inference (e.g. using the Ising model for both), a scenario that is empirically implausible since we can never know the real model from which the data are generated, inference errors are simply variance errors at all weight strengths, and decay with data volume according to the same power-law (see Supp. Info. §S.1). In this case, the weight strength for best inference is an intrinsic critical point of the network (corresponding to maximal magnetic susceptibility in the Ising network, which in turn is directly related to the maximization of Fisher Information ^{44}).

### Data required to perform accurate inference in memory networks is infeasible

From a practical or experimental perspective, the relevant question is how much activity data must be collected from a circuit to obtain an estimate of its connectivity at a specified precision, and how this value differs for sensory versus memory circuits.

Thus, we fix the desired inference error (to 0.1194, corresponding to the inference error at *r* = .0025, horizontal line in Fig. 3f), and determine the data volume required, on average across multiple subsamplings of the data, to fall within 1% of this value. This process is equivalent to plotting a slice through the surface of Fig. 3b at a specified error level.

The data required to achieve a given inference error grows toward both extremes of weak and strong weights (Fig. 3g, black dots). As weights decrease, the influence of the recurrent connections shrinks and the data required to recover any information about the connections grows, diverging in the limit of vanishing weights. At the opposite end, the data demand escalates more steeply, and appears to diverge at a *finite* weight strength. We empirically fit this data demand curve with a sum of two power laws (3g, black, red, and blue curves). The first has the form ∝ *r*^{−*n*_{w}}, diverging at *r* = 0, while the other diverges at some finite weight *r*_{max} determined by the fit, and has the form ∝ (*r*_{max} − *r*)^{−*n*_{s}}. The best-fit yields the exponent *n*_{w} ≈ 2 for the first component, and for the second, *n*_{s} ≈ 1.4, diverging at *r*_{max} = 0.0157 (Fig. 3g, brown line). A network with stronger weights than *r*_{max} = 0.0157 cannot be inferred to the desired accuracy with *any* volume of data.
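The fitting form can be reproduced directly. The sketch below fits the sum of two power laws to hypothetical (*r*, data-demand) points standing in for the measured curve; all numerical values are illustrative, not the paper's data.

```python
import numpy as np
from scipy.optimize import curve_fit

def data_demand(r, a, n_w, b, n_s, r_max):
    """Empirical fitting form for the data-demand curve: one power law
    diverging at r = 0 plus one diverging at a finite weight r_max."""
    return a * r ** (-n_w) + b * (r_max - r) ** (-n_s)

# Hypothetical (r, demand) points standing in for the measured curve,
# generated from the fitting form itself with illustrative parameters:
r = np.linspace(0.002, 0.014, 20)
y = data_demand(r, 1e-4, 2.0, 1e-3, 1.4, 0.0157)

# Bounds keep r_max above the sampled range so (r_max - r) stays positive.
popt, _ = curve_fit(data_demand, r, y,
                    p0=[1e-4, 2.0, 1e-3, 1.4, 0.016],
                    bounds=([0, 0.5, 0, 0.5, 0.0145],
                            [1.0, 4.0, 1.0, 4.0, 0.05]))
a_fit, n_w_fit, b_fit, n_s_fit, r_max_fit = popt
```

The recovered `r_max_fit` marks the weight strength beyond which the fitted demand diverges, mirroring the role of *r*_{max} in the text.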

We can better understand the divergence of the data demand curve at a finite strong weight by plotting the obtained *r*_{max} on Fig. 3f: it corresponds to the point where the extrapolated infinite-data inference error (green curve) crosses past the criterion error (horizontal line), due to data-intractable bias errors at stronger weights.

### Results generalize across generative and inference models

In addition to using GLMs, we perform inverse inference on an Ising model (which corresponds to the maximum entropy model that fits the data means and covariances under the assumption of binary responses, see ^{45}) using various techniques to obtain an estimate of connectivity from activity. Exact inverse Ising inference on a generalized Ising model (all-to-all connectivity with unequal weights) is NP-hard ^{46}, requiring large amounts of data and intensive computation. Different algorithms fall at different points on a speed-accuracy tradeoff curve: Boltzmann machines perform optimal inference in the limit of large amounts of data if the data comes from the inference model class, but are slow. The simplest mean-field approaches are fast but suboptimal in accuracy.

We used an algorithm called minimum probability flow (MPF)^{47}, which provides approximately optimal solutions at intermediate computational complexity and guarantees convergence to the correct (maximum likelihood) parameters in the asymptotic data limit.
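For a pairwise Ising model, the MPF objective only compares each observed state with its single-spin-flip neighbours, so no partition function is needed. A minimal sketch on ±1 spins with illustrative toy data (the gradient is left to finite differences for brevity; a practical implementation would supply it analytically):

```python
import numpy as np
from scipy.optimize import minimize

def mpf_objective(params, X):
    """Minimum probability flow objective for a pairwise Ising model on
    +/-1 spins: for every data state and every single-spin flip,
    accumulate exp(-DeltaE/2), where DeltaE is the energy change of the
    flip; its minimizer is a consistent estimate of couplings/fields."""
    T, n = X.shape
    iu = np.triu_indices(n, 1)
    J = np.zeros((n, n))
    J[iu] = params[:iu[0].size]
    J = J + J.T                      # symmetric couplings, zero diagonal
    h = params[iu[0].size:]
    dE = 2 * X * (h + X @ J)         # energy change of flipping each spin
    return np.exp(-dE / 2).sum() / T

# Toy data: spins 0 and 1 anticorrelated, spin 2 independent noise.
rng = np.random.default_rng(0)
s0 = rng.choice([-1.0, 1.0], size=2000)
s1 = -s0 * rng.choice([1.0, 1.0, 1.0, -1.0], size=2000)
s2 = rng.choice([-1.0, 1.0], size=2000)
X = np.stack([s0, s1, s2], axis=1)

res = minimize(mpf_objective, np.zeros(6), args=(X,), method="L-BFGS-B")
J01, J02, J12 = res.x[:3]            # pair couplings (0,1), (0,2), (1,2)
```

Because ΔE is linear in the parameters, the objective is convex, so the optimizer lands in the global minimum; here it should assign a negative coupling to the anticorrelated pair and near-zero couplings to the noise spin.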

Binarizing the data, as required for the method, results in information loss and poorer inference than GLMs and logistic regression (see Supp. Info. §S.3, Supp. Fig. S2). We performed inference at various bin sizes, and display results for the minimal-error result relative to the ground truth (Supp. Info. S.3, Supp. Fig. S2). Selection of the best result (and thus the best bin-size) cannot be done when the ground-truth is unknown, as is usually the case when doing circuit inference, but here it provides an upper-bound on the performance of such methods. (The best bin size, across weight strengths, is approximately the time-scale of single-neuron integration, suggesting that this is a good choice in general.)

At the other end of the speed-accuracy tradeoff for inverse Ising inference are mean-field methods, which are fast approximations. Among these we implemented the naïve mean field, the Thouless-Anderson-Palmer (TAP)^{48}, and Sessak and Monasson (SM)^{49;50} approximations (see Supp. Info §S.4).

The naïve mean field method simply uses the negative inverse of the activity covariance matrix, so we consider it as the ‘raw correlations’-based connectivity estimate. This serves as a lower benchmark for circuit inference, and is plotted on Fig. 4a using spike count data as a comparison against the GLM and logistic regression (described next), and using binarized spike data as a comparison against the rest of the inverse Ising methods.
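The naïve mean-field estimate is essentially a one-liner. A sketch, with the diagonal discarded since self-couplings are not part of the connectivity estimate:

```python
import numpy as np

def naive_mean_field(X):
    """Naive mean-field inverse Ising estimate: couplings are the
    negative inverse of the activity covariance matrix."""
    C = np.cov(X.T)                  # X: (time, neurons)
    J = -np.linalg.inv(C)
    np.fill_diagonal(J, 0.0)
    return J
```

Because the inverse covariance (precision) matrix encodes partial correlations, directly correlated pairs receive positive couplings while conditionally independent pairs fall near zero, which is exactly why this serves as the ‘raw correlations’ baseline.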

Next, we consider logistic regression and *l*_{1}-regularized logistic regression^{18;51;52}, in which the response of each neuron is regressed onto the activity of all the rest. The response variable must be binary, but the predictor values can have non-binary counts, thus we only binarize the spike data of the response neuron. Logistic regression achieves slightly better performance than the GLM, Fig. 4a.

Since imposing an *l*_{1} penalty on the inferred weights prioritizes sparseness in the inferred weights^{53}, and thus might push the network toward pruning or shrinking weights that make up the bias errors (the off-diagonal side-bands of Fig. 2a), we apply *l*_{1}-regularized logistic regression^{18} (and tune the size of the regularization parameter at each weight to optimize inference accuracy, see Supp. Info. §S.8). However, *l*_{1}-regularization does not eliminate side-bands and barely improves inference quality relative to unregularized regression (Supp. Fig. S5). Rather, *l*_{1}-regularization forces all small weights to zero and retains larger weights, truncating or steepening the flanks of all peaks in the inferred weight matrix while retaining all peaks (including the side-bands).
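A sketch of this regression setup, using scikit-learn's *l*_{1}-penalized logistic regression. The per-neuron loop, one-step-back predictors, and fixed penalty `C` are illustrative simplifications; the paper tunes the regularization strength at each weight value.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def l1_logistic_couplings(spikes, C=1.0):
    """Sparse couplings via l1-regularized logistic regression: for
    each neuron, binarize its own spike train as the response and
    regress it on the raw (non-binarized) counts of all other neurons
    one step back. Row i holds the inferred inputs onto neuron i."""
    n = spikes.shape[1]
    X = spikes[:-1]
    W_hat = np.zeros((n, n))
    for i in range(n):
        y = (spikes[1:, i] > 0).astype(int)     # binary response neuron
        if y.min() == y.max():                  # degenerate: never/always spikes
            continue
        cols = [j for j in range(n) if j != i]  # predictors: all other cells
        model = LogisticRegression(penalty="l1", solver="liblinear",
                                   C=C).fit(X[:, cols], y)
        W_hat[i, cols] = model.coef_[0]
    return W_hat
```

Shrinking `C` strengthens the penalty and zeroes small couplings, which is the sparsifying behaviour discussed in the text.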

When the recurrent weights are small, all inference methods, regardless of their statistical sophistication, perform equally well, and no better than the raw activity correlation matrix, at estimating connectivity. The main difference in the quality of the inference methods at low weights is the expected better performance of models that take spike counts rather than binarized spike trains into account, Figure 4a-b (solid versus dashed lines).

All methods replicate the qualitative trend of large inference errors at weak and strong weights, with best inference in the middle. At strong weights, all methods (with some mismatch) yield large bias errors that are impervious to additional data, Fig. 4b.

We switch generative models (while retaining the network architecture) then again perform inference, using a GLM with an exponential nonlinearity to fit data from both a linear-nonlinear-Poisson model and a GLM with rectifying nonlinearities. These produce the same qualitative patterns of error (see Supp. Fig. S6), showing that the results are not specific to a particular model for generating data.

Finally, the described effects are not specific to structured, low-dimensional ring-like circuit architectures, which we used to readily illustrate the pattern of bias errors. We consider next a network with all-to-all random symmetric connectivity. Connection strengths are drawn uniformly and randomly from the same range as the Mexican hat weight profile of the ring network. Circuit inference (here, using a GLM) displays the same patterns of errors as before, 4c (plots show the absolute values of actual and inferred weights, with inferred weights normalized according to two choices: minimizing the *l*_{2} or the *l*_{1} distances between the inferred and true weights).

Inference in the weak-weight regime is good, with inferred weights falling close to the unity line, but the strong-weight inferences deviate strongly from the true weights. Bias errors in connectivity estimation reside in the widespread scatter of inferred weights above the high-density blue region. These errors are reflected in the correlation coefficients of inferred vs true weights, which is a scale-independent measure of fidelity (*ρ* = 0.99 vs *ρ* = 0.32 for weak and strong weights, respectively, Fig. 4c).

In sum, the problem of overestimation of connectivity through bias errors at weight strengths that correspond to strongly amplifying sensory networks and to memory networks generalizes across models for data generation and inference, and across network connectivity architectures, whenever there is some mismatch in the inference model relative to the model generating the data (see Supp. Info. S.1 for inference in which the data generation model and the inference model are perfectly matched).

### Sampling data during non-equilibrium dynamics improves inference in strongly recurrent circuits

We consider what happens if activity data are sampled when the network is out-of-equilibrium. We initialize a network with strong weights (*r* = 0.025) far out of equilibrium, with low activity across neurons, Fig. 4d. The network states gradually evolve until they transition into an equilibrium activity pattern with strong correlations. As the states evolve, we collect activity data. Using the increasing amounts of data collected as the states evolve, but before pattern onset, improves inference (Fig. 4d). Interestingly, the slope with which inference improves with data volume is steeper in this non-equilibrium strong-weight network than for the equilibrium weak-weight network (due to higher SNR). Once the pattern emerges, however, the activity correlations contribute to bias error, and continuing to collect data in this regime now hurts inference, resulting in a worse match with the true weight matrix than if we used data collected only before pattern onset. This result shows that non-equilibrium activity data might result in much better inference results in strongly coupled recurrent neural networks ^{54}.

### The circuit identification problem is inherent to activity states in strongly coupled circuits

The problem of circuit mis-identification when network weights are strong can be attributed to the very nature of the activity data in this strongly-coupled regime. Specifically, in the strong-weight regimes of strong sensory amplification or memory dynamics, the inherent distinguishability of spiking data generated from the true model and other candidate models is much smaller than when connections are weak. In other words, if different circuits generate convergently similar activity patterns, no inference method can distinguish between them. Concretely: if the same neural network model generates data according to the true circuit and according to the mis-identified inferred circuit, how inherently similar or distinguishable are the resulting spike patterns?

Consider two weight matrices, **W**, our true circuit with only local synapses, and **W**′ with non-local synapses (Fig. 5). (**W**′ was obtained by inverse Ising inference on spike data generated by a dynamical neural network using the weight matrix **W** in the strongly recurrent regime, see Methods.) Our question is then: how inherently distinguishable are **W** and **W**′ in terms of the activity they produce at different recurrent weight strengths, independent of any inference procedure?

At each time-bin we record a vector **σ** of spike counts from all the nodes of a network, a spike word, which we may call the observed network state at that time. The distribution *p*(**σ**) of the states collected over time is the joint distribution of the activities of all the network nodes. The relative entropy (KL divergence) between the distributions produced by two circuits serves to measure the dissimilarity of the states they generate. This relative entropy, measured between two parametrically interpolatable models, is the Fisher Information contained in the data about the models (see Methods).

Such measures are accurate only when the individual distributions are well-sampled, and the state space of *N*-neuron networks is vast when *N* is even moderately large (here, *N* = 100). We employ two strategies to make the problem tractable: we consider only binarized spike data (Supp. Info. S.3), and instead of considering length-100 spike words, we consider spike words of smaller, 10-neuron segments of the network, whose lower-dimensional distribution we can sample sufficiently well (see Methods).

The distributions (relative frequencies) *p*_{**W**}(**σ**) and *p*_{**W**′}(**σ**) of the 2^{10} different spike states **σ** of 10-neuron segments from the two circuits at different weight strengths are shown in Fig. 5c. The states are sorted along the horizontal axis according to their frequency in *p*_{**W**}(**σ**); the heights in green and yellow at a given point along the abscissa denote the relative frequencies of the same states **σ** in the activities of the two circuits.

When weights are weak, activity in both circuits is driven almost exclusively by the noisy feedforward input, thus state probabilities are very similar and flat in both models *p*_{**W**} and *p*_{**W**′}; the entropy of the states is the maximum of 10 bits (Fig. 5d). As recurrent strength is increased, both distributions contract in state space and their entropies drop. This occurs earlier in the **W**′ circuit because the added non-local recurrent couplings drive an earlier onset of strong correlations or patterning; thus *p*_{**W**′} is more narrowly peaked (note logarithmic vertical scale) than *p*_{**W**} for weaker weights, and the two circuits are distinguishable. As the weights continue to increase, pattern formation catches up in **W**, while the state space of both models continues to shrink. However, the set of activity patterns and thus the state distributions converge again at strong weights, meaning that the two models become inherently less distinguishable, at least under the dynamics of the data-generating model. The same result holds when considering larger segments of the network (up to 22 nodes; see Methods and Supp. Fig. S7).
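The empirical spike-word distributions and their entropies can be computed directly once words are encoded as integers. A sketch for 10-neuron segments (the function names are ours):

```python
import numpy as np

def word_distribution(binary_spikes, segment=range(10)):
    """Empirical distribution over the 2^k binary spike words of a
    small k-neuron segment (k = 10 keeps the state space samplable)."""
    seg = np.asarray(binary_spikes)[:, list(segment)].astype(int)
    codes = seg @ (1 << np.arange(seg.shape[1]))   # word -> integer code
    p = np.bincount(codes, minlength=2 ** seg.shape[1]).astype(float)
    return p / p.sum()

def entropy_bits(p):
    """Shannon entropy of a distribution, in bits."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```

Fully noise-driven segments approach the 10-bit maximum entropy, while strongly patterned activity concentrates on a few words and the entropy collapses, as in Fig. 5d.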

These distributional effects can be quantified with three closely related metrics: the relative entropy, the likelihood ratio of **W** vs **W**′ given the activity data *p*_{**W**}, and the Fisher information. (The latter measures the sensitivity with which one can determine the value of a continuous parameter in a generative model from the data; the parameter here is the linear interpolation variable *θ* in a generative circuit model where the recurrent connections are given by *θ***W** + (1 − *θ*)**W**′, see Methods.) Larger values of these metrics indicate that data produced by the two circuits are inherently more distinguishable. It is clear from these measures that the alternative models are maximally distinguishable close to the sensory-memory transition, at the onset of patterning, and become more indistinguishable (for a fixed volume of data) at both stronger and weaker recurrent weights. This result is consistent with findings that the volume of statistically distinguishable models is maximal near the critical points of a system and that models far from the critical point are difficult to tell apart using their data^{44}.
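Of these metrics, the relative entropy between two empirical word distributions is the most direct to compute. A sketch, with a small regularizer for states never observed under the reference distribution (an implementation choice of ours, not from the paper):

```python
import numpy as np

def kl_bits(p, q, eps=1e-12):
    """Relative entropy D(p||q) in bits between two empirical
    spike-word distributions; eps guards against states with zero
    probability under q."""
    p = np.asarray(p, float)
    q = np.asarray(q, float) + eps
    q = q / q.sum()
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / q[mask])).sum())
```

Identical distributions score (near) zero, and the score grows as the two circuits' state distributions separate, peaking near the sensory-memory transition in the analysis above.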

To summarize these results, when weights are very weak, the occupied state spaces of competing models are essentially as large as the full state space of uncoupled neurons, and thus are fully overlapping and indistinguishable, (Fig. 5f, left). At intermediate weights, the occupied state space shrinks for each model, as does their overlap (Fig. 5f, center). At high weights, the state spaces of the competing models continue to shrink, but instead of growing more distinguishable they shrink toward a common point and their overlap grows again (Fig. 5f, right).

Qualitatively, the distributional convergence of activity states for the competing models shows that it becomes intrinsically difficult, for any algorithm, to distinguish between them at high recurrent weights, consistent with the findings above for various specific inference algorithms. Quantitatively, however, each of the specific inference algorithms suffers from an additional effect when there is model mismatch: because the bias errors do not shrink with data volume but the variance errors do, the best inference, depending on data volume, usually occurs at weaker weights than the point of optimal inherent distinguishability (were the inference and generative models matched).

### Strong weights greatly exacerbate inference error in partially observed networks

In partially observed networks, it is impossible in principle, without additional assumptions, to infer the existence of a connection between a pair of neurons from observed activity, because of the possibility that observed correlations are due to an unobserved common input (e.g. Fig. 1a, if the neuron at the bottom were not observed). To decouple these known problems of inference in partially observed networks^{55–61} from the more subtle but important problems of inference in recurrent circuits with strong weights, we considered the fully observed setting above. Now we examine how the errors that already result from inference in partially observed networks are affected by the strength of weights within the circuit.

As before, we apply a GLM to infer circuit connectivity. This time, we use activity data from only a subset of neurons to reconstruct connectivity within that subset. We repeat this process over multiple subsets of the same size, and “merge” the inferred sub-circuits together (for each neuron pair *i*, *j*, we average the inferred weights *Ŵ*_{ij} from all sub-circuits that contained that pair) to obtain a complete *N* × *N* circuit that can be visualized. For fixed, weak weights, as the observed fraction shrinks, the side bands become more pronounced (Fig. 6a, left two panels), illustrating that partial observation leads to qualitatively similar patterns of bias error as when weights are strong. Similarly, for a given observed fraction, increasing the weight strength produces stronger side bands (Fig. 6a, second and fourth panels).
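The merging step can be sketched as follows. This is a minimal illustration, not the paper's code: `merge_subcircuits` and its `(indices, matrix)` bookkeeping are assumed names, and the merged weight for each neuron pair is simply the average over all sub-circuits that contained that pair, as described above.

```python
import numpy as np

def merge_subcircuits(sub_estimates, N):
    """Merge inferred sub-circuit weight matrices into one N x N estimate.

    sub_estimates: list of (indices, W_hat) pairs, where `indices` holds the
    neuron ids of one observed subset and W_hat is the n x n matrix inferred
    from that subset's activity alone. For each pair (i, j), the merged
    weight is the average of the estimates from every sub-circuit that
    contained both neurons.
    """
    total = np.zeros((N, N))
    counts = np.zeros((N, N))
    for idx, W_hat in sub_estimates:
        idx = np.asarray(idx)
        total[np.ix_(idx, idx)] += W_hat   # accumulate estimates per pair
        counts[np.ix_(idx, idx)] += 1      # how many sub-circuits saw the pair
    merged = np.divide(total, counts, out=np.zeros_like(total),
                       where=counts > 0)   # average; 0 where never observed
    return merged, counts
```

Pairs never covered by any subset are left at zero, which is why the paper chooses enough random subsets to cover every synapse with high probability.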

Quantitatively, at fixed data volume the inference error rises linearly as the observed fraction shrinks. Increasing the weight strengths increases the rate at which circuit inference deteriorates with shrinking observed fraction (Figs. 6b, c, d). In other words, the errors introduced by partial observation are milder when the weights are weak and larger when they are strong, even when the architecture is fixed.

The increase in the slope of inference error versus observed fraction at larger weights is due to mounting bias errors (Fig. 6e). At the weakest weights, inference quality in the 50% observed network nearly matches the fully observed case (Fig. 6f, left), but the gap between the two is large at strong weights (Fig. 6f, right), and this gap, due to bias errors, is largely immune to data volume.

## Discussion

We have shown that estimates of functional or effective connectivity in a network obtained from activity at the nodes can diverge substantially, and in systematically biased ways, from actual connectivity in a recurrent circuit if the weights are strong. Our definition of strong weights encompasses not only circuits that can hold states without external inputs, as required for memory, but also sensory circuits that moderately or strongly amplify their inputs. Note too that although the issues we illustrate arose in the context of inferring connectivity within a network from spike data, they may arise more generally in other domains: when considering other kinds of activity data, or connectivity across networks, as when inferring area-wise connectivity from fMRI activity data.

The divergence between estimated and true connectivity cannot be addressed by collecting more data: the explaining-away errors found in strongly coupled networks are biases that do not shrink with data volume, unlike the noise-based variance errors that dominate at weak connectivity. Moreover, in the strongly coupled regime, estimates of functional connectivity from sophisticated inference algorithms are generally not much better than simple inverse correlation; both provide relatively poor estimates of the true connectivity. These results point to the need for considerable caution when using activity to construct an estimate of structural connectivity in even moderately strongly coupled recurrent circuits.

Arguably, for many purposes it is sufficient to merely infer functional connectivity, without relating it to structural connectivity. This can be the case when the goal is to use activity-based inference to ultimately predict future activity, as done for instance in ^{7;8}. However, even in these cases the functional connectivity is then often referred to, and used, as a proxy for actual connectivity, and here systematic biases in estimation can lead to important errors in our understanding of circuit and circuit-development mechanisms. For example, in our simple ring network, purely local connectivity can give rise to periodic activity patterns. One could imagine that, developmentally, such a circuit with periodic activity patterns could arise without the need for activity-dependent plasticity. By contrast, in a sideband model the sidebands are related to the periodic activity patterns, suggesting that activity must have led to connectivity, a developmentally distinct process.

Despite the pessimistic results shown here, the challenge of discovering the connectivity in recurrent neural networks is difficult but not insurmountable: rapid advances in automated segmentation for connectomics mean that one can imagine obtaining connectivity matrices for complete circuits in the not-too-remote future. However, the costs remain large, and determining the link between structure and function after obtaining a connectivity matrix still requires an inference step or a model of how activity emerges from connectivity. More immediately accessible and directly interpretable are experiments that rely on perturbation of the system. Clearly, if every single connection could be individually perturbed, that would provide detailed causal connectivity information.

But even much lower-dimensional perturbations can be helpful: appropriately designed perturbations can help disambiguate between competing dynamical models with even subtle differences^{62–64}. More generally, given that variance errors can be averaged away, and that, as we have shown, inference with strong noise is superior to inference with weaker noise given enough data (inference is better when the ratio of noise to recurrent weight strength is high), we expect that driving systems strongly with noise and then performing even simple correlational inference (with lots of data) should yield much better estimates of true connectivity than when the system is not noise-driven, even if sophisticated inference algorithms are used in the latter case. In other words, studying a recurrent dynamical system with fixed points as it is continuously driven out of equilibrium by noise should be much more informative about its structure than studying it close to or at equilibrium^{54}. We have additionally shown above that studying far-from-equilibrium dynamics in strongly coupled recurrent networks, e.g. by watching the network evolve from a fully silenced to a patterned state, can provide data for much better inference in strongly recurrent networks.

Finally, when the inference model exactly matches the dynamical model generating the data, all neurons are observed, and the mapping from circuits to activity is injective, it can be possible to exactly estimate connectivity from activity (e.g. in Ising-on-Ising inference for certain architectures). While it is impossible for any inference model to exactly match the true neural circuit dynamics, and the mapping from circuits to activity states need not be injective, improvements in how well the inference model mimics the dynamics of the system can shrink the gap between functional and structural connectivity.

## Methods

### Generative network model

Here we outline the dynamical model that was used to generate the data in most cases. Other generative models that we used are described in the Supp. Info.

#### Circuit

Neurons are arranged on a ring. The outgoing synaptic weights *W*_{ij} from each neuron to all of the others around the ring have a local Mexican-hat (difference-of-Gaussians) shape:

$$W_{ij} = e^{-d_{ij}^2/(2\sigma_1^2)} - a\, e^{-d_{ij}^2/(2\sigma_2^2)},$$

where *d*_{ij} is the distance (in neurons) between neurons *i* and *j*. We set *a* = 1.0005 > 1 to make the weights purely inhibitory, which prevents self-excitation and allows dynamical stability. Parameters used: *σ*_{1} = 6.98 and *σ*_{2} = 7 (in neurons).
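A minimal sketch of this weight matrix, assuming the 100-neuron ring and the stated parameter values (`mexican_hat_weights` is an illustrative name, not the paper's code):

```python
import numpy as np

def mexican_hat_weights(N=100, sigma1=6.98, sigma2=7.0, a=1.0005):
    """Circulant difference-of-Gaussians weight matrix on a ring.

    With sigma1 < sigma2 and a > 1, every off-diagonal weight is negative
    (purely inhibitory), and each row is a rotation of the same shape.
    """
    i = np.arange(N)
    # ring (wrap-around) distance between each pair of neurons
    straight = np.abs(i[:, None] - i[None, :])
    d = np.minimum(straight, N - straight)
    W = np.exp(-d**2 / (2 * sigma1**2)) - a * np.exp(-d**2 / (2 * sigma2**2))
    np.fill_diagonal(W, 0.0)  # no self-coupling
    return W
```

The slightly mismatched Gaussian widths with *a* just above 1 give a weight profile that is everywhere inhibitory but least inhibitory at intermediate distances, the Mexican-hat surround structure that supports the periodic pattern.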

#### Neural dynamics

Dynamics are updated in discrete time, with a time-step Δ*t* of size 0.1 ms. The vector of inputs to the 100 neurons is:

$$\mathbf{g}(t) = r\,\mathbf{W}\,\mathbf{s}(t) + \mathbf{b}(t),$$

where **s**(*t*) is the vector of synaptic activations of the neurons, **W** is the recurrent connectivity defined above, and **b** are the feedforward inputs. The relative influence of **W** is scaled by the weight strength *r*. **b**(*t*) is given by **b**(*t*) = *b*(**1** + **ξ**(*t*)), where *b* = 0.001 provides a uniform excitatory drive, and **ξ**(*t*) is a multiplicative private Gaussian white noise per neuron, with zero mean and s.d. *σ*_{ξ} = 0.3, resulting in a Poisson-like variance proportional to the mean activation. This noise is injected only with probability ≈ 0.07 in each time step of the discrete-time equations, since with more noise the dynamics lose coherence.

If at any time step (*t*, *t* + Δ*t*] the input *g*_{i} to neuron *i* exceeds threshold Θ, it emits a spike. **σ**(*t*) is the binary vector of spikes from the network at time step (*t*, *t* + Δ*t*]. The synaptic activation of neuron *i*, through which it affects the other neurons in the circuit, is incremented by an amount Δ*t* whenever that neuron spikes, and otherwise decays exponentially according to

$$\tau \frac{ds_i}{dt} = -s_i,$$

with a synaptic time constant *τ* = 10 ms.
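A sketch of these discrete-time dynamics follows. The function name, the fixed threshold value, and the initial-condition scale are illustrative assumptions; in the paper the threshold Θ is tuned per weight strength rather than fixed.

```python
import numpy as np

def simulate(W, r, b=1e-3, theta=0.05, sigma_xi=0.3, noise_prob=0.07,
             dt=0.1, tau=10.0, T=5000, seed=0):
    """Sketch of the generative dynamics: g(t) = r*W @ s(t) + b(t).

    A neuron spikes when its input exceeds `theta`; synaptic activations s
    are incremented by dt on a spike and otherwise decay exponentially with
    time constant tau (dt and tau in ms). Noise is multiplicative, private,
    and injected with probability `noise_prob` per neuron per step.
    """
    rng = np.random.default_rng(seed)
    N = W.shape[0]
    s = rng.random(N) * 0.01               # random initial activations
    spikes = np.zeros((T, N), dtype=bool)
    for t in range(T):
        inject = rng.random(N) < noise_prob
        xi = rng.normal(0.0, sigma_xi, N) * inject
        b_vec = b * (1.0 + xi)             # feedforward drive with noise
        g = r * (W @ s) + b_vec            # total input
        spiked = g > theta                 # threshold crossing -> spike
        spikes[t] = spiked
        s = s * np.exp(-dt / tau)          # exponential synaptic decay
        s[spiked] += dt                    # increment on spike
    return spikes
```

Increasing `r` at fixed architecture moves the network from the weakly to the strongly coupled regime, which is the axis along which inference quality is studied throughout the paper.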

#### Generating spike data

To move from the weakly to the strongly coupled regime, we increase the weight strength *r*. The threshold Θ is adjusted at each *r* to hold roughly fixed (within 16.0 ± 0.1 ms) the average inter-spike interval of the combined across-population spike train.

For each parameter setting, we initialize the network dynamics with random activations **s**, wait for it to equilibrate, and then collect a total volume of 10^{8} spikes from the network.

### Characterizing the dynamical regime of the network

#### Noise correlations

To measure noise correlations across the network, we need to hold the signal fixed. Removing the feedforward input *b* from a neuron forces a valley of activity at its location, and holds the activity pattern stationary. We then generate spikes from the network and keep them at the generative time resolution (coarser time-binning would artificially amplify the correlations computed over time). We then calculate the correlation coefficient between the (unbinarized) spike-count trains of every neuron pair, excluding the suppressed neuron. Strong positive and negative values both indicate correlated activity, so we take only the magnitude.

#### Activity pattern strength

The phase of the spatially periodic activity pattern randomly wanders around the network over time. But its degree of spatial periodicity will be captured by activity correlations regardless of its phase. We start with spike data that is not binned beyond the generative resolution of 0.1 ms, and compute the correlation coefficient matrix between all neuron pairs (replacing the 1's on the diagonal with 0's to ignore self-correlation). Since the pattern can migrate around the network while the data is being collected, these correspond to signal correlations. A global spatial periodicity of the pattern is reflected as a periodicity in each row/column of this matrix. We calculate the degree of this periodicity in each row by averaging its autocorrelation at circular shifts of 1, 2 and 3 periods (the pattern has 4 periods around the ring), then averaging this across all rows. This measure will be near 0 for uncorrelated spike trains, and approaches 1 in the limit of a 4-period spatial pattern in spike-time correlations.
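The periodicity measure can be sketched as below; `pattern_strength` is an illustrative name, and the normalization of the circular autocorrelation is an assumption of this sketch.

```python
import numpy as np

def pattern_strength(spikes, n_periods=4):
    """Degree of spatial periodicity in pairwise signal correlations.

    spikes: (T, N) array of spike counts per time bin. We compute the N x N
    correlation matrix, zero its diagonal, then average each row's
    normalized circular autocorrelation at shifts of 1, 2 and 3 pattern
    periods, and finally average across rows.
    """
    C = np.corrcoef(spikes.T)
    np.fill_diagonal(C, 0.0)          # ignore self-correlation
    N = C.shape[0]
    period = N // n_periods           # pattern period, in neurons
    scores = []
    for row in C:
        row = row - row.mean()
        denom = np.dot(row, row)
        if denom == 0:
            continue                  # silent / constant row carries no signal
        ac = [np.dot(row, np.roll(row, k * period)) / denom for k in (1, 2, 3)]
        scores.append(np.mean(ac))
    return float(np.mean(scores))
```

For a drifting spatial pattern the correlation matrix rows are near-sinusoidal, so shifts of whole periods realign them and the score approaches 1; for uncorrelated trains the score hovers near 0.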

#### Diffusivity of activity pattern phase

Over time, the phase of the activity pattern migrates randomly around the network. To measure the diffusion coefficient of this motion, we choose the Fourier component of the vector of neural inputs **g**(*t*) with the spatial frequency of the activity pattern (4 cycles around the ring). The time-varying phase *ϕ*(*t*) of this component is the trajectory of the pattern phase. The slope of the linear fit of 〈(*ϕ*(*t* + *τ*) − *ϕ*(*t*))^{2}〉_{t} against *τ* then gives the diffusion coefficient. For weights below *r* = 0.0125, **g** was too noisy to extract any periodic pattern, hence diffusivity was not computed for these.
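A sketch of this phase-diffusivity estimate; the function name, the lag range, and the use of `np.unwrap`/`np.polyfit` are illustrative choices, not the paper's code.

```python
import numpy as np

def phase_diffusivity(g_traj, n_periods=4, max_lag=200):
    """Diffusion coefficient of the activity pattern's phase.

    g_traj: (T, N) array of neural input vectors g(t) over time. We project
    onto the spatial Fourier mode at the pattern frequency (n_periods cycles
    around the ring), unwrap its phase, and fit the slope of the mean
    squared phase displacement against the time lag.
    """
    T, N = g_traj.shape
    # complex amplitude of the pattern-frequency spatial mode at each t
    comp = g_traj @ np.exp(-2j * np.pi * n_periods * np.arange(N) / N)
    phi = np.unwrap(np.angle(comp))   # continuous phase trajectory
    lags = np.arange(1, max_lag + 1)
    msd = np.array([np.mean((phi[l:] - phi[:-l]) ** 2) for l in lags])
    slope = np.polyfit(lags, msd, 1)[0]   # MSD grows ~ linearly for diffusion
    return slope
```

For a pure random-walk phase with per-step variance *s*², the fitted slope recovers *s*² (in squared radians per time step).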

### Measuring inference error

The ground-truth and inferred weights are the elements of **W** and **Ŵ**, ignoring the diagonals. Since the inference model is generally different from the generative model, **Ŵ** has an arbitrary overall scale factor with respect to **W**. So before calculating the inference error, the first step is to re-scale **Ŵ** to match the scale of **W**, which we do in the following way.

**W** is circulant, so each row is a rotation of the same Mexican-hat shaped weight vector **ω**. Because of noise, though, the rows of **Ŵ** are not exact rotated copies of each other: each row *i* is a different noisy estimate **ω̂**_{i} of **ω**. We re-scale **Ŵ** such that its average weight-shape estimate has the least *l*_{1} deviation from the true shape **ω**. Averaging the noisy **ω̂**_{i}'s and using the *l*_{1} metric (which is more noise-tolerant than *l*_{2}) provides greater accuracy in the scale-matching in the face of noise.

However, this procedure cannot be applied in situations where the weight matrix to be inferred is not circulant. This is the case when inferring a partially observed circuit (a sub-matrix of **W**, see Results, Fig. 6a,b), and also when inferring a circuit with a non-circulant, e.g. random, architecture (see Results, Fig. 4c). In these cases we resort to the less noise-tolerant choice of re-scaling the inferred matrix **Ŵ** to have the least *l*_{2} distance from the ground-truth matrix **W**. When we apply this re-scaling, the inference error is also the sine of the angle between the ground-truth and inferred weight vectors in the space of weight parameters.

After **Ŵ** has been re-scaled, the inference error is the magnitude of the (*l*_{2}) distance between the vectors of ground-truth and inferred weights, expressed as a fraction of the vector magnitude of the ground-truth weights:

$$\Delta = \frac{\lVert \hat{\mathbf{W}} - \mathbf{W} \rVert_2}{\lVert \mathbf{W} \rVert_2}.$$

Although the *l*_{2} metric is more noise-sensitive as noted, we choose it here in the final expression, since its properties allow an elegant decomposition of the inference error into variance and bias components, as we describe in the next section.
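The *l*_{2} variant of the re-scaling and error computation can be sketched as follows (`inference_error` is an illustrative name; the sketch compares off-diagonal weights only, per the text).

```python
import numpy as np

def inference_error(W_true, W_hat):
    """Relative l2 inference error after optimal l2 re-scaling of W_hat.

    Off-diagonal weights are compared. W_hat is scaled by the factor c that
    minimizes ||c * W_hat - W_true||_2, so the resulting error equals the
    sine of the angle between the two weight vectors.
    """
    mask = ~np.eye(W_true.shape[0], dtype=bool)   # drop self-couplings
    w = W_true[mask]
    w_hat = W_hat[mask]
    c = np.dot(w, w_hat) / np.dot(w_hat, w_hat)   # least-squares scale
    return np.linalg.norm(c * w_hat - w) / np.linalg.norm(w)
```

Because the scale is optimized out, the error is invariant to the arbitrary overall gain of the inference model and is bounded by 1 (orthogonal estimate).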

#### Variance and bias errors

Let us denote the *N* − 1 elements (ignoring self-coupling) of the common weight vector **ω** of the circulant **W** by *ω*^{α}, and their estimates on row *i* of **Ŵ** by *ω̂*_{i}^{α}. The squared inference error (eq. 4) can thus be expressed as:

$$\Delta^2 = \frac{\sum_{i,\alpha} \left(\hat{\omega}_i^{\alpha} - \omega^{\alpha}\right)^2}{N \sum_{\alpha} \left(\omega^{\alpha}\right)^2}.$$

This can be rewritten in terms of *ω̄*^{α}, the elements of the average estimated weight shape **ω̄** = (1/*N*) Σ_{i} **ω̂**_{i}:

$$\Delta^2 = \Delta_v^2 + \Delta_b^2,$$

where

$$\Delta_v^2 = \frac{\sum_{i,\alpha} \left(\hat{\omega}_i^{\alpha} - \bar{\omega}^{\alpha}\right)^2}{N \sum_{\alpha} \left(\omega^{\alpha}\right)^2} = \frac{\lVert \hat{\mathbf{W}} - \bar{\mathbf{W}} \rVert^2}{\lVert \mathbf{W} \rVert^2}$$

(the numerator sums the variance of the estimates of each *ω*^{α}), and

$$\Delta_b^2 = \frac{\sum_{\alpha} \left(\bar{\omega}^{\alpha} - \omega^{\alpha}\right)^2}{\sum_{\alpha} \left(\omega^{\alpha}\right)^2} = \frac{\lVert \bar{\mathbf{W}} - \mathbf{W} \rVert^2}{\lVert \mathbf{W} \rVert^2}$$

(where **W̄** is the circulant matrix consisting of rotations of the mean estimate **ω̄**).

Δ_{v}, the *variance error*, is the error due to the variance among the different noisy estimates **ω̂**_{i}; Δ_{b}, the *bias error*, is the error due to deviation of the mean estimate **ω̄** from the true weights **ω**. We can geometrically interpret these quantities in the space of all *N*(*N* − 1) weight parameters. **W**, **Ŵ** and **W̄** are three points in this space representing the true weights, inferred weights, and mean inferred weights. If we measure lengths in units of ‖**W**‖, the length of the gap between **Ŵ** and **W** is the total inference error Δ, and can be decomposed into the gap due to variance, from **Ŵ** to **W̄**, with length Δ_{v}, and the gap due to bias, from **W̄** to **W**, with length Δ_{b}. These two gaps are orthogonal, as evidenced by their Pythagorean relationship (eq. 9) and their zero inner product:

$$\left(\hat{\mathbf{W}} - \bar{\mathbf{W}}\right) \cdot \left(\bar{\mathbf{W}} - \mathbf{W}\right) = 0.$$

Hence, the total inference error is a vector sum of orthogonal components due to variance and bias.

The relative contribution of bias to the inference error can thus be measured by the angle between the vectors of the total inference error and the variance error, as a fraction of 90°:

$$\frac{\arcsin\left(\Delta_b / \Delta\right)}{90^{\circ}}.$$

These variance and bias errors cannot be computed for non-circulant circuits.
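The decomposition for a circulant ground truth can be sketched as below. This is an illustration under stated assumptions: the re-scaling step is omitted, and `variance_bias_decomposition` is an assumed name.

```python
import numpy as np

def variance_bias_decomposition(W_true, W_hat):
    """Split the inference error for a circulant W_true into variance and bias.

    Each row i of W_hat, circularly shifted by -i, is a noisy estimate of
    the common weight shape omega. The variance error measures scatter of
    these estimates around their mean; the bias error measures deviation of
    the mean from omega. They add in quadrature to the total error.
    """
    N = W_true.shape[0]
    # align rows with the common shape; position 0 then holds the
    # (ignored) self-coupling term, so we drop it
    aligned = np.array([np.roll(W_hat[i], -i) for i in range(N)])[:, 1:]
    omega = W_true[0, 1:]
    mean_shape = aligned.mean(axis=0)
    norm2 = N * np.dot(omega, omega)          # ||W_true||^2 over kept elements
    var2 = np.sum((aligned - mean_shape) ** 2) / norm2
    bias2 = N * np.dot(mean_shape - omega, mean_shape - omega) / norm2
    total = np.sqrt(np.sum((aligned - omega) ** 2) / norm2)
    return np.sqrt(var2), np.sqrt(bias2), total
```

The Pythagorean identity holds exactly here because the cross term between row scatter and mean deviation sums to zero.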

### Discriminating circuits using activity data

#### Constructing the alternative circuit

The circuit inferred at strong weights using inverse Ising needs to be modified before it can be used as a generative circuit with the same dynamics as our original generative model.

We first calculate the mean weight-shape vector of this inferred circuit by circularly shifting the matrix rows to align them, then averaging. Next, we set the positive parts of this mean shape to 0 (to prevent self-excitation and ensure stability under the generative dynamics), and re-scale it to match its minimum with that of the ground-truth weight shape. Finally, we create a circulant matrix from this weight shape.

As with the original circuit, firing thresholds Θ for this circuit are tuned at each weight strength to set the average network inter-spike-interval at 16.0 ± 0.1 ms. We can now generate spike data from this circuit using the original dynamics.

#### Selecting a well-sampled subspace of neural activity

Even with binary spike data, there are 2^{N} possible spike words, or states, of an *N*-neuron network, prohibitively large for large *N* (in our case, with *N* = 100, it is 2^{100} ≈ 10^{30}). Each time-bin of spikes we collect from the network is a state. The number of such states we obtain over 6 hours at a resolution of 0.1 ms, binned at 10 ms, is only of the order of 10^{6}. Thus, even if in a given setting the network produces only a subset of all possible states, those states are still severely undersampled by data collected in any experiment of reasonable duration, and statistics used to characterize the data distribution, such as entropy, will be biased^{65;74}.

We can instead choose a segment of *n* adjacent neurons from our ring network, whose smaller *n*-dimensional state distribution we can make sure to have sampled well with our data. But this fails to utilize the data we collected from the other neurons. We can address this by exploiting the rotation-invariance of our network, and its activity over time. Sliding our *n*-neuron window around the ring network, we can pool together the observed spike vectors of neurons 1 through *n*, 2 through *n* + 1 etc. This lets us use the full dataset to build a better-sampled distribution of adjacent *n*-neuron states of the network.
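The window-pooling trick can be sketched as follows; `pooled_word_distribution` and the integer word encoding are illustrative choices for this sketch.

```python
import numpy as np
from collections import Counter

def pooled_word_distribution(spikes, n):
    """Pool n-adjacent-neuron spike words over all ring rotations.

    spikes: (T, N) binary array. Sliding an n-neuron window around the ring
    (exploiting the network's rotation invariance), every time bin
    contributes N windowed words, yielding a far better-sampled empirical
    distribution over the 2**n possible n-neuron states.
    """
    T, N = spikes.shape
    counts = Counter()
    for start in range(N):
        idx = (start + np.arange(n)) % N        # window wraps around the ring
        words = spikes[:, idx]
        keys = words @ (1 << np.arange(n))      # encode n-bit word as integer
        counts.update(keys.tolist())
    total = T * N
    return {k: v / total for k, v in counts.items()}
```

Each time bin is counted *N* times (once per window position), so the pooled distribution is built from *T* × *N* word samples rather than *T*.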

*n* should be chosen as large as possible while ensuring that its state distribution is sampled sufficiently in our data. This largest possible value is dictated by the particular statistical measure we want to accurately estimate, as explained in the following sections.

#### Characterizing neural activity using information theory

##### Entropy

We can verify that a distribution has been well-sampled for unbiased entropy estimation by checking that the entropies computed with increasing fractions of the data converge as we approach the total data volume. *n* = 22 is the largest value that allows this convergence at all weights (see Supp. Fig. S7). However, as described in the following sections, requirements for other measures constrained *n* to be lower.

##### Relative entropy

The dissimilarity between the data distributions *p***W** and *p***W**′, considering the former as the ‘true’ distribution, is measured by their relative entropy (KL divergence):
where ** σ** runs over the

*n*-neuron spike vectors that constitute the support of

*p*

**W**.
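A sketch of the relative-entropy computation over sampled word distributions (`relative_entropy` is an illustrative name; the support condition discussed later in this section is enforced explicitly):

```python
import numpy as np

def relative_entropy(p, q):
    """KL divergence D(p || q) in bits between two spike-word distributions.

    p, q: dicts mapping spike-word keys to probabilities. Every word in the
    support of p must also appear with nonzero probability in q, otherwise
    the divergence is undefined for sampled distributions.
    """
    d = 0.0
    for word, pw in p.items():
        if pw == 0.0:
            continue
        if word not in q or q[word] == 0.0:
            raise ValueError("support of p not covered by q")
        d += pw * np.log2(pw / q[word])
    return d
```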

##### Fisher information

This can be used to quantify the amount of information about a generative model that is contained in the data it produces. It is measured by the sensitivity of the data distribution to changes in the model parameters.

It is not feasible to compute the Fisher information against variations of all the circuit parameters (the unique weights in **W**). Therefore, we consider instead a single-parameter family of models that passes through the true circuit **W** and the non-local circuit **W**′:

$$\mathbf{W}(\theta) = (1 - \theta)\,\mathbf{W} + \theta\,\mathbf{W}',$$

so that **W**(*θ* = 0) = **W**.
The Fisher information *I*(*θ* = 0) about the true model **W** = **W**(*θ* = 0) can be written in terms of the KL divergence between the data distribution *p*_{W} (= *p*_{0}) that it produces and the distribution *p*_{dθ} that a neighbouring model **W**(*dθ*) produces:

$$D\!\left(p_0 \,\Vert\, p_{d\theta}\right) = \sum_{\boldsymbol{\sigma}} p_0(\boldsymbol{\sigma}) \log \frac{p_0(\boldsymbol{\sigma})}{p_{d\theta}(\boldsymbol{\sigma})} \approx \tfrac{1}{2}\, I(\theta = 0)\, d\theta^2,$$

where **σ** again runs over the *n*-length binary spike vectors.

We approximate this measure by constructing the circuit **W**(*dθ*) as a neighbouring model in the family. At each weight strength, we again adjust its firing threshold to maintain an average network inter-spike interval of 16.0 ± 0.1 ms as before, generate spike data, binarize counts, and compute the KL divergence with respect to the data distribution generated by **W**. The Fisher information is then approximated by:

$$I(\theta = 0) \approx \frac{2\, D\!\left(p_0 \,\Vert\, p_{d\theta}\right)}{d\theta^2}.$$

##### Likelihood ratio of circuit models

The log likelihood that the observed data distribution *p*_{W} was produced by the circuit **W**′ is:

$$\log L(\mathbf{W}') = s \sum_{\boldsymbol{\sigma}} p_{\mathbf{W}}(\boldsymbol{\sigma}) \log p_{\mathbf{W}'}(\boldsymbol{\sigma}),$$

where *s* is the total number of samples in *p*_{W}.

We can collect a second sample *p̃*_{W} from the local circuit to account for sample-to-sample variability, and then calculate the log likelihood ratio of **W** vs **W**′, given data generated from **W**:

$$\log \frac{L(\mathbf{W})}{L(\mathbf{W}')} = s \sum_{\boldsymbol{\sigma}} p_{\mathbf{W}}(\boldsymbol{\sigma}) \log \frac{\tilde{p}_{\mathbf{W}}(\boldsymbol{\sigma})}{p_{\mathbf{W}'}(\boldsymbol{\sigma})}.$$
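A sketch of this likelihood-ratio computation over sampled word distributions (`log_likelihood_ratio` and its dict-based arguments are illustrative conventions of this sketch):

```python
import numpy as np

def log_likelihood_ratio(p_data, p_model_a, p_model_b, s):
    """Log likelihood ratio of model A vs model B for s samples.

    p_data, p_model_a, p_model_b: dicts mapping spike-word keys to
    probabilities; p_data is the empirical distribution of the observed
    data. Every word in the support of p_data must appear in both model
    distributions. Returns s * sum_w p_data(w) * log[p_A(w) / p_B(w)].
    """
    llr = 0.0
    for w, pw in p_data.items():
        if pw == 0.0:
            continue
        llr += pw * (np.log(p_model_a[w]) - np.log(p_model_b[w]))
    return s * llr
```

A positive value means the data favor model A; when the data distribution equals model A, the ratio reduces to *s* times the KL divergence from A to B (in nats).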

In order for the relative entropy, Fisher information and likelihood ratio to be defined, the distributions must be sampled well enough that each spike state **σ** that occurs in *p*_{W} also occurs in *p̃*_{W}, *p*_{W′} and *p*_{dθ}. With the total volume of spike data that we collected from the network, the largest value of *n* for which this was possible is 10. Thus, these distributions are over the 2^{10} = 1024 possible 10-neuron binary spike-state vectors.

### Inferring a partially observed network

For each of a range of sub-population sizes *n*, we randomly select multiple *n*-neuron sub-populations from the 100-neuron network. We use the spike data from each such sub-population to infer its corresponding *n×n* connectivity submatrix.

In order to have statistically informative results, for a given sub-population size *n* we want to choose enough sub-networks that together they cover the whole network reasonably well. Each *n*-neuron sub-network covers a fraction of the weight matrix of the entire network (ignoring diagonal terms). Using this, we calculate the number of randomly selected sub-networks that ensures, with probability *p*, that all synapses of the full network have been sampled at least once. We collect enough samples to have *p* = 0.99 for each sub-population size *n*.

## Acknowledgements

We thank Matthias Bethge, Peter Dayan, Ingmar Kanitscheider, Berk Gerçek and Rishidev Chaudhuri for helpful discussions. Funding was provided by the HHMI through the Faculty Scholars program and the Simons Collaboration on the Global Brain through the Simons Foundation.

## Footnotes

* fiete{at}mit.edu

## References
