## Abstract

A central goal of systems neuroscience is to understand the relationships amongst constituent units in neural populations and their modulation by external factors using high-dimensional and stochastic neural recordings. Statistical models, particularly parametric models, play an instrumental role in accomplishing this goal, because their fitted parameters can provide insight into the underlying biological processes that generated the data. However, extracting conclusions from a parametric model requires that it is fit using an inference procedure capable of selecting the correct parameters and properly estimating their values. Traditional approaches to parameter inference have been shown to suffer from failures in both selection and estimation. Recent development of algorithms that ameliorate these deficiencies raises the question of whether past work relying on such inference procedures has produced inaccurate systems neuroscience models, thereby impairing their interpretation. Here, we used Union of Intersections (UoI), a statistical inference framework capable of state-of-the-art selection and estimation performance, to fit functional coupling, encoding, and decoding models across a battery of neural datasets. We found that, compared to baseline procedures, UoI inferred models with increased sparsity, improved stability, and qualitatively different parameter distributions, while maintaining predictive performance across recording modality, brain region, and task. Specifically, we obtained highly sparse functional coupling networks with substantially different community structure, more parsimonious encoding models, and decoding models that rely on fewer single-units. Together, these results demonstrate that accurate parameter inference reshapes interpretation in diverse neuroscience contexts. The ubiquity of model-based data-driven discovery in biology suggests that analogous results would be seen in other fields.

## 1 Introduction

Neuroscience, like many fields of biology, is undergoing a rapid growth in the size and complexity of experimental and observational data [1, 2]. Realizing the benefits of these advances in data acquisition requires improvements in the statistical models characterizing the data, as well as the inference procedures used to fit those models [3, 4]. For example, parametric models, such as generalized linear models, are appealing because the model parameters can be interpreted to gain insight into the underlying biological processes that generated the data [5–8]. However, even for this ubiquitously used class of models, the impact of an inference procedure’s statistical properties on biological interpretation is poorly appreciated.

These issues are particularly salient in systems neuroscience, where parametric models are often used to understand how neural activity is modulated by external factors (e.g., stimuli or a behavioral task) and internal factors (e.g., other neurons) [9, 10]. The fitted parameter values, therefore, specify which factors are important in modulating neural activity, and how important they are. The specific relationships that a parametric model describes ultimately frame how the model will be interpreted in a neuroscientific context, emphasizing the importance of accurate parameter inference.

For example, functional coupling models (Fig 1a) capture the statistical dependencies between different functional units in the brain, at scales ranging from single units to functional areas [5–8, 11–14]. These models can be used to construct networks [15, 16], which could be analyzed with an assortment of tools from graph theory to characterize the population [17, 18], related to structural connectivity [19], used to assess directed influence amongst neurons (i.e., effective connectivity) [20, 21], or related to external factors such as behavior, genetics, aging, or psychiatric conditions [22–24]. Encoding models map the dependence of a brain signal (e.g., neuronal spikes) on external factors, such as stimuli (Fig 1b) [25–27]. An example encoding model is a spatio-temporal receptive field of, for example, a visual cortex neuron, which maps the image space to the neuronal response (Fig 1b, right) [28–30]. More refined encoding models of neural population data can be used to test computational theories of neural coding [31–33]. On the other hand, decoding models map the brain signal to external factors, using the activities in, e.g., a neural population, to predict a stimulus or task-relevant behavioral condition (Fig 1c) [34–40]. A common linear decoding model is the extraction of a hyperplane in the neural activity space, which provides a decision boundary for one of two behavioral conditions or stimuli (Fig 1c, right: *s*_{1}, *s*_{2}) [22, 41, 42], though recent work has explored more advanced decoders, using artificial neural networks [39] or predictive latent representations [43]. Using decoding models for brain-computer interfaces has both clinical uses and scientific implications for understanding learning and motor control [44, 45]. 
Since these models are used to make scientific conclusions about the function of the brain, understanding the stability, accuracy, and parsimony of the inference procedures and resulting models is of paramount importance.

The utility of parametric models hinges on the assumption that the inference procedure used to fit them selects the correct parameters (i.e., specified as zero or non-zero) and properly estimates their values. The statistical consequences of improper selection are false positives or false negatives (Fig 1d), while poor estimation results in high bias (Fig 1e: e.g., *β*_{1}) or high variance (Fig 1e: e.g., *β*_{6}). The neuroscientific consequences of statistical inference lie in the interpretation of the fitted parametric model. Selection informs which internal and external factors are relevant for predicting neural activity, and estimation specifies their relative importance. Thus, validating that an inference procedure can reliably select and estimate a model’s parameters is vital to ensure that they motivate correct conclusions about neural activity.

These issues imply that, when fitting parametric models in a scientific context, multiple goals beyond predictive performance must be balanced to produce a scientifically meaningful model. In particular, achieving a parsimonious model, which uses the fewest features needed to sufficiently predict the response variable (i.e., finding the “simplest” adequate model), has long served as a goal in statistical model selection [46]. One approach to model parsimony relies on the imposition of sparsity during feature selection, which has the added benefit of identifying a small subset of predictive features, facilitating the interpretability of the model [47, 48]. This is particularly relevant in high-dimensional settings where there are few task-relevant features and strong priors from domain knowledge for selection may not exist. Another desired property is stability, or the reliability of a machine learning algorithm when its inputs are slightly perturbed [49, 50]. For a model to be interpretable, its parameters must be robust to the often noisy processes that generated the data. Thus, encouraging stability in a model’s parameter inference procedure will ensure that the features describing the relevant signal are selected and their correct contributions are properly estimated [51, 52]. Until recently, inference procedures that sufficiently balanced accurate selection and estimation, predictive performance, and stability were lacking.

The recently introduced Union of Intersections (UoI) is a statistical inference framework based on stability principles which enhances inference in a variety of common parametric models [53]. The properties characterizing UoI models — sparsity, stability, and predictive accuracy — are well-suited to data-driven discovery in neuroscience, due to the high dimensionality and many sources of variability in its datasets. Thus, UoI can be leveraged to assess whether baseline approaches to parameter inference in common models are susceptible to improper feature selection and estimation, and if so, assess the consequences for model interpretability in a neuroscience context.

In this work, we used the UoI framework to fit functional coupling, encoding, and decoding models to diverse neural data in an effort to elucidate the impacts of precise selection and estimation on neuroscientific interpretation. We found that, compared to baseline procedures, we obtained models with enhanced sparsity, improved stability, and significantly different parameter distributions, while maintaining predictive performance across recording modality, brain region, and task. Specifically, we obtained highly sparse coupling models of rat auditory cortex, macaque V1, and macaque M1 without loss in predictive performance. These models were used to construct functional networks that exhibited enhanced modularity and decreased small-worldness. We built parsimonious encoding models of mouse retinal ganglion cells and rat auditory cortex that more tightly matched with theory. These models were able to predict held-out neural responses with parameters that were as simple as possible, but no simpler. Lastly, we decoded task-relevant external factors from rat basal ganglia activity using fewer single units than baseline models. Overall, by emphasizing accurate selection and estimation during the inference of parametric neural models, we constructed equally predictive models that reshaped neuroscientific interpretation across a diverse set of neural data.

## 2 Methods

Our goal is to demonstrate how the statistical properties of inference algorithms impact the fitting and interpretation of diverse parametric models commonly used in neuroscience. The main tools we use for this purpose are algorithms based on the Union of Intersections framework [53–55]. Given the novelty of Union of Intersections, we first describe it conceptually, which also serves to introduce other relevant background. We then briefly summarize a comparison of a specific UoI algorithm, UoI_{Lasso}, versus other algorithms on a synthetic dataset to motivate UoI as a superior selection and estimation algorithm for subsequent application to neural data in the Results section.

### 2.1 The Union of Intersections framework balances sparsity, stability, and predictive performance

Union of Intersections (UoI) is not a single method or algorithm, but a flexible framework into which other algorithms can be inserted for enhanced inference. In this work, we apply the UoI framework to generalized linear models, focusing on linear regression (UoI_{Lasso}), Poisson regression (UoI_{Poisson}), and logistic regression (UoI_{Logistic}). UoI variants of other procedures also exist, such as non-negative matrix factorization [54] and column subset selection [53].

Consider the general problem of mapping a set of *p* features **x** ∈ ℝ^{p×1} to a response variable *y* ∈ ℝ, of which we have *N* samples {(**x**_{i}, *y*_{i})}_{i=1}^{N}. For convenience, we focus on linear models, which require estimating *p* parameters *β* ∈ ℝ^{p×1} that linearly map **x**_{i} to *y*_{i}. We describe the UoI framework in this context, which involves the algorithm UoI_{Lasso}. The steps we detail, however, extend naturally to other penalized generalized linear models [56]. Typically, the mapping in linear models is corrupted by i.i.d. Gaussian noise *ϵ*:

$$y_i = \boldsymbol{\beta}^T \mathbf{x}_i + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2). \tag{1}$$

The parameters *β* can be inferred by optimizing the traditional least squares error on *y*:

$$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min} \sum_{i=1}^{N} \left(y_i - \boldsymbol{\beta}^T \mathbf{x}_i\right)^2, \tag{2}$$

where *i* indexes the *N* data samples. The UoI framework combines two techniques — regularization and ensemble methods — to balance sparsity, stability, and predictive performance, thereby improving on the traditional least squares estimate (Fig 2a).

Structured regularization, or the inclusion of penalty terms in the objective function to restrict the model complexity, can be useful when a subset of the *β*_{i} are exactly equal to zero, i.e., *β* is sparse. Sparsity implies that some features are not relevant for predicting the response variable. This assumption is often useful for data-driven discovery in biological settings, particularly for framing the interpretation of the model in the context of physical processes that generated the data. The identification of which *β*_{i} are non-zero can be viewed as a feature selection (or more generally, model selection) problem [48]. A common regularization penalty used for feature selection is the lasso penalty |*β*|_{1}, or the *ℓ*_{1}-norm applied to the parameters [47]. For the case of linear regression, this creates an optimization problem of the form

$$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min} \sum_{i=1}^{N} \left(y_i - \boldsymbol{\beta}^T \mathbf{x}_i\right)^2 + \lambda |\boldsymbol{\beta}|_1. \tag{3}$$

Solving Eq (3) returns parameter estimates with some sparsity, provided that *λ* is appropriately chosen (Fig 2a, top). Typically, *λ*, the degree to which feature sparsity is enforced, is unknown and must be determined through cross-validation or a penalized score function such as the Bayesian information criterion (BIC) [46] across a set of *J* hyperparameters {λ_{1}, …, λ_{J}}. Importantly, solving the lasso problem simultaneously performs model selection (identifying the non-zero features) and model estimation (determining the specific values of those parameters). However, the application of the lasso penalty suffers from shrinkage [47], or a parameter bias that erroneously reduces the magnitudes of the parameters (Fig 2a, top: compare opacity of parameter estimates), and often does not correctly identify the true non-zero parameters (Fig 2a, top: false positives).
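For an orthonormal design, the lasso solution has a closed form that makes both effects explicit: each least squares coefficient is shrunk toward zero by λ (the shrinkage bias), and coefficients smaller than λ are set exactly to zero (selection). A minimal numpy sketch, using illustrative coefficient values rather than any fit from this paper:

```python
import numpy as np

def soft_threshold(b, lam):
    """Lasso solution for an orthonormal design: shrink each OLS
    coefficient toward zero by lam; coefficients with magnitude
    below lam are set exactly to zero."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

# Hypothetical OLS estimates (values chosen purely for illustration).
b_ols = np.array([2.5, -1.2, 0.3, 0.0, -0.05])

for lam in [0.0, 0.5, 1.5]:
    print(lam, soft_threshold(b_ols, lam))
```

With λ = 0.5 the small coefficients vanish (selection) while the surviving ones are biased toward zero by exactly 0.5 (shrinkage).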

On the other hand, ensemble procedures (e.g., bagging and boosting [57, 58]) aggregate model fits across resamples of the data to improve the stability of parameter estimates (Fig 2a, bottom). The more stable parameter estimates result in improved predictive accuracy. This is particularly desirable in biological settings, where model aggregation ensures that the relevant signal in noisy data is reflected in the parameter estimates. However, ensemble procedures do not perform feature selection.

UoI separates model selection and model estimation into two stages, with each stage utilizing ensemble procedures to promote stability. Specifically, model selection is performed through intersection (compressive) operations and model estimation through union (expansive) operations, in that order. This separation of parameter selection and estimation provides selection profiles that are robust and parameter estimates that have low bias and variance. Fig 2b and 2c provide a visual depiction of the UoI framework. For UoI_{Lasso}, the procedure is as follows:

#### Model Selection

Define the support *S* as the set of non-zero parameters in an estimate of *β*. For each λ_{j}, generate parameter estimates by solving the lasso optimization problem (Eq 3) on *N*_{S} resamples of the data, and calculate a support for each resample-λ_{j} pairing. The intersection step requires that only the features that appear in a sufficient number of resamples are included in the final *stability support S*_{j} for λ_{j}. We depict this in Fig 2b, top, where the light pink bands denote features that are included in the model due to regularization while the dark pink bands denote features that are included in the stability support after the intersection across resamples. The bands are arranged in order of increasing regularization strength (Fig 2b: *x*-axis) and thus sparser (i.e., smaller) support sets (Fig 2b: *y*-axis). Note that the stability support may be calculated with a hard intersection (e.g., Fig 2c: Model Selection) or a soft intersection. In the former case, a feature must appear in the support of every resample to be included in *S*_{j}. In the latter, a feature need only appear in a sufficient fraction of the supports, where that fraction is a hyperparameter.
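The intersection step can be sketched directly: collect the boolean support of each resample's lasso fit for a given λ_{j}, then keep only the features that appear sufficiently often. The support matrix below is hypothetical, chosen to show how the hard and soft intersections differ:

```python
import numpy as np

# Hypothetical supports selected by the lasso on N_S = 4 resamples
# (rows: resamples, columns: features; True = feature selected).
supports = np.array([
    [True,  True,  False, True,  False],
    [True,  True,  False, False, False],
    [True,  True,  True,  True,  False],
    [True,  True,  False, True,  False],
])

# Hard intersection: keep features selected in every resample.
hard = supports.all(axis=0)

# Soft intersection: keep features selected in at least a fraction
# f of resamples, where f is a hyperparameter of the selection module.
f = 0.75
soft = supports.mean(axis=0) >= f

print(hard)  # features 0 and 1 survive
print(soft)  # features 0, 1, and 3 survive
```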

#### Model Estimation

For each stability support *S*_{j}, estimate the model without regularization on *N*_{E} resamples of the data. Each resample contributes the (now fitted) support that had the best predictive performance on its data to the union step (Fig 2b: bottom). Unique supports may have the best performance across multiple resamples (e.g., only three unique supports, *S*_{j−1}, *S*_{j}, and *S*_{m−1}, are included for model averaging in Fig 2b). The best fitted parameter estimates for each resample are then aggregated in the union step according to some metric (e.g., median or mean), resulting in a final parameter estimate (Fig 2b, bottom). For UoI_{Lasso}, this procedure consists of applying Ordinary Least Squares to each stability support and resample combination (Fig 2c: Model Estimation).
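A minimal numpy sketch of the estimation module, under simplifying assumptions: the candidate stability supports are given, OLS is solved with `lstsq`, the per-resample winner is chosen by out-of-bag squared error (the analyses in this paper instead use the BIC for that choice; see Sec 2.5), and the union step takes a median:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y depends on features 0 and 1 only.
N, p = 200, 5
X = rng.normal(size=(N, p))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=N)

# Hypothetical candidate stability supports from the selection module.
supports = [np.array([0, 1]), np.array([0, 1, 2])]

N_E = 10
estimates = np.zeros((N_E, p))
for r in range(N_E):
    train = rng.choice(N, size=N, replace=True)   # bootstrap resample
    test = np.setdiff1d(np.arange(N), train)      # out-of-bag samples
    best_score, best_beta = -np.inf, None
    for S in supports:
        beta_S, *_ = np.linalg.lstsq(X[train][:, S], y[train], rcond=None)
        beta = np.zeros(p)
        beta[S] = beta_S                          # unregularized OLS fit
        score = -np.sum((y[test] - X[test] @ beta) ** 2)
        if score > best_score:
            best_score, best_beta = score, beta
    estimates[r] = best_beta

# Union step: aggregate the per-resample winners (median here).
beta_final = np.median(estimates, axis=0)
print(np.round(beta_final, 2))
```

Features outside every candidate support remain exactly zero in the final estimate, while the selected features receive low-bias OLS values.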

UoI’s modular approach to parameter inference capitalizes on the feature selection achieved by stability selection and the unbiased, low-variance properties of the bagged OLS estimator. Furthermore, UoI’s novel use of model aggregating procedures within its resampling framework allows it to achieve highly sparse (i.e., only using features robust to perturbations in the data) and predictive (i.e., only using features that are informative) model fitting. Importantly, this is achieved without imposing an explicit prior on the model distribution, and without formulating a non-convex optimization problem. Since the optimization procedures across resamples can be performed in parallel, the UoI framework is naturally scalable, a fact that we have leveraged to facilitate parameter inference on larger datasets [55]. The application of UoI_{Lasso} in a data-distributed manner is depicted in Fig 2c. In the selection module, the first column depicts lasso estimates across data resamples for a particular choice of regularization parameter, all of which can be fit in parallel (Fig 2c, Model Selection: left column). In the estimation module, OLS estimates are fit across resamples and supports, which can be done in parallel (Fig 2c, Model Estimation: left column).

### 2.2 Model evaluation

We describe the metrics used to evaluate the models fitted to synthetic data and the neural data. Note that only the selection ratio, predictive performance measures, and Bayesian information criterion were used to evaluate models on neural data where ground truth was unknown.

#### Selection accuracy

The selection accuracy, or set overlap, is a measure of how well the estimated support captures the ground truth support. Define *S*_{β} as the set of features in the ground truth support, *S*_{β̂} as the set of features in the estimated model’s support, |*S*| as the cardinality of *S*, and Δ as the symmetric set difference operator. Then the selection accuracy is defined as

$$\text{selection accuracy} = 1 - \frac{\left| S_{\beta} \, \Delta \, S_{\hat{\beta}} \right|}{\left| S_{\beta} \right| + \left| S_{\hat{\beta}} \right|}.$$

The selection accuracy is bounded in [0, 1], taking value 0 if *S*_{β̂} and *S*_{β} have no elements in common, and taking value 1 iff they are identical.
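A direct implementation of this set-overlap measure, normalizing the symmetric difference by the summed support cardinalities so that the stated bounds hold:

```python
def selection_accuracy(true_support, est_support):
    """Set-overlap measure: 1 minus the symmetric difference between
    supports, normalized by their summed cardinalities.
    Returns 1.0 for identical supports and 0.0 for disjoint ones."""
    s_true, s_est = set(true_support), set(est_support)
    sym_diff = s_true.symmetric_difference(s_est)
    return 1.0 - len(sym_diff) / (len(s_true) + len(s_est))

print(selection_accuracy([0, 1, 2], [0, 1, 2]))  # 1.0
print(selection_accuracy([0, 1, 2], [3, 4]))     # 0.0
print(selection_accuracy([0, 1, 2], [0, 1, 3]))  # one swap: 1 - 2/6
```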

#### Estimation error

The estimation error of the *p* fitted parameters *β̂*, with ground truth parameters *β*, is defined as the root mean square error, or

$$\text{RMSE}(\hat{\boldsymbol{\beta}}, \boldsymbol{\beta}) = \sqrt{\frac{1}{p} \sum_{i=1}^{p} \left( \hat{\beta}_i - \beta_i \right)^2}.$$

#### Estimation variability

The estimation variability for parameter *β*_{i} is defined as the parameter standard deviation *σ*(*β*_{i}). We calculated this quantity by taking the standard deviation of the estimated parameter over *R* resamples of the data:

$$\sigma(\beta_i) = \sqrt{\frac{1}{R} \sum_{j=1}^{R} \left( \hat{\beta}_i^{(j)} - \bar{\beta}_i \right)^2},$$

where *j* indexes the resample and *β̄*_{i} is the mean estimate across resamples. To summarize this measure across all *p* parameters in a model, we took the average, i.e., $\bar{\sigma} = \frac{1}{p} \sum_{i=1}^{p} \sigma(\beta_i)$.

#### Selection ratio

We evaluate the sparsity of estimated models with the selection ratio, or the fraction of parameters fitted to be non-zero:

$$\text{selection ratio} = \frac{k}{p},$$

where *p* is the total number of parameters available to the model and *k* is the number of parameters that a model-fitting procedure explicitly sets non-zero.

#### Predictive performance

We utilized several measures of predictive performance, depending on the model. For linear models (i.e., a generalized linear model with an identity link function), we used the coefficient of determination (*R*^{2}) evaluated on held-out data:

$$R^2 = 1 - \frac{\sum_{i} \left( y_i - \hat{y}_i \right)^2}{\sum_{i} \left( y_i - \bar{y} \right)^2},$$

where *y*_{i} is the ground truth response for sample *i*, *ŷ*_{i} its corresponding predicted value, and *ȳ* the mean of the response variable over trials. *R*^{2} has a maximum value of 1, when the model perfectly predicts the response variable across samples. *R*^{2} values below zero indicate that the model is worse than an intercept model (i.e., simply using the mean value to predict across samples).

For Poisson regression, or a generalized linear model with a logarithmic link function, we utilized the deviance, which is the difference in log-likelihood between the saturated model and the estimated model [59]. The saturated model has parameters specifically chosen to reproduce the observed values. For the Poisson log-likelihood, the expression for the deviance as a function of the estimated parameters is given by

$$D(\hat{\boldsymbol{\beta}}) = 2 \sum_{i=1}^{N} \left[ y_i \log \frac{y_i}{\hat{\mu}_i} - \left( y_i - \hat{\mu}_i \right) \right], \quad \hat{\mu}_i = \exp\left( \hat{\boldsymbol{\beta}}^T \mathbf{x}_i \right),$$

where **x**_{i} and *y*_{i} denote the features and response variable of the model, respectively. Note that lower deviance is preferred, in contrast to the coefficient of determination.
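The Poisson deviance can be computed directly from observed counts and predicted rates; a numpy sketch, using the standard convention that the *y* log(*y*/*μ̂*) term vanishes for *y* = 0:

```python
import numpy as np

def poisson_deviance(y, mu):
    """Poisson deviance: twice the log-likelihood gap between the
    saturated model (which predicts mu = y exactly) and the fitted
    model. The y * log(y / mu) term is taken as 0 when y == 0."""
    y, mu = np.asarray(y, dtype=float), np.asarray(mu, dtype=float)
    term = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
    return 2.0 * np.sum(term - (y - mu))

counts = np.array([0, 1, 3, 2])
print(poisson_deviance(counts, counts + 0.5))   # imperfect fit: > 0
print(poisson_deviance([1, 3, 2], [1, 3, 2]))   # perfect fit: 0.0
```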

For logistic decoding models, we used the classification accuracy on held-out data as the measure of predictive performance.

#### Model Parsimony

We evaluated model parsimony using the Bayesian information criterion (BIC) [46]:

$$\text{BIC} = k \log D - 2 \log \mathcal{L}\left( \hat{\boldsymbol{\beta}} \mid \mathbf{X}, \mathbf{y} \right).$$

Here, *D* is the number of samples, *k* is the number of parameters estimated by the model, and log ℒ is the log-likelihood of the fitted parameters under the data. Thus, the BIC includes a penalty that encourages models to be more sparse (first addend) while still accounting for predictive accuracy (second addend). Importantly, the BIC is evaluated on the data that the model was trained on (rather than held-out data). It is typically used as a model selection criterion (in lieu of, for example, cross-validation). When used as a model selection criterion, the model with lower BIC is preferred.
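For the linear-Gaussian case, the BIC can be computed from residuals alone by plugging the maximum-likelihood noise variance into the log-likelihood. A sketch, where `k` counts the estimated parameters:

```python
import numpy as np

def gaussian_bic(y, y_pred, k):
    """BIC = k * log(D) - 2 * log L, with the Gaussian log-likelihood
    evaluated at the maximum-likelihood estimate of the noise variance."""
    y, y_pred = np.asarray(y, dtype=float), np.asarray(y_pred, dtype=float)
    D = len(y)
    sigma2 = np.mean((y - y_pred) ** 2)          # MLE of noise variance
    log_lik = -0.5 * D * (np.log(2 * np.pi * sigma2) + 1.0)
    return k * np.log(D) - 2.0 * log_lik

y_obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_fit = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
bic_sparse = gaussian_bic(y_obs, y_fit, k=2)
bic_dense = gaussian_bic(y_obs, y_fit, k=4)  # same fit, more parameters
print(bic_sparse < bic_dense)                # True: extra parameters penalized
```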

### 2.3 UoI_{Lasso} achieves superior selection and estimation performance on synthetic data

We evaluated UoI_{Lasso}’s abilities as an inference procedure by assessing its performance on synthetic data generated from a linear model. The performances of UoI and five other inference procedures are depicted in Fig 3: UoI_{Lasso} (black), ridge regression (purple) [48], lasso (green) [47], smoothly clipped absolute deviation (SCAD; red) [60], bootstrapped adaptive threshold (BoATS; blue) [61], and debiased lasso (dbLasso; coral) [62].

The linear model consisted of *p* = 300 total parameters, with *k* = 100 non-zero parameters (thereby having sparsity 1 − *k/p* = 2/3). The non-zero ground truth parameters were drawn from a parameter distribution characterized by exponentially increasing density as a function of parameter magnitude (Fig 3b: gray histograms). We used *N* = 1200 samples generated according to the linear model (1) with noise magnitude chosen such that Var(*ϵ*) = 0.2 × |*β*|_{1}. We report metrics according to their statistics across 100 randomized cross-validation samples of the data.
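The synthetic design above can be sketched in numpy. The exact shape of the ground-truth parameter distribution is an assumption here (an inverted exponential, so density grows with magnitude), not the paper's precise specification:

```python
import numpy as np

rng = np.random.default_rng(42)

p, k, N = 300, 100, 1200

# Ground-truth parameters: k non-zero entries whose magnitudes follow a
# distribution with density increasing toward larger magnitudes (an
# inverted exponential on [0.5, 3]; the paper's exact shape may differ).
mags = np.clip(3.0 - rng.exponential(scale=0.5, size=k), 0.5, 3.0)
signs = rng.choice([-1.0, 1.0], size=k)
beta = np.zeros(p)
beta[rng.choice(p, size=k, replace=False)] = signs * mags

# Design matrix and response, with Var(eps) = 0.2 * |beta|_1 as stated.
X = rng.normal(size=(N, p))
noise_sd = np.sqrt(0.2 * np.sum(np.abs(beta)))
y = X @ beta + noise_sd * rng.normal(size=N)

print((beta != 0).sum(), 1.0 - k / p)  # 100 non-zero parameters, sparsity 2/3
```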

In Fig 3a, we show scatter plots comparing the predicted and actual values of the observation variable on held-out data samples. We visualized how well the inference procedures captured the underlying parameter distribution by comparing the histograms of (average) estimated model parameters (colors) overlaid on the ground truth model parameters (grey) (Fig 3b). We additionally plotted parameter bias and variance, first by comparing the mean estimated value (± standard deviation) against the ground truth parameter value (Fig 3c), and then examining the standard deviation of the parameter estimates as a function of their mean estimated value (Fig 3d).

Fig 3a-d captures the improvements that the UoI framework offers in parameter inference. UoI_{Lasso} is designed to maximize prediction accuracy (Fig 3a) by first selecting the correct features (Fig 3b), and then estimating their values with high accuracy (Fig 3c) and low variance (Fig 3d). By separating model selection and model estimation, UoI_{Lasso} benefits from strong selection (as in BoATS and debiased Lasso), but with the low variability of the structured regularizers (Lasso, SCAD), while alleviating shrinkage with its nearly unbiased estimates.

We quantified the performance of the inference algorithms on synthetic data using a variety of metrics capturing selection performance, bias, variance, and prediction accuracy. UoI_{Lasso} generally resulted in the highest selection accuracy (Fig 3e, first column), parameter estimates with lowest error (Fig 3e, second column) and competitive variance (Fig 3e, third column). In addition, it led to the best prediction accuracy (Fig 3e, third column). UoI_{Lasso} best captured the true model size (Fig 3e, fourth column), avoiding the abundance of false positives suffered by most other inference algorithms. UoI_{Lasso}’s enhanced predictive performance with fewer features resulted in superior model parsimony (Fig 3e, fifth column).

### 2.4 Neural recordings

We sought to demonstrate the impact of improved inference on parametric models across a diversity of datasets, spanning distinct brain regions, animal models, and recording modalities. We used micro-electrocorticography recordings obtained from rat auditory cortex (for coupling and encoding models), single-unit recordings from macaque visual and motor cortices (coupling models), single-unit recordings from isolated mouse retina (encoding models), and single-unit recordings from rat basal ganglia (decoding models). We briefly describe the experimental and preprocessing steps for each dataset.

#### Recordings from auditory cortex

Auditory cortex (AC) data comprised cortical surface electrical potentials (CSEPs) recorded from rat auditory cortex with a custom-fabricated micro-electrocorticography (*μ*ECoG) array. The *μ*ECoG array consisted of an 8 × 16 grid of 40 *μ*m diameter electrodes. Anesthetized rats were presented with 50 ms tone pips of varying amplitude (8 different levels of attenuation, from 0 dB to −70 dB) and frequency (30 frequencies equally spaced on a log-scale from 500 Hz to 32 kHz). Each frequency-amplitude combination was presented 20 times, for a total of 4200 samples. The response for each trial was calculated as the high-*γ* band analytic amplitude of the CSEP, *z*-scored to baseline, calculated using a constant-Q wavelet transform. Of the 128 electrodes, we used 125, excluding 3 due to faulty channels. Data was recorded by Dougherty & Bouchard (DB). Further details on the surgical, experimental, and preprocessing steps can be found in [63].

#### Recordings from primary visual cortex

We analyzed three primary visual cortex (V1) datasets, comprised of spike-sorted units simultaneously recorded in three anesthetized macaque monkeys. Recordings were obtained with a 10 × 10 grid of silicon microelectrodes spaced 400 *μ*m apart and covering an area of 12.96 mm^{2}. Monkeys were presented with grayscale sinusoidal drifting gratings, each for 1.28 s. Twelve unique drifting angles (spanning 0° to 330°) were each presented 200 times, for a total of 2400 trials per monkey. Spike counts were obtained in a 400 ms bin after stimulus onset. We obtained [106, 88, 112] units from each monkey. The data was obtained from the Collaborative Research in Computational Neuroscience (CRCNS) data sharing website [64] and was recorded by Kohn and Smith (KS) [65]. Further details on the surgical, experimental, and preprocessing steps can be found in [66] and [67].

#### Recordings from primary motor cortex

Primary motor cortex (M1) data comprised spike-sorted units simultaneously recorded in the motor cortex of a Rhesus macaque monkey. Recordings were obtained with a chronically implanted silicon microelectrode array consisting of 96 electrodes spaced at 400 *μ*m and covering an area of 16 mm^{2}. We used three datasets, consisting of three recording sessions from monkey I. The behavioral task required the monkey to make self-paced reaches to targets arranged on an 8 × 17 grid. Spike counts were binned at 150 ms over the course of the entire recording session, resulting in [4089, 4767, 4400] samples per recording session. We obtained [136, 146, 147] units from each dataset. Data was recorded by O’Doherty et al. (OCMS) and obtained from Zenodo [68]. Further details on the surgical, experimental, and preprocessing steps can be found in [69].

#### Recordings from retina

Retina data comprised spiking activity extracellularly recorded from isolated mouse retinas. Recordings were obtained using a 61-electrode array. Isolated retinas were presented with a flickering black or white bar stimulus according to a pseudo-random binary sequence, with each frame lasting 16.6 ms. We utilized recordings from 23 different retinal ganglion cells. Data was obtained from CRCNS and recorded by Zhang et al. [70]. Further details on the surgical, experimental, and preprocessing steps can be found in [71].

#### Recordings from basal ganglia

Basal ganglia data comprised tetrode recordings from two regions of rat basal ganglia: the globus pallidus pars externa (GPe: 18 units) and substantia nigra pars reticulata (SNr: 36 units). Recordings were performed during a rodent stop-signal task. Briefly, a rat was prompted to enter a center port with a light cue. The rat remained in the port until a Go cue (audio stimulus at 1 kHz or 4 kHz) which directed a lateral head movement to the left or right ports. On a subset of trials, the Go cue was followed by a Stop signal (white noise burst), indicating that the rat should remain in the center port. We utilized the successful Go trials, in which the rat was not given a Stop signal and successfully entered the correct port (186 trials). We used the spike count in the first 100 ms after the rat exited the center port to predict the behavioral condition (left or right). Further details on the surgical, experimental, and preprocessing steps can be found in [72].

### 2.5 Neural data analysis and model fitting

All models fit to neural data consisted of various generalized linear models, depending on the application. We trained all baseline models using either the `glmnet` [56] or `scikit-learn` [73, 74] packages. Meanwhile, we trained all UoI models using the `pyuoi` package [55].

#### Model selection criterion in the estimation module

In the UoI framework, the estimation module operates by unionizing fitted stability supports across resamples. Thus, the module requires a criterion by which to choose the best fitted stability support per resample. A natural choice, akin to cross-validation, is the out-of-resample validation performance according to some measure (e.g., *R*^{2}, deviance, etc.). However, the use of cross-validation as a model selection tool is known to suffer from inconsistent feature selection [75–77]. Therefore, we instead utilized the Bayesian information criterion in the estimation module for each model, which has been shown to be model selection consistent [46].

#### Data analysis for coupling models

We used UoI_{Lasso} (rat auditory cortex) and UoI_{Poisson} (macaque V1 and M1) to fit coupling models. The auditory cortex model can be described with a linear model as

$$n_i = \beta_{i0} + \sum_{j \neq i} \beta_{ij} n_j + \epsilon_i, \tag{12}$$

where *n*_{i} is the high-gamma activity of the *i*th electrode on a trial. The baseline procedure consisted of a lasso optimization with coordinate descent, while the UoI approach utilized UoI_{Lasso}. The model for the spiking datasets, which utilizes a Poisson generalized linear model, can be written as

$$n_i \sim \text{Poisson}\left( \mu_i \right), \quad \log \mu_i = \beta_{i0} + \sum_{j \neq i} \beta_{ij} n_j, \tag{13}$$

where *n*_{i} corresponds to the spike count of the *i*th neuron. The corresponding objective function for this model is the log-likelihood (up to an additive constant independent of *β*),

$$\ell_i(\boldsymbol{\beta}_i) = \sum_{k=1}^{D} \left[ n_{ik} \left( \beta_{i0} + \sum_{j \neq i} \beta_{ij} n_{jk} \right) - \exp\left( \beta_{i0} + \sum_{j \neq i} \beta_{ij} n_{jk} \right) \right], \tag{14}$$

where *i* denotes that this model corresponds to the *i*th neuron, *j* indexes over the remaining neurons, and *k* indexes over the *D* data samples. The baseline approach consisted of applying coordinate descent to solve this objective function with an elastic net penalty:

$$\hat{\boldsymbol{\beta}}_i = \underset{\boldsymbol{\beta}_i}{\arg\max} \; \ell_i(\boldsymbol{\beta}_i) - \lambda_1 |\boldsymbol{\beta}_i|_1 - \lambda_2 |\boldsymbol{\beta}_i|_2^2, \tag{15}$$

where λ_{1} and λ_{2} are the two hyperparameters specifying the strength of the *ℓ*_{1} and *ℓ*_{2} penalties, respectively. Note that the intercept terms were not penalized. Meanwhile, UoI_{Poisson} utilized objective function (14) with only an *ℓ*_{1} penalty in the selection module. In the estimation module, we used Eq (14) with a very small *ℓ*_{2} penalty for numerical stability purposes. The specific optimization algorithm was a modified orthant-wise L-BFGS solver [78].
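A self-contained sketch of fitting the unpenalized Poisson coupling model on simulated counts, using Newton's method rather than the coordinate-descent and orthant-wise L-BFGS solvers used in the paper; the population size and coupling weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated population: predict neuron i's spike count from the other
# neurons' counts. Sizes and coupling weights chosen for illustration.
D, n_other = 500, 4
X = rng.poisson(2.0, size=(D, n_other)).astype(float)
beta_true = np.array([0.3, 0.0, -0.2, 0.0])
y = rng.poisson(np.exp(0.1 + X @ beta_true)).astype(float)

# Newton's method on the Poisson log-likelihood, with a tiny ridge on
# the Hessian for numerical stability (mirroring the small l2 penalty
# used in the estimation module).
Xb = np.column_stack([np.ones(D), X])   # prepend an intercept column
w = np.zeros(n_other + 1)
for _ in range(25):
    mu = np.exp(Xb @ w)                 # predicted firing rates
    grad = Xb.T @ (y - mu)              # gradient of the log-likelihood
    hess = Xb.T @ (Xb * mu[:, None])    # Fisher information matrix
    w = w + np.linalg.solve(hess + 1e-6 * np.eye(len(w)), grad)

print(np.round(w, 2))  # intercept, then coupling weights
```

With 500 samples, the recovered coupling weights land close to the generating values, and the weights of the uncoupled neurons stay near zero.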

#### Data analysis for encoding models

For retinal data, we fit spatio-temporal receptive fields (STRFs) frame-by-frame. Specifically, the STRF comprised *F* frames *β*_{1}, *β*_{2}, … , *β*_{F}, each a vector of size *M* and spanning Δ*t* seconds. For neuron *i* and frame *k*, the encoding model consisted of

$$ n_i(t) = \beta_{k0} + \boldsymbol{\beta}_k^{\top} \mathbf{e}(t - k\Delta t), $$

where *n*_{i}(*t*) is the spike count at timepoint *t* and **e**(*t* − *k*Δ*t*) is the flickering bar stimulus value at *k* bins before *t*. We fit the *F* models using lasso (baseline) and UoI_{Lasso}, and created the final STRF by concatenating the parameter values *β* = [*β*_{1}, *β*_{2}, … , *β*_{F}].
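
The frame-by-frame procedure can be sketched as follows on synthetic data. Here `lasso_ista` is a minimal proximal-gradient (ISTA) lasso, a stand-in for the solvers used in the analyses, and the lag structure of the synthetic cell is an illustrative assumption:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Minimal lasso via proximal gradient descent (ISTA)."""
    L = np.linalg.norm(X, 2) ** 2            # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = beta - X.T @ (X @ beta - y) / L  # gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - lam * len(y) / L, 0.0)
    return beta

rng = np.random.default_rng(1)
T, M, F = 1000, 8, 5                          # timepoints, bar positions, frames
stim = rng.choice([-1.0, 1.0], size=(T, M))   # pseudo-random flickering bars
# synthetic cell driven by bar 3 at a lag of 2 bins
spikes = np.roll(stim[:, 3], 2) + 0.5 * rng.normal(size=T)

frames = []
for k in range(1, F + 1):
    # regress activity at time t on the stimulus k bins earlier
    frames.append(lasso_ista(stim[:-k], spikes[k:], lam=0.05))
strf = np.stack(frames)                       # F x M receptive field
```

The stacked `strf` array recovers a single non-zero coefficient at the informative bar and lag, mirroring the spatio-temporal structure described above.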

The tuning model for the rat auditory recordings was constructed using Gaussian basis functions. We used eight Gaussian basis functions spanning the log-frequency axis, with means *μ*_{p} spanning the interval [5, 13]. Thus, the high-gamma activity *n*_{i} of electrode *i* in response to frequency *f* was

$$ n_i = \beta_{i0} + \sum_{p=1}^{8} \beta_{ip} \exp\left( -\frac{(\log_2 f - \mu_p)^2}{2\sigma^2} \right). \tag{17} $$

We chose *σ*^{2} = 0.64 octaves so that the basis functions sufficiently tiled the log-frequency axis. We chose *p* = 8 basis functions because this was the minimum number of basis functions for which every electrode had a selection ratio less than 1. We fit Eq (17) using cross-validated lasso as the baseline. To characterize the relationship between selection ratio and predictive performance of the rat AC tuning models, we fit trendlines across models using Gaussian process regression. Specifically, we utilized a regressor with a radial basis function kernel (length scale *ℓ* = 0.01) and a white noise kernel (noise level *α* = 0.1).
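
Constructing the Gaussian basis design matrix can be sketched as follows; the linearly spaced centers and tone frequencies below are illustrative assumptions, chosen only to show the shape of the computation:

```python
import numpy as np

def gaussian_basis(log_f, centers, sigma2=0.64):
    """Design matrix: one column per Gaussian basis function."""
    return np.exp(-(log_f[:, None] - centers[None, :]) ** 2 / (2.0 * sigma2))

# hypothetical basis centers tiling the log-frequency axis
centers = np.linspace(5.0, 13.0, 8)
tone_freqs = np.array([500.0, 1000.0, 2000.0, 4000.0])  # Hz (illustrative)
X = gaussian_basis(np.log2(tone_freqs), centers)
```

Each row of `X` is the basis representation of one tone pip, and the tuning model is then a linear regression of the evoked high-gamma response on these eight columns.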

#### Data analysis for decoding models

We fit decoding models to basal ganglia recordings as binary logistic regression models. The model expresses the probability of one experimental condition *e* (e.g., the rat entering the left port) as

$$ P(e \mid \mathbf{n}) = \frac{1}{1 + \exp\left( -\beta_0 - \sum_i \beta_i n_i \right)}, $$

where *n*_{i} is the neural activity of the *i*th neuron. The corresponding objective function is the log-likelihood, or

$$ \mathcal{L}(\boldsymbol{\beta}) = \sum_{k=1}^{D} \left[ y_k \log p_k + (1 - y_k) \log(1 - p_k) \right], \tag{19} $$

where *y*_{k} indicates the condition on the *k*th trial and *p*_{k} = *P*(*e* | **n**_{k}).

The baseline approach consisted of solving this objective function with an *ℓ*_{1} penalty. UoI_{Logistic} utilized objective function (19) with an *ℓ*_{1} penalty in the selection module, and Eq (19) alone in the estimation module.
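
A minimal sketch of the baseline decoding fit, using scikit-learn's ℓ1-penalized logistic regression on synthetic spike counts; the informative unit, penalty strength, and data are illustrative assumptions, not taken from the recordings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
counts = rng.poisson(3.0, size=(200, 54))   # trials x single units (synthetic)
logits = 0.9 * (counts[:, 5] - 3.0)         # only unit 5 carries information
side = (rng.random(200) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# baseline: l1-penalized logistic regression (C is the inverse penalty strength)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.2).fit(counts, side)
support = np.flatnonzero(clf.coef_[0])      # putative decoding sub-population
```

The non-zero entries of `clf.coef_` play the role of the selected sub-population of units carrying task-relevant information.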

#### Cross-validation, model training, and model testing

Each dataset was split into 10 folds after shuffling across samples (except for the basal ganglia data, which was split into 5 folds due to fewer samples). When appropriate, the folds were stratified to contain equal proportions of samples across experimental settings (e.g., stimulus value or behavioral condition). In each task, we fit 10 models (or five, for basal ganglia) by training each on 9 (4) folds and using the remaining fold as a test set. Hyperparameter selection for baseline procedures was performed via cross-validation within the training set of 9 (4) folds. Meanwhile, all resampling for the UoI procedures was also performed within the training set. Model evaluation statistics (selection ratio, predictive performance, Bayesian information criterion) are reported as the median across the 10 models. Any measures that operate on the fitted models (e.g., coefficient values, network formation, modularity, etc.) were calculated using the model formed by taking the median of each parameter across folds.
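
The fold handling and median-across-folds reporting can be sketched as follows; an ordinary least-squares fit stands in for the baseline and UoI procedures, and the synthetic data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_folds = 100, 10
X = rng.normal(size=(n_samples, 6))
y = X @ np.array([1.0, -0.5, 0.0, 0.0, 0.3, 0.0]) + 0.1 * rng.normal(size=n_samples)

order = rng.permutation(n_samples)          # shuffle across samples
folds = np.array_split(order, n_folds)

coefs, scores = [], []
for f in range(n_folds):
    test = folds[f]
    train = np.concatenate([folds[g] for g in range(n_folds) if g != f])
    # ordinary least squares stands in for the baseline/UoI fit
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    coefs.append(beta)
    resid = y[test] - X[test] @ beta
    scores.append(1.0 - resid.var() / y[test].var())  # R^2 on the held-out fold

r2_med = np.median(scores)                     # reported evaluation statistic
beta_med = np.median(np.stack(coefs), axis=0)  # model for coefficient-based measures
```

Evaluation statistics come from `scores` (median across folds), while coefficient-based analyses such as network construction operate on `beta_med`.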

#### Statistical tests

We used the Wilcoxon signed-rank test [79] to assess whether the distributions of selection ratios and predictive performances, across units, were significantly different between the UoI models and the baseline models. Importantly, we did not apply the test to the distribution of BICs, since differences in BIC are better interpreted as approximations to Bayes factors [80]. To assess whether distributions of UoI and baseline model parameters were significantly different, we used the Kolmogorov-Smirnov test. We applied a significance level of *α* = 0.01 for all statistical tests.
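
A minimal sketch of the two tests with SciPy, on synthetic per-unit selection ratios (the data are illustrative, constructed so that UoI ratios are uniformly smaller):

```python
import numpy as np
from scipy.stats import wilcoxon, ks_2samp

rng = np.random.default_rng(4)
baseline_sr = rng.uniform(0.4, 0.9, size=50)           # per-unit selection ratios
uoi_sr = baseline_sr / rng.uniform(2.0, 3.0, size=50)  # sparser UoI counterparts

# paired comparison of the two measures across units (alpha = 0.01)
_, p_paired = wilcoxon(baseline_sr, uoi_sr)

# distributional comparison, as used for the parameter values
_, p_ks = ks_2samp(baseline_sr, uoi_sr)
```

Both p-values fall well below the *α* = 0.01 threshold for this synthetic example, since every UoI ratio is smaller than its baseline counterpart.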

#### Effect size

To fully capture the difference in model evaluation metrics beyond statistical significance, we measured effect size using Cohen's *d* [81]. For two groups of data with sample sizes *D*_{1}, *D*_{2}, means *μ*_{1}, *μ*_{2}, and standard deviations *s*_{1}, *s*_{2}, Cohen's *d* is given by

$$ d = \frac{\mu_1 - \mu_2}{s}, $$

where *s* is the pooled standard deviation:

$$ s = \sqrt{\frac{(D_1 - 1)\, s_1^2 + (D_2 - 1)\, s_2^2}{D_1 + D_2 - 2}}. $$

We often considered cases where *D*_{1} = *D*_{2}, implying that $s = \sqrt{(s_1^2 + s_2^2)/2}$. Values of *d* on the order of 0.01 indicate very small effect sizes, while *d* > 1 indicates a very large effect size [82].
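
Cohen's *d* can be computed directly from the group statistics:

```python
import numpy as np

def cohens_d(x1, x2):
    """Cohen's d with the pooled standard deviation."""
    d1, d2 = len(x1), len(x2)
    s1, s2 = np.std(x1, ddof=1), np.std(x2, ddof=1)
    s = np.sqrt(((d1 - 1) * s1**2 + (d2 - 1) * s2**2) / (d1 + d2 - 2))
    return (np.mean(x1) - np.mean(x2)) / s
```

For example, `cohens_d([1, 2, 3, 4], [3, 4, 5, 6])` evaluates to approximately −1.55, a very large effect.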

### 2.6 Network creation and analysis

#### Network creation

We created directed graphs by filling the adjacency matrix *A*_{ij} with the coefficient *β*_{ij} (i.e., the coupling coefficient for neuron *j* in the coupling model for neuron *i*). Meanwhile, we created undirected networks from coupling models by symmetrizing coefficients [83]. Specifically, each entry of the symmetric adjacency matrix combines *β*_{ij} and *β*_{ji}, where *β*_{ij} is the coefficient specifying neuron *i*'s dependence on neuron *j*'s activity, and vice versa for *β*_{ji}. Thus, the network lacked an edge between vertices (neurons) *i* and *j* if and only if neuron *i*'s coupling model did not depend on neuron *j*, and neuron *j*'s coupling model did not depend on neuron *i*. This adjacency matrix is weighted in that each entry depends on the magnitudes of the coupling coefficients. However, we can also consider an unweighted, undirected graph, whose adjacency matrix is simply the binarization of *A*_{ij}.
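
The construction can be sketched on a synthetic coefficient matrix; the symmetrization shown here (averaging coefficient magnitudes) is one plausible choice that satisfies the edge criterion, not necessarily the exact rule used in the analyses:

```python
import numpy as np

rng = np.random.default_rng(5)
# sparse coupling coefficients: B[i, j] = beta_ij (neuron i's model, feature j)
B = rng.normal(size=(6, 6)) * (rng.random((6, 6)) < 0.3)
np.fill_diagonal(B, 0.0)                 # no self-coupling

A_dir = B.copy()                         # directed, weighted adjacency matrix
A_sym = 0.5 * (np.abs(B) + np.abs(B).T)  # undirected: average of magnitudes
A_bin = (A_sym != 0).astype(int)         # unweighted, undirected graph
```

With this choice, an edge between *i* and *j* is absent exactly when both *β*_{ij} and *β*_{ji} are zero.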

#### Modularity

The modularity *Q* is a scalar value that measures the degree to which a network is divided into communities [84]. We operate on the undirected graph described by the binarized adjacency matrix, described in the previous section. Suppose each vertex *v* is partitioned into one of *c* communities, where vertices within a community are more likely to be connected with each other than vertices between communities. Then, the modularity is defined as

$$ Q = \frac{1}{2m} \sum_{vw} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \delta(c_v, c_w), $$

where *c*_{v} denotes community identity, *δ* is the Kronecker delta, *k*_{v} is the degree of vertex *v*, and *m* is the total number of edges. Thus, the modularity is greater than zero when there exist more edges between vertices within the same community than would be expected by chance according to the degree distribution. Specifically, *Q* is bounded within the range [−1/2, 1], where *Q* > 0 indicates the existence of community structure.

We calculated modularity with the Clauset-Newman-Moore greedy modularity maximization algorithm [85]. This procedure assigns vertices to communities by greedily maximizing the modularity, and then calculates *Q* using the resulting community identities.
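
A minimal implementation of the modularity computation, evaluated on a toy graph of two triangles joined by a single bridging edge (this sketch assumes the community labels are given; the full analysis instead discovers them by greedy maximization):

```python
import numpy as np

def modularity(A, labels):
    """Newman modularity Q for a binary undirected adjacency matrix."""
    k = A.sum(axis=1)                  # vertex degrees
    two_m = k.sum()                    # 2 * number of edges
    same = labels[:, None] == labels[None, :]
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m

# toy graph: two triangles joined by a single bridging edge
A = np.zeros((6, 6), dtype=int)
A[:3, :3] = 1
A[3:, 3:] = 1
np.fill_diagonal(A, 0)
A[2, 3] = A[3, 2] = 1
Q = modularity(A, np.array([0, 0, 0, 1, 1, 1]))
```

For this toy graph, *Q* = 5/14 ≈ 0.36, correctly signaling strong community structure.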

#### Small-worldness

Small-world networks are characterized by a high degree of clustering with a small characteristic path length [17, 86, 87]. There are multiple measures used to quantify the degree to which a network is small-world. We use *ω*, which can be expressed as

$$ \omega = \frac{L_r}{L} - \frac{C}{C_\ell}, $$

where *L* is the characteristic path length of the network, *L*_{r} is the characteristic path length for an equivalent random network, *C* is the clustering coefficient, and *C*_{ℓ} is the clustering coefficient of an equivalent lattice network [88]. The quantity *ω* is bounded within [−1, 1], where *ω* close to 0 indicates that the graph is small-world. When *ω* is close to 1, the graph is closer to a random graph, while *ω* close to −1 implies the graph is more similar to a lattice graph.
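
Given the four constituent quantities, computing *ω* is direct; the values below are illustrative, not taken from the fitted networks (NetworkX's `omega` function can compute the reference quantities from a graph directly):

```python
def omega(L, L_rand, C, C_latt):
    """Small-worldness: omega = L_rand / L - C / C_latt."""
    return L_rand / L - C / C_latt

# illustrative values, not taken from the fitted networks
w_small_world = omega(L=2.5, L_rand=2.0, C=0.4, C_latt=0.5)  # near 0
w_random_like = omega(L=2.0, L_rand=2.0, C=0.1, C_latt=0.5)  # closer to 1
```

The first network has random-like path lengths but lattice-like clustering (*ω* ≈ 0, small-world), while the second loses its clustering (*ω* closer to 1, random-like).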

## 3 Results

Parametric models are ubiquitous data analysis tools in systems neuroscience. However, their usefulness in understanding a neural system hinges on the assumption that their parameters are accurately selected and estimated. By accurate selection, we mean low false positives and false negatives in setting parameters equal to zero; by accurate estimation, we mean low-bias and low-variance in the parameter estimates. The potential neuroscientific consequences of improper selection or estimation during inference are generally not well understood. Thus, we studied selection and estimation in common systems neuroscience models by comparing the properties of models inferred by standard methods to those inferred by the Union of Intersections (UoI) framework, which has been shown to achieve state-of-the-art inference in such models. We fit models spanning functional coupling (coupling networks from auditory cortex, V1, and M1), sensory encoding (spatio-temporal receptive fields from retinal recordings and tuning curves from auditory cortex), and behavioral decoding (classifying behavioral condition from basal ganglia recordings). We analyzed the fitted models to assess whether improvements in inference impact the resulting neuroscientific conclusions.

### 3.1 Highly sparse coupling models maintain predictive performance

Functional coupling models detail the statistical interactions between the constituent units (e.g., neurons, electrodes, etc.) of a population. Such models can be used to construct networks, whose structural properties may elucidate the functional and anatomical organization of the neurons within the population [14, 15, 19]. Enhanced sparsity in these models could result in different inferred functional sub-networks reflected in the ensuing graph. Furthermore, obtaining biased parameter estimates obscures the relative importance of neuronal relationships in specific sub-populations. Therefore, precise selection and estimation in coupling models is necessary to properly relate the network structure to the statistical relationships between neurons.

We examined the possibility of building highly sparse and predictive coupling networks by fitting coupling models to data from three brain regions: recordings from auditory cortex (AC), primary visual cortex (V1), and primary motor cortex (M1). The AC data consisted of micro-electrocorticography (*μ*ECoG) recordings from rat during the presentation of tone pips (Dougherty & Bouchard, 2019: DB). The V1 data consisted of single-unit recordings in macaque during the presentation of drifting gratings (Kohn & Smith, 2016: KS). The M1 data consisted of single-unit recordings in macaque during self-paced reaches on a grid of targets (O’Doherty, Cardoso, Makin, & Sabes: OCMS). See Methods for further details on experiments, model fitting, and metrics used for model evaluation, and see Table B.1 for a model statistic summary.

We constructed coupling models consisting of either a regularized linear model (AC) or Poisson model (V1, M1) in which the activity of an electrode/single-unit (i.e., node) was modeled using the activities of the remaining electrodes/single-units in the population. Thus, each dataset had as many models as there were distinct electrodes/single-units. We quantified the size of the fitted models with the selection ratio, or the fraction of parameters that were non-zero. We compare the selection ratio between baseline and UoI coupling models across electrodes/single-units in Fig 4a. For all three brain regions, UoI models exhibited a marked reduction in the number of utilized electrodes/single-units. Specifically, UoI models used 2.24 (AC), 2.21 (V1), and 5.50 (M1) times fewer features than the corresponding baseline models. Across the populations of electrodes/neurons, this reduction was statistically significant (*p* << 0.001; see Table B.1b) with large effect sizes (AC: *d* = 1.74; V1: *d* = 2.26; M1: *d* = 2.49; see Table B.1b). Interestingly, while the reductions in features for AC and V1 are roughly similar, the M1 models exhibit a much larger reduction in selection ratio, an observation that holds across the three M1 datasets. This disparity in feature reduction across brain regions indicates that false positives in the baseline coupling models may reflect differences in the nature of neural activity for these regions.

We assessed whether the reduction in features resulted in meaningful loss of predictive accuracy. We measured predictive accuracy using the coefficient of determination (*R*^{2}) for linear models (AC) and the deviance for Poisson models (M1, V1), both evaluated on held-out data. Note that in contrast to *R*^{2}, lower deviance is preferable. The predictive performances of baseline and UoI models for each brain region are compared in Fig 4b. We observed almost no change in predictive performance across brain regions, with most points lying on or close to the identity line. We note that while the differences in performance across all models were statistically significant (AC: *p* < 10^{−3}; V1: *p* << 0.001; M1: *p* << 0.001; see Table B.1c), the effect sizes of the reduction in predictive performance were very small (AC: *d* = 0.005; V1: *d* = 0.05; M1: *d* = 0.03; see Table B.1c), making the reduction irrelevant in practice. Thus, these results imply that highly sparse coupling models exhibit little to no loss in predictive performance across brain regions and datasets.

We captured the two previous observations — increased sparsity and maintenance of predictive accuracy — with difference in Bayesian information criterion (BIC) between baseline and UoI methods, ΔBIC = BIC_{baseline} − BIC_{UoI}. Lower BIC is preferable, so that positive ΔBIC indicates that UoI is the more parsimonious and preferred model. The distribution of ΔBIC across coupling models is depicted in Fig 4c. ΔBIC is positive for all models, with a large median difference (AC: 170; V1: 149; M1: 186; see Table B.1d), providing very strong evidence against the baseline models.
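
The trade-off that ΔBIC captures can be sketched for a Gaussian linear model, where (up to an additive constant) BIC reduces to *n* log(RSS/*n*) + *k* log *n*; the sample sizes, residual errors, and parameter counts below are illustrative, not from our datasets:

```python
import numpy as np

def bic_gaussian(n, rss, k):
    """BIC for a Gaussian linear model, up to an additive constant."""
    return n * np.log(rss / n) + k * np.log(n)

# illustrative numbers: a dense baseline fit vs. a sparser UoI-style fit
# with nearly identical residual error
delta_bic = bic_gaussian(1000, 120.0, 40) - bic_gaussian(1000, 125.0, 12)
```

Here `delta_bic` is positive: the small increase in residual error is outweighed by the reduction in parameters, so the sparser model is preferred.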

To characterize the functional relationships inferred by the coupling models, we examined the distribution of coefficient values. We normalized each model's coefficients by the coefficient with the largest magnitude across the baseline and UoI models, and concatenated coefficients across models and datasets. We visualized the baseline and UoI coefficient values using a 2-d hexagonal histogram (Fig 4d). First, we observed a density of bins above (positive coefficients) and below (negative coefficients) the identity line (Fig 4d: red dashed line). This indicates that the magnitudes of non-zero coefficients as fit by UoI are larger than those of the corresponding non-zero coefficients as fit by the baseline, demonstrating the amelioration of shrinkage and therefore a reduction in bias. Next, we observed a density of bins on the *x* = 0 line, indicating that a sizeable fraction of coefficients determined to be non-zero by baseline methods are set equal to zero by UoI. This density corroborates the reduction in selection ratio observed in Fig 4a. We further note that the density of bins on the *x* = 0 line encompasses a wide range of baseline coefficient values, especially for the V1 and M1 datasets. This implies that a feature selection procedure based on thresholding the magnitudes of the fitted parameters (e.g., BoATS) would not reproduce these results. Lastly, we observed no density of bins along the *y* = 0 line, indicating that UoI models are likely not identifying functional relationships that do not exist (i.e., suffering from false positives).

While many of the coefficients set equal to zero by UoI have large magnitude (as measured by baseline methods), the bulk of density lies in coefficients with small magnitude. We found that the difference in distributions of non-zero coefficients between the two procedures is statistically significant (*p* << 0.001; Kolmogorov-Smirnov test). We highlight the marginal distribution of non-zero coefficients whose magnitudes are small (Fig 4d, top and side histograms). While the baseline histograms (Fig 4d, top histograms) have the largest density of coefficients close to zero, the UoI histograms, in a similar range, exhibit a large reduction in density. Together, these results indicate that standard inference methods produce overly dense networks with highly biased parameter estimates, giving rise to qualitatively different parameter distributions.

#### Accurate inference enhances visualization, increases modularity, and decreases small-worldness in functional coupling networks

Functional coupling networks are useful in that they provide opportunities to visualize the statistical relationships within a population. Furthermore, their graph structures can be analyzed to characterize global properties of the network. The previous results show that improved inference gives rise to equally predictive models, but with much greater sparsity and qualitatively different parameter distributions. Thus, we next determined the impact on network visualization and structure. To this end, we constructed and analyzed both directed networks from the coupling coefficients and undirected networks by symmetrizing coupling coefficient values (see Methods).

We first visualized the AC networks by plotting the baseline and UoI networks according to their spatial organization on the *μ*ECoG grid (Fig 5a). Each vertex in Fig 5a is color-coded by preferred frequency, while the symmetrized coupling coefficients are indicated by the color (sign) and weight (magnitude) of edges between vertices. We observed that the UoI network is easier to visualize, with densities of edges clearly demarcating regions of auditory cortex. This is in contrast to the baseline network, whose lack of sparsity makes it difficult to extract any meaningful structure from the visualization. For example, the UoI network exhibits a clear increase in edge density in primary auditory cortex (PAC) relative to the posterior auditory field (PAF) and ventral auditory field (VAF). Thus, the increased sparsity in UoI networks reveals graph structure that ties in closely with the general anatomical structure of the recorded region.

We visualized the V1 and M1 networks by fitting nested stochastic block models to the directed graphs and plotting the ensuing structure in a circular layout with edge bundling (Fig 5b, c). The nested stochastic block model identifies communities of vertices (neurons) in a hierarchical manner. We color-coded the vertices of the visualized graphs according to preferred tuning (drifting grating angle for V1 networks and hand movement angle for M1 networks). The UoI V1 network exhibits clear structure, with specific communities identifying similarly tuned neurons (Fig 5b). The baseline V1 network exhibited less clear structure and was highly unbalanced, with more than half the neurons placed in the same community. For the M1 data, the UoI communities were more balanced, though they lacked clear association with tuning properties. Together, these results demonstrate that the enhanced sparsity of functional coupling networks facilitates their interpretation through cleaner visualizations and a clearer connection to functional response properties.

The plots above suggest different graph structures in the networks extracted by UoI and baseline methods. Thus, we first calculated the in-degree and out-degree distributions of the vertices in both networks (Fig 5d). We observed that the in-degrees and out-degrees of the UoI networks are much smaller, as one might expect given the reduction in edges. Furthermore, the in- and out-degree distributions describing the UoI networks are similar to each other, in contrast to those of the baseline networks. Next, we calculated the modularity of the networks, which quantifies the degree to which the networks exhibit community-like structure. We found that the modularity of the UoI networks is much larger than that of the baseline networks, indicating that UoI networks express more community structure (Fig 5e). These results corroborate the visual findings in Fig 5b. Since modularity implicitly depends on the degree distribution, the enhanced community structure exhibited by the UoI networks is not simply a property of the reduction of in- and out-degrees, emphasizing that the enhanced sparsity is functionally meaningful. We observed similar results in functional networks built from linear models, rather than Poisson models, implying that more accurate inference ensures that the structure of coupling networks persists across the type of underlying model (Appendix A and Fig. A).

Finally, we examined the small-worldness of the networks, a graph structure commonly used to describe brain networks. Small-world networks are characterized by a high degree of clustering and low characteristic path length, making them efficient for communication. We used *ω* to calculate small-worldness, whose values are bounded by the range [−1, 1], with *ω* close to −1, 0, and 1 indicative of lattice, small-world, and random structure, respectively. Interestingly, UoI networks are considerably less small-world than the baseline networks (Fig 5f). However, we note that all networks are more small-world than they are random. Furthermore, the small-worldness of the networks is dependent on the brain region. For example, the V1 networks exhibit substantially more small-worldness than the auditory cortex or M1 networks. Together, these results demonstrate that several properties of networks can be substantially altered by the utilized inference procedure, and that UoI networks are more modular and less small-world.

### 3.2 Parsimonious tuning from encoding models

A long-standing goal of neuroscience is to understand how the activity of individual neurons is modulated by factors in the external world (e.g., how the position of a moving bar is encoded by a neuron in the retina). In such encoding models, incorrect feature selection or parameter bias may mistakenly implicate factors in the production of neural activity, or misstate their relative importance. Thus, we examined how improved inference impacts tuning models, where an external stimulus is mapped to the corresponding evoked neural activity.

We first fit spatio-temporal receptive fields (STRFs) to single-unit recordings from isolated mouse retinal ganglion cells during the presentation of a flickering black or white bar stimulus (generated by a pseudo-random sequence). We used a linear model with a lasso penalty to fit STRFs (i.e., regularized, whitened spike-triggered averaging) to recordings from 23 different cells, using a time window of 400 ms. Thus, the fitted STRFs were two-dimensional, with one dimension capturing space (location in the bar stimulus) and the other capturing time relative to neural spiking. For further experimental and model fitting details, see Methods. See Table B.2 for a dataset and model statistic summary.

The fitted STRFs for an example retinal ganglion cell are depicted in Fig 6a. The UoI STRF captures the ON-OFF structure exhibited by the baseline STRF. However, the UoI model is noticeably sparser, resulting in a tighter spatial receptive field. The features set to zero by UoI (relative to baseline) include regions both further from the dominant ON-OFF structure, and regions very close to the center. Additionally, the coefficient values of the UoI STRF are noticeably larger in magnitude, implying that the improved estimation has alleviated shrinkage. These results suggest that the baseline procedure may produce false positives in both the central and distal regions of the STRF, implying that the relevant regions for predicting the activity of retinal ganglion cells may be more restricted.

We compared the selection ratios across fitted STRFs (Fig 6b) and found UoI fits to be substantially sparser, with a median reduction factor of 4.98. This reduction was statistically significant (*p* << 0.001; see Table B.2b) and had a very large effect size (*d* = 3.05; see Table B.2b). At the same time, UoI models exhibited a statistically significant improvement in *R*^{2} (*p* < 0.01; see Table B.2c), but with a very small effect size, making the improvement irrelevant in practice (*d* = 0.05; see Table B.2c). Meanwhile, the BIC differences (Fig 6d) were all large and positive (median difference = 654; see Table B.2d), providing very strong evidence in favor of the UoI model. Lastly, we compared the distribution of baseline and UoI encoding coefficients, normalized to the largest magnitude coefficient. We found evidence that the UoI models exhibited reduced shrinkage (Fig 6e: larger tails), and a substantial reduction in false positives (Fig 6e: reduced density at origin). Thus, improved inference resulted in STRFs with tighter structure, in better agreement with theoretical work characterizing such receptive fields and more accurately reflecting the visual features that explain the production of neural activity in retinal ganglion cells [89, 90].

Across neurons, we observed selection ratios and predictive performances spanning a wide range of values (Fig 6b, c). We might expect that models with little predictive accuracy utilize fewer features, since poor predictive performance indicates that the provided features are inadequate. For example, in the limit that the model has no predictive capacity (*R*^{2} ≤ 0), the model should utilize no features, since such an *R*^{2} indicates that none of the available features are relevant for reproducing the response statistics better than the mean response value. Therefore, we sought to determine whether inaccurate inference mistakenly identifies models as tuned (i.e., non-zero tuning features), when in fact a “non-tuned” model is appropriate (e.g., an intercept model: all features set equal to zero). To this end, we utilized a dataset in which the feature space dimensionality is small. This scenario provides a suitable test bed for assessing whether an intercept model could arise, and if such a model is appropriate given the response statistics of the data. We examined a dataset consisting of *μ*ECoG recordings from rat auditory cortex during the presentation of tone pips at various frequencies. We employed a linear tuning model mapping frequency to the peak (*z*-scored relative to baseline, see Methods), high-*γ* band analytic amplitude of each electrode. The model features consisted of 8 Gaussian basis functions that tiled the log-frequency space.

We first examined whether more accurate inference resulted in any qualitative changes in the fitted encoding models. We plotted the fitted tuning curves as a function of log-frequency for a subset of electrodes arranged according to their location on the *μ*ECoG grid (Fig 6f). Interestingly, the baseline and UoI tuning curves exhibit similar structure for a large fraction of the electrodes on the grid, in many cases matching closely (e.g., Fig 6f: anterior side of grid). In other cases, particularly on the posterior side of the grid, the UoI tuning curves exhibit similar broad structure with noticeable smoothing, indicating that the improved inference has simplified the tuning model.

We compared the selection ratio of the models (Fig 6g), finding that the UoI tuning curves utilize fewer features than those fit by baseline, with a median reduction factor of 2.5 that is statistically significant (*p* << 0.001; see Table B.2b) and a large effect size (*d* = 2.19; see Table B.2b). Furthermore, despite a statistically significant decrease in *R*^{2} (Fig 6h) across electrodes (*p* << 0.001; see Table B.2c), the observed effect size is very small (median Δ*R*^{2} = 0.001; *d* = 0.05; see Table B.2c). Meanwhile, we observed a median BIC difference of 19.4, providing very strong evidence in favor of the UoI models (Fig 6i; see Table B.2d). Taken together, these results imply that the reduction in features did not harm the predictive performance of the tuning models, thereby enhancing their parsimony. We highlight four electrodes whose selection ratios, according to UoI, are exactly zero in Fig 6g (pink points). These four "non-tuned" electrodes are among the least predictive, with *R*^{2} close to or below zero for both baseline and UoI methods (Fig 6h: pink points). Interestingly, the baseline selection ratio for one of these models was close to 0.4, indicating that UoI is not trivially generating intercept models. We visually examined the frequency-response areas (FRAs) of these four electrodes, which detail the mean response values as a function of sound frequency and amplitude (Fig. C). We compared them to the FRAs of two randomly chosen electrodes, finding they had little discernible structure relative to the "tuned" FRAs. Thus, while more accurate inference in encoding models may not always result in perceptible changes in their appearance (e.g., Fig 6f), there are cases where inaccurate inference may mistakenly imply that a constituent unit is tuned, when in fact the stimulus features are not relevant for capturing its response statistics.
To understand the behavior of the selection ratio as *R*^{2} → 0, we examined the relationship between selection ratio and *R*^{2} for baseline (gray) and UoI (red) models (Fig 6j). We found that the sparser models exhibit lower predictive power, with model trends predicting that at *R*^{2} = 0, the selection ratio for the baseline model will be 0.35 ± 0.10 (mean ± 1 s.d.) while the UoI selection ratio will be 0.12 ± 0.10. This demonstrates that baseline procedures can suffer from false positives even when their fitted models exhibit little to no explanatory power. Overall, our results reveal that improved inference more accurately identifies units as non-tuned when their encoding models lack predictive ability.

### 3.3 Decoding behavioral condition from neural activity with a small number of single-units

Decoding models describe which neuronal sub-populations contain information relevant for an external factor, such as a stimulus or a behavioral feature. Such models can identify which neurons may be useful to a downstream population for a task that requires knowledge of an external factor. Specifically, a decoding model’s non-zero parameters can be interpreted as the sub-population of neurons containing the task-relevant information, emphasizing the need for precise selection. Additionally, the model details how specific neurons describe the decoded variable through the magnitudes of its parameters, requiring unbiased estimation. Thus, we sought to assess the degree to which accurate inference impacts data-driven discovery in neural decoding models.

For this analysis we examined 54 single units in the rat basal ganglia (18 units from globus pallidus pars externa, GPe, and 36 from the substantia nigra pars reticulata, SNr) that were recorded simultaneously during performance of a behavioral task involving rapid leftward or rightward head movements in response to cues. Details of the task are given in [72]; the analysis was restricted to trials in which a correct head movement was made. Thus, the decoding model consisted of binary logistic regression predicting trial outcome (left or right) using the single-unit spike counts as features. We fit the logistic regression with an *ℓ*_{1} penalty (baseline) and the UoI framework (UoI_{Logistic}). Further details on the experimental setup and model fitting can be found in Methods. See Table B.3 for a summary of the dataset and fitted model statistics.

The selection ratios for GPe and SNr, as obtained by baseline and UoI procedures, are depicted in Fig 7a. In GPe, the UoI decoding models utilized about half the number of parameters as the baseline procedures. Meanwhile, in SNr, the UoI models utilized four times fewer parameters than the baseline. Furthermore, we observed that UoI model sizes were more consistent across folds of the data. For example, the baseline SNr decoding models estimated anywhere from 1 to 21 parameters (out of 36) depending on the data fold, while UoI models consistently used only 2 or 3 parameters (Fig 7a: IQR bars). These results demonstrate that neural decoding models are capable of utilizing fewer features to predict relevant behavioral variables. Furthermore, the stability principle ensures that these features are more robust to perturbations of the data (e.g., random subsamples).

To examine whether the use of fewer single-units decreased predictive performance, we evaluated the decoding models' classification accuracy on held-out data, depicted in Fig 7b. The classification accuracy of the UoI models is equal to that of the baseline models for both GPe (67%) and SNr (100%). Furthermore, in both regions, the classification accuracy is greater than chance (56%), implying that the models are extracting meaningful information about the behavioral condition from the neural response. Interestingly, the median SNr models achieve perfect classification accuracy. The UoI model achieves this performance utilizing only 2 neurons, in contrast to the median baseline model, which utilizes 8. Therefore, the activities of only a small subset of neurons are required to predict the behavioral condition, an observation that required accurate inference to consistently capture.

We examined the fitted coefficient values for each brain region and fitting procedure (Fig 7c). First, we observed that all coefficients set to zero by the baseline procedure are also set to zero by UoI. Additionally, the coefficients set to zero by UoI, but not by the baseline procedure, typically have smaller magnitudes than the coefficients that are non-zero under both procedures. Finally, the coefficients retained as non-zero by UoI have larger magnitudes than their values under the baseline procedure. These observations imply that the UoI procedure, for this task, consistently utilized only the most important neurons to predict the behavioral condition. At the same time, UoI elevated the coefficient values relative to the baseline procedure, implying that the baseline procedure suffered from substantial parameter shrinkage. Overall, these results demonstrate that task-relevant information is conveyed by a sparse subset of basal ganglia neurons, especially in SNr. SNr is a basal ganglia output nucleus, receiving converging inputs from multiple basal ganglia structures including GPe. The finding that SNr decodes behavioral output more selectively and accurately than GPe is consistent with the idea that SNr is closer to post-decision behavioral outputs, whereas GPe represents internal preparatory states [72].

## 4 Discussion

Parametric models are used pervasively in systems neuroscience to characterize neural activity. The parameters of these models must be precisely selected and estimated in order to ensure accurate interpretation, particularly in the sparse parameter regime that is desirable for neural data. Here, we used the UoI framework, which achieves state-of-the-art performance in balancing selection and estimation, to assess the degree to which poor parameter inference may impact the neuroscientific interpretation of parametric models. We fit functional coupling, encoding, and decoding models to a battery of neural datasets, using standard and UoI inference procedures. We found that, across all models, the number of non-zero parameters could be reduced by a factor of 2–5 while maintaining predictive performance. Furthermore, we found broader, structural differences in the models beyond enhanced sparsity, which resulted in concrete changes to their neuroscientific interpretation.

The parameters obtained in coupling models denote the existence and strength of functional relationships between constituent units in a population. We observed striking differences in the distribution of these parameter estimates, which impacted the graph structure, manifesting in increased modularity and decreased small-worldness. These results do not directly contradict previous work characterizing brain networks as small-world, but they do temper the strength of that characterization [17, 87, 91] (though see [92]). Coupling model parameters have also been assessed for their recapitulation of synaptic weight distributions in neural circuits, in some cases identifying parameter biases induced by specific dynamical regimes of neural activity [93]. The coupling parameters extracted by UoI better reflect the weight distribution observed in neural circuits, which is characterized by sparse connectivity with a heavy-tailed distribution [94, 95]. This was not achieved by the baseline procedures, suggesting that the previously identified biases could instead be due to inaccurate inference. The salient differences in the inferred coupling parameter distributions we observed motivate similar examination in models that capture neural dynamics, such as vector auto-regressive models [96, 97], which could be further assessed by controllability metrics used in recent work on fMRI networks [98].

The parameters in encoding models detail which features modulate neural activity. We observed that the application of UoI to the encoding models highlighted cases where the fitted model had only zero parameters, other than the intercept. Such an intercept-only model implies that a tuning model may be inappropriate for capturing the response statistics of the constituent units. This observation can be understood as a natural consequence of stability enforcement during parameter inference: UoI benefits from the stability principle by only utilizing selected features that persist across data resamples. The use of data resamples mimics perturbing the dataset, ensuring that features are included only if they are robust to those perturbations. Thus, the stability principle enforces model parsimony by encouraging the use of fewer features, eliminating those that offer no predictive accuracy throughout the resamples. However, UoI prioritizes predictive accuracy in the model averaging step. Therefore, models will only be made “as simple as possible” (e.g., removing all features) when they possess no predictive ability. In contrast to the auditory cortex, the spatio-temporal receptive fields we fit to retinal ganglion cells exhibited the typical ON-OFF structure [8]. However, the stability enforcement of UoI resulted in models that were more spatially constrained, in better agreement with theoretical work characterizing these receptive fields [90, 99]. More broadly, these results indicate that such improvements in parameter inference could serve to close the gap between experiment and theory in systems neuroscience.

Decoding models can inform which internal factors contain information about a task-relevant external factor. We found that decoding models could be fit using fewer single-units, at no cost to classification accuracy, implying that task-relevant information can be confined to a very small fraction of a neural circuit. This has implications for communication between brain areas: wiring constraints often restrict information transmission to a relatively small number of projection neurons, suggesting that these neurons contain the relevant information required to “decode” a given signal [100, 101]. These results, in which we identified a very small fraction of the neurons capable of accurate decoding, raise the possibility that the selected neurons are, in fact, the projection neurons. Decoding using fewer single-units also has practical implications: brain-machine interfaces (BMIs), which rely on accurate decoding from neural activity to operate, could reduce their power consumption by using decoders that depend on fewer single-units. Furthermore, an abundance of work has considered the impact of correlated trial-to-trial variability on the fidelity of a neural code by assessing the decoding ability of neural populations as a function of population size [102]. These decoding analyses can be informed by knowledge of the sparse sub-populations predictive of an external factor, which these results indicate are smaller than previously thought. Together, these results imply that accurate inference procedures, more capable of discovering specific task-relevant neuronal sub-populations, could drive the development of normative theories of neural communication and decoding.

Across brain regions and models, UoI resulted in more parsimonious models with differences in predictive performance that were irrelevant in practice, as measured by Cohen’s *d*. However, statistical tests comparing the distributions of predictive performance between the baseline and UoI models revealed a statistically significant decrease in predictive performance in some cases (coupling models, AC tuning) and a statistically significant increase in others (retinal STRF, decoding). Depending on one’s goals, relying solely on predictive performance to judge a model may be unreliable [75–77, 103]. In particular, because model interpretability depends crucially on the included features and their estimates, we prioritized feature selection and estimation. Cross-validated predictive accuracy is often a poor criterion for accurate feature selection; in these cases, the BIC, which rewards model parsimony, serves as a more suitable criterion [46], and universally favored the UoI models (though we note there is no single preferred model selection criterion [46, 77, 104–106]).
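For concreteness, the two criteria discussed here can be computed as below. The function names, fold-wise accuracies, and likelihood values are our own illustrative choices, not values from the paper.

```python
# Illustrative computations of the two model-comparison criteria:
# Cohen's d for effect size and BIC for penalized model comparison.
# All numbers are synthetic.
import numpy as np

def cohens_d(a, b):
    """Effect size between two samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1)
                  + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion; lower values favor the model."""
    return n_params * np.log(n_samples) - 2.0 * log_likelihood

rng = np.random.default_rng(1)
acc_baseline = rng.normal(0.670, 0.02, size=20)  # accuracy per data fold
acc_uoi = rng.normal(0.668, 0.02, size=20)
d = cohens_d(acc_uoi, acc_baseline)
print(f"Cohen's d = {d:.3f}")  # typically small |d| for overlapping folds

# A sparser model with nearly the same likelihood wins on BIC.
print(bic(-500.0, n_params=30, n_samples=1000))  # denser model
print(bic(-505.0, n_params=8, n_samples=1000))   # sparser model, lower BIC
```

The BIC comparison illustrates the trade-off in the text: a modest drop in likelihood is outweighed by the penalty on parameter count when the model is much sparser.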

We considered models of neural activity exclusively in terms of coupling, encoding, or decoding. However, past studies have built parametric models of neural activity by using other features or model structures. For example, the combination of coupling and encoding in a single model is a natural extension which has been examined extensively in previous work [6–8, 13, 107]. Other features that are not constrained within coupling, encoding, or decoding — such as spike-time history or global fluctuations — are also important to incorporate [7, 9, 108]. Additionally, latent variable models have been used to great success to capture, in particular, the dynamics of neural activity [109–111]. In this work, the stability principles used by UoI resulted in a significant difference in the model sparsity and estimated parameter distribution, which impacted interpretation. It is worthwhile to assess whether similar results can be achieved in these extended models, and if so, determine the neuroscientific consequences.

We restricted our analysis to generalized linear models, because their structure lends itself well to interpretation, making them ubiquitous in neuroscience and biology. However, the improvements we obtained by encouraging stability and sparsity in these models may extend to other classes of models. For example, dimensionality reduction methods have also played an important role in systems neuroscience [112]. The UoI framework is naturally extendable to such methods, including column subset selection [53] and non-negative matrix factorization [54]. Furthermore, recent work has found success using artificial neural networks (ANNs), which excel at predictive performance, to model neural activity [113]. Since ANNs are highly parameterized, these models are not interpretable in the sense that their parameter values do not directly convey neuroscientific meaning. Instead, these models are often interpreted through the lens of learned representations or the recapitulation of emergent properties of neural activity. Future work could assess whether the modeling principles explored in this work could have similar effects for ANNs modeling neural activity, especially given recent advancements in compressing such models [114].

In this work, using the UoI framework, we assessed the consequences of precise parameter inference specifically on systems neuroscience models. We used UoI because of its demonstrated success at parameter selection and estimation and our familiarity with the framework, but it is not alone in the class of inference algorithms that excel at parameter inference [60, 115]. Furthermore, model-based approaches to data-driven discovery are employed prolifically throughout biology [116], spanning genomics [117], cell biology [118], epidemiology [119], ecology [120] and others. Thus, the sparsity and stability properties invoked by the UoI framework, and other approaches, could serve to reshape model interpretability across a wide range of biological contexts.

## A Comparison of Poisson and linear coupling models for single-unit activity

We used a Poisson distribution to model single-unit spike count activity in the M1 and V1 datasets. However, past work has modeled single-unit activity with a linear-Gaussian model, after applying a variance-stabilizing square-root transform to the spike count responses. The degree to which a linear model can capture the functional relationships identified by a Poisson model for spiking data is unclear. Thus, we sought to characterize this capability, and its dependence on the inference procedure. We modeled the neural activity after a square-root transform using a linear model,

√(*y*_{i}) = *β*_{0} + Σ_{j≠i} *β*_{j} √(*y*_{j}) + *ε*_{i},

where *y*_{i} denotes the spike count of unit *i* and *ε*_{i} is Gaussian noise.

We fit this model with lasso optimization by coordinate descent (baseline) and UoI_{Lasso}.
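A minimal sketch of the baseline linear coupling fit on synthetic data follows; the population size, firing rates, and use of scikit-learn's `LassoCV` are our assumptions, and the UoI_{Lasso} fit is omitted.

```python
# Baseline linear coupling fit: lasso regression on square-root-
# transformed spike counts, predicting one unit's activity from the
# rest of the population. Synthetic Poisson counts stand in for data.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
n_samples, n_units = 500, 40
counts = rng.poisson(4.0, size=(n_samples, n_units)).astype(float)
Y = np.sqrt(counts)                        # variance-stabilizing transform

target = 0                                 # the unit being modeled
X = np.delete(Y, target, axis=1)           # all other units as predictors
y = Y[:, target]

lasso = LassoCV(cv=5).fit(X, y)
support = np.flatnonzero(lasso.coef_)      # the selection profile S
print(f"selected {support.size} of {X.shape[1]} coupling terms")
```

Repeating this fit with each unit as the target yields one selection profile per unit, which is what the hypergeometric comparison below operates on.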

We compared the fitted selection profiles, i.e. the sets *S* = {*i* | *β*_{i} ≠ 0}, between the linear and Poisson models. To do so, we used the hypergeometric distribution, which describes the probability that *k* objects with a particular feature are drawn from a population of size *M* containing *K* total objects with that feature, using *m* draws without replacement. To frame this in terms of selection, suppose the Poisson model has |*S*_{Poisson}| = *K* non-zero parameters out of the *M* possible features. Then, the probability that the linear model, which has |*S*_{linear}| = *m* non-zero parameters, would match the Poisson model on *k* such features is given by the hypergeometric distribution:

Pr(*k*) = C(*K*, *k*) · C(*M* − *K*, *m* − *k*) / C(*M*, *m*),

where C(·, ·) denotes the binomial coefficient.
Thus, the probability that a selection profile drawn by chance would overlap at least as well as the linear model’s is given by *p*_{overlap} = 1 − Pr(*k* < *k*_{linear}). We compared the distribution of *p*_{overlap} across coupling models, calculated for both baseline and UoI procedures (Fig. A, panel a). For the V1 data, the UoI linear models better reproduced the Poisson selection profiles, with 98% of the profiles fit by UoI satisfying *p*_{overlap} < 0.001, compared to only 74% of the baseline selection profiles. In contrast, in the M1 data, both inference procedures fit linear models that closely matched the selection profiles of the Poisson models, with 99% of the selection profiles satisfying *p*_{overlap} < 0.001. Therefore, improved inference results in more consistent selection across model types, and this consistency depends on brain region.
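The overlap probability maps directly onto SciPy's hypergeometric distribution; the counts below are illustrative, not values from the analysis.

```python
# p_overlap for one hypothetical pair of selection profiles.
from scipy.stats import hypergeom

M = 100        # total candidate features
K = 12         # non-zero parameters in the Poisson model
m = 10         # non-zero parameters in the linear model
k_linear = 8   # observed overlap between the two selection profiles

# hypergeom(M, K, m): population M, K marked objects, m draws.
# sf(k_linear - 1) = Pr(k >= k_linear) = 1 - Pr(k < k_linear).
p_overlap = hypergeom.sf(k_linear - 1, M, K, m)
print(f"p_overlap = {p_overlap:.2e}")
```

An overlap of 8 features when only about 1.2 would be expected by chance yields a vanishingly small *p*_{overlap}, which is why thresholds like 0.001 cleanly separate matching from non-matching profiles.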

We constructed networks from the linear models as described in Methods, and calculated their in-degree and out-degree distributions. Comparing these to the distributions of the Poisson networks, we found a closer correspondence for the UoI models than for the baseline models in most cases (Fig. A, panel b). Specifically, the in-degrees of the UoI models had correlations of 0.742 (V1) and 0.969 (M1) with the Poisson in-degrees, compared to 0.717 (V1) and 0.924 (M1) for the baseline procedures. Similarly, we obtained out-degree correlations of 0.806 (UoI) and 0.531 (baseline) for V1, and 0.918 (UoI) and 0.924 (baseline) for M1. Lastly, we found that the UoI linear networks were more modular than the baseline linear networks; interestingly, both were more modular than their Poisson counterparts (Fig. A, panel c). Taken together, these results imply that a more precise inference framework better preserves structure across model types.
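The degree comparison can be sketched as follows, with random adjacency matrices standing in for the fitted Poisson and linear networks (the network sizes and densities are our own assumptions):

```python
# Sketch of the degree-distribution comparison: build two directed
# networks and correlate their in- and out-degree sequences, as in
# the correlations reported above. Entry (i, j) = 1 means unit j
# couples into unit i.
import numpy as np

rng = np.random.default_rng(3)
n = 50
poisson_net = (rng.random((n, n)) < 0.10).astype(int)
np.fill_diagonal(poisson_net, 0)

# "Linear" network: the Poisson network with a few extra edges added,
# mimicking imperfect agreement between the two model types.
extra = (rng.random((n, n)) < 0.02).astype(int)
np.fill_diagonal(extra, 0)
linear_net = np.clip(poisson_net + extra, 0, 1)

in_corr = np.corrcoef(poisson_net.sum(axis=1), linear_net.sum(axis=1))[0, 1]
out_corr = np.corrcoef(poisson_net.sum(axis=0), linear_net.sum(axis=0))[0, 1]
print(f"in-degree r = {in_corr:.3f}, out-degree r = {out_corr:.3f}")
```

The reported correlations (e.g., 0.969 for the M1 in-degrees) are exactly this Pearson correlation between matched degree sequences.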

## B Supplementary Tables

### B.1 Functional coupling dataset summary and fitted model statistics

Dataset summary (Table B.1a) provides details on the datasets used to fit functional coupling models, including the number of units and samples across brain region and recording session. The following tables provide statistics summarizing aspects of the fitted baseline and UoI models, including selection ratio (Table B.1b), predictive performance (Table B.1c), and Bayesian information criterion (Table B.1d).

### B.2 Encoding model dataset summary and fitted model statistics

Dataset summary (Table B.2a) provides details on the datasets used to fit encoding models, including the number of units and samples across dataset. The following tables provide statistics summarizing aspects of the fitted baseline and UoI models, including selection ratio (Table B.2b), predictive performance (Table B.2c), and Bayesian information criterion (Table B.2d).

### B.3 Decoding model dataset summary and fitted model statistics

Dataset summary (Table B.3a) provides details on the datasets used to fit decoding models, including the number of units and samples across dataset. The following tables provide statistics summarizing aspects of the fitted baseline and UoI models, including selection ratio (Table B.3b), and predictive performance (Table B.3c).

## C Frequency response area analysis for tuned and non-tuned electrodes

## Acknowledgments

We thank the members of the Neural Systems and Data Science lab for helpful feedback and discussion. P.S.S. was supported by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program. J.A.L. and K.E.B. were supported through the Lawrence Berkeley National Laboratory-internal LDRD “Deep Learning for Science” led by Prabhat. B.G. and J.D.B. were supported by NIH grants R01MH101697, R01NS078435 and R01DA045783, and the University of California, San Francisco.

## References

- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
- 16.
- 17.
- 18.
- 19.
- 20.
- 21.
- 22.
- 23.
- 24.
- 25.
- 26.
- 27.
- 28.
- 29.
- 30.
- 31.
- 32.
- 33.
- 34.
- 35.
- 36.
- 37.
- 38.
- 39.
- 40.
- 41.
- 42.
- 43.
- 44.
- 45.
- 46.
- 47.
- 48.
- 49.
- 50.
- 51.
- 52.
- 53.
- 54.
- 55.
- 56.
- 57.
- 58.
- 59.
- 60.
- 61.
- 62.
- 63.
- 64.
- 65.
- 66.
- 67.
- 68.
- 69.
- 70.
- 71.
- 72.
- 73.
- 74.
- 75.
- 76.
- 77.
- 78.
- 79.
- 80.
- 81.
- 82.
- 83.
- 84.
- 85.
- 86.
- 87.
- 88.
- 89.
- 90.
- 91.
- 92.
- 93.
- 94.
- 95.
- 96.
- 97.
- 98.
- 99.
- 100.
- 101.
- 102.
- 103.
- 104.
- 105.
- 106.
- 107.
- 108.
- 109.
- 110.
- 111.
- 112.
- 113.
- 114.
- 115.
- 116.
- 117.
- 118.
- 119.
- 120.