Bypassing spike sorting: Density-based decoding using spike localization from dense multielectrode probes

Neural decoding and its applications to brain-computer interfaces (BCIs) are essential for understanding the association between neural activity and behavior. A prerequisite for many decoding approaches is spike sorting, the assignment of action potentials (spikes) to individual neurons. Current spike sorting algorithms, however, can be inaccurate and do not properly model uncertainty of spike assignments, therefore discarding information that could potentially improve decoding performance. Recent advances in high-density probes (e.g., Neuropixels) and computational methods now allow for extracting a rich set of spike features from unsorted data; these features can in turn be used to directly decode behavioral correlates. To this end, we propose a spike sorting-free decoding method that directly models the distribution of extracted spike features using a mixture of Gaussians (MoG) encoding the uncertainty of spike assignments, without aiming to solve the spike clustering problem explicitly. We allow the mixing proportion of the MoG to change over time in response to the behavior and develop variational inference methods to fit the resulting model and to perform decoding. We benchmark our method with an extensive suite of recordings from different animals and probe geometries, demonstrating that our proposed decoder can consistently outperform current methods based on thresholding (i.e., multi-unit activity) and spike sorting. Open source code is available at https://github.com/yzhang511/density_decoding.

In the Methods section of our paper, we describe the general encoding-decoding paradigm. In this section of the supplementary material, we delve into a specific case that focuses on decoding continuous behaviors, y_k ∈ R^T, that exhibit temporal variations. We introduce the use of ADVI, which allows us to model the relationship between the firing rates of MoG components and behavioral correlates using a generalized linear model (GLM).

Notation (continued from Table 2):
ADVI:
b_c: "baseline firing rate" of component c
β_ct: "behavior-modulated firing rate" of component c at time bin t
η_{b_c} = (µ_{b_c}, σ²_{b_c}): mean and variance of the variational posterior distribution for b_c
η_{β_ct} = (µ_{β_ct}, σ²_{β_ct}): mean and variance of the variational posterior distribution for β_ct
η_{y_tk} = (µ_{y_tk}, σ²_{y_tk}): mean and variance of the variational posterior distribution for y_tk
CAVI:
λ_ct1, λ_ct0: firing rates of component c at time bin t that switch between two behavioral states
ϕ: variational parameter that represents the probability of y_k = 1
ρ_ictk: unnormalized posterior probability of assigning s_itk to component c
r_ictk: normalized posterior probability of assigning s_itk to component c
y*_k1, y*_k0: unnormalized posterior probabilities of y_k = 1 and y_k = 0
ν_k: normalized posterior probability of y_k = 1

ADVI-based encoder
Building upon the general model specification outlined in Equations 1-3, we can describe the generative model of the encoder as follows, where b and β are the unknown model parameters corresponding to θ in Equation 1 and are sampled from a standard normal prior distribution. The latent spike assignment z depends on the mixing proportion π of the MoG, which is influenced by the behavior y through the firing rate λ. Intuitively, we can interpret λ_ctk as the "firing rate" of component c at time t in trial k, while b_c and β_c describe the "baseline firing rate" and "behavior-modulated firing rate" of component c, respectively.
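Since the display equations for the generative model (Equations 6-8) are not reproduced here, the NumPy sketch below illustrates one plausible reading of the process, assuming an exponential GLM link, λ_ctk = exp(b_c + β_c y_tk), and a normalized mixing proportion π. The dimensions, the `simulate_trial` helper, and the isotropic component covariances are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(0)

C, T, K = 3, 5, 2          # components, time bins, trials (illustrative)
D = 2                      # spike feature dimension

# Unknown encoder parameters with standard normal priors.
b = rng.normal(size=C)          # "baseline firing rate" per component
beta = rng.normal(size=C)       # "behavior-modulated firing rate" per component

# MoG component parameters (eta_c = (mu_c, Sigma_c) in Table 2).
mu = rng.normal(size=(C, D))
sigma = 0.1 * np.ones(C)        # isotropic covariances for simplicity

def simulate_trial(y_k, n_spikes=50):
    """Simulate spike features for one trial given behavior y_k (length T)."""
    feats, assignments = [], []
    for t in range(T):
        # Behavior-modulated rate and normalized mixing proportion (assumed link).
        lam = np.exp(b + beta * y_k[t])
        pi = lam / lam.sum()
        z = rng.choice(C, size=n_spikes, p=pi)        # latent spike assignments
        s = mu[z] + sigma[z, None] * rng.normal(size=(n_spikes, D))
        feats.append(s)
        assignments.append(z)
    return feats, assignments

y = rng.normal(size=T)          # continuous behavior for one trial
feats, zs = simulate_trial(y)
```

Each time bin draws spike assignments from a behavior-dependent categorical distribution, then draws the spike feature from the assigned Gaussian component.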
To learn the latent variables z, b, and β, we posit a mean-field Gaussian variational approximation for the posterior distribution p(z, b, β | s, y). Obtaining exact updates for b and β is challenging due to the normalization term for π in Equation 7; we therefore employ ADVI to maximize the ELBO and use stochastic gradient ascent to update b and β. ADVI requires that the model be differentiable with respect to its parameters; with a normal prior, the latent variables b and β reside in the real coordinate space and pose no differentiability issues, and their posteriors are approximated by Gaussian variational factors N(b_c; η_{b_c}) and N(β_ct; η_{β_ct}). A drawback of the parameterization in Equations 6-8 is that the spike assignment variables z are discrete and hence not compatible with ADVI. An alternative, equivalent parameterization that addresses this problem is to marginalize over z; the ELBO for the encoder is then computed under the marginalized model.
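The marginalized objective can be made concrete with a small sketch: with z summed out, log p(s | y) = log Σ_c π_c N(s; µ_c, Σ_c) is smooth in b and β and can be optimized by stochastic gradient ascent. The exponential-link mixing proportions are an assumption carried over from the generative-model sketch, and `marginal_log_lik` is a hypothetical helper, not the paper's code.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def marginal_log_lik(s, y_t, b, beta, mu, cov):
    """log p(s | y) with z marginalized out: log sum_c pi_c N(s; mu_c, Sigma_c).

    s: (N, D) spike features in one time bin; y_t: scalar behavior.
    Marginalizing the discrete z keeps the objective differentiable in b, beta.
    """
    C = len(b)
    log_lam = b + beta * y_t                  # log "firing rates" (assumed link)
    log_pi = log_lam - logsumexp(log_lam)     # normalized mixing proportions
    comp = np.stack(
        [multivariate_normal.logpdf(s, mu[c], cov[c]) for c in range(C)], axis=1)
    return logsumexp(log_pi[None, :] + comp, axis=1).sum()
```

The log-sum-exp form avoids underflow when component densities are small, which is typical for high-amplitude outlier spikes.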

ADVI-based decoder
The decoder adopts the same generative model as the encoder described in Equations 6-8, with two exceptions: 1) The latent variable y_tk is assumed to have a standard normal prior, i.e., y_tk ∼ N(y_tk; 0, 1), with independence across time steps t; alternatively, a Gaussian process prior can be chosen to capture temporal correlations between time steps. 2) The parameters b and β are learned by the ADVI-based encoder and kept fixed in the decoder.
We posit a mean-field Gaussian variational approximation for the posterior distribution p(z, y | s), with Gaussian factors q(y_tk) = N(y_tk; η_{y_tk}). To enable the use of ADVI, we again marginalize out the discrete latent variable z, thereby transforming the MoG model into a differentiable form; the ELBO of the decoder then follows from the marginalized MoG parameterization in Equations 12-14.

Decoding using coordinate ascent variational inference (CAVI)
We present a specific scenario for decoding binary variables, y k ∈ {0, 1}, where we derive exact updates for the variational variables using CAVI.

CAVI-based encoder
Extending the general model described in Equations 1-3, the generative model of the encoder can be defined as follows, where z and λ are the latent variables that we aim to learn. The behavior-dependent firing rates of each component c at time t vary based on the binary variable y, such that the components switch between two behavioral states characterized by firing rates λ_ct1 and λ_ct0.
The log-likelihood of the encoder can be written accordingly. To approximate the posterior p(z | s, y), we employ the mean-field variational approximation q(z) = ∏_{c,t} q(z_ct) and maximize the resulting ELBO. The exact update for q(z) is obtained by maximizing the ELBO with respect to q(z), where q_{-ct} denotes ∏_{c'≠c, t'≠t} q(z_{c't'}). Here, ρ_ictk denotes the unnormalized posterior probability of assigning spike i collected at time t in trial k to component c, while E[z_ictk] = ρ_ictk / Σ_{c'} ρ_{ic'tk} := r_ictk represents the normalized posterior probability.
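A minimal sketch of the responsibility computation, assuming ρ_ictk is proportional to the component's state-dependent firing rate times its Gaussian density at the spike feature (the exact form of Equation 27 is not reproduced here, so this is an illustrative reading, and `responsibilities` is a hypothetical helper):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def responsibilities(s, lam_t, mu, cov):
    """Normalized posterior assignment probabilities r_ictk.

    rho_ictk is taken to be proportional to the component's firing rate times
    its Gaussian density at the spike feature; normalizing over components
    yields r_ictk.
    s: (N, D) spikes in one bin; lam_t: (C,) rates for the active behavior state.
    """
    C = len(lam_t)
    log_rho = np.log(lam_t)[None, :] + np.stack(
        [multivariate_normal.logpdf(s, mu[c], cov[c]) for c in range(C)], axis=1)
    # Normalize in log space for numerical stability.
    return np.exp(log_rho - logsumexp(log_rho, axis=1, keepdims=True))
```

With identical components and equal rates, every spike receives a uniform assignment, which is also the highest-entropy case discussed in the Model interpretation section.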
After fixing q(z), the term in the ELBO that depends on λ, µ, and Σ can be maximized in closed form. The full coordinate-ascent procedure is:

Algorithm 1 CAVI-based encoder
while ELBO not converged do
    for all k ∈ 1 : K do
        for all t ∈ 1 : T do
            for all i ∈ 1 : n_tk do
                Set q(z_ictk = 1) ∝ ρ_ictk. ▷ Eq. (27)
            end for
        end for
    end for
    for all c ∈ 1 : C do
        for all t ∈ 1 : T do
            Update λ_ct1, λ_ct0, µ_c, Σ_c. ▷ Eqs. (32-33)
        end for
    end for
    Compute the ELBO L_enc^CAVI.
end while
We derive the updates for λ by setting the gradients ∇_{λ_ct1} L and ∇_{λ_ct0} L to 0. Considering the gradient with respect to the η_c parameter yields closed-form updates for µ_c and Σ_c.
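Under standard mean-field MoG calculations, such updates are responsibility-weighted averages. The sketch below shows plausible closed forms, assuming the exact Equations 32-39 take this familiar shape: expected per-component spike counts for λ_ct1 and λ_ct0, and responsibility-weighted means and covariances for η_c; `m_step` is a hypothetical helper.

```python
import numpy as np

def m_step(feats, resp, y):
    """Illustrative closed-form updates for lambda, mu, Sigma.

    feats[k][t]: (N_tk, D) spike features; resp[k][t]: (N_tk, C) responsibilities;
    y: (K,) binary behavior. lambda_ct1 (lambda_ct0) is the expected number of
    spikes assigned to component c at bin t, averaged over trials with y_k = 1 (0).
    """
    K, T = len(feats), len(feats[0])
    C = resp[0][0].shape[1]
    D = feats[0][0].shape[1]
    lam1, lam0 = np.zeros((C, T)), np.zeros((C, T))
    num, den = np.zeros((C, D)), np.zeros(C)
    S = np.zeros((C, D, D))
    for k in range(K):
        for t in range(T):
            counts = resp[k][t].sum(axis=0)          # expected counts per component
            (lam1 if y[k] == 1 else lam0)[:, t] += counts
            num += resp[k][t].T @ feats[k][t]        # responsibility-weighted sums
            den += counts
    lam1 /= max((y == 1).sum(), 1)
    lam0 /= max((y == 0).sum(), 1)
    mu = num / den[:, None]
    for k in range(K):                               # weighted covariance pass
        for t in range(T):
            for c in range(C):
                d = feats[k][t] - mu[c]
                S[c] += (resp[k][t][:, c, None] * d).T @ d
    cov = S / den[:, None, None]
    return lam1, lam0, mu, cov
```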

CAVI-based decoder
The CAVI-based decoder employs the same generative model as the CAVI-based encoder, except that the behavior-dependent firing rates λ_ct1 and λ_ct0 are learned by the encoder and kept fixed, and the behavior y is treated as an unknown latent variable. We sample y from a prior distribution, y_k ∼ Bernoulli(ϕ), where ϕ is a variational parameter that represents the probability that y_k = 1. The log-likelihood can be expressed accordingly. We use the factorization q(z, y) = q(z)q(y) = ∏_{c,t} q(z_ct) q(y) to approximate the posterior distribution p(z, y | s) and maximize the resulting ELBO.

Algorithm 2 CAVI-based decoder
Input: spike features s; encoder-learned firing rates λ_ct1, λ_ct0
while ELBO not converged do
    for all k ∈ 1 : K do
        Set q(y_k = 1) ∝ y*_k1. ▷ Eq. (44)
        for all t ∈ 1 : T do
            for all i ∈ 1 : n_tk do
                Set q(z_ictk = 1) ∝ ρ_ictk. ▷ Eq. (43)
            end for
        end for
    end for
    Update ϕ. ▷ Eq. (46)
    for all c ∈ 1 : C do
        Update µ_c, Σ_c. ▷ Eqs. (36-39)
    end for
    Compute the ELBO L_dec^CAVI.
end while
The exact updates for q(z) and q(y) that guarantee an increase in the ELBO are given in Equations 43 and 44, where ρ_ictk and E[z_ictk] = ρ_ictk / Σ_{c'} ρ_{ic'tk} := r_ictk are the unnormalized and normalized posterior probabilities of assigning spike i collected at time t in trial k to component c, respectively. The unnormalized posterior probabilities of y_k = 1 and y_k = 0 are y*_k1 and y*_k0, and E[y_k] = y*_k1 / (y*_k1 + y*_k0) := ν_k represents the normalized posterior probability of y_k = 1.
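As a hedged sketch of the q(y) update: if each component's spike counts are treated as Poisson with rate λ_ct1 or λ_ct0, the unnormalized log-posterior for each state combines the Bernoulli prior with the expected log-likelihood under that state. This Poisson assumption and the `update_q_y` helper are illustrative, not the paper's exact Equation 44.

```python
import numpy as np

def update_q_y(phi, lam1, lam0, resp):
    """Illustrative CAVI update for q(y_k), returning nu_k = q(y_k = 1).

    Assumes a Poisson-style contribution per component and time bin: the
    unnormalized log-posterior for y_k = 1 sums r_ictk * log(lambda_ct1)
    minus lambda_ct1, plus the Bernoulli prior log(phi).
    lam1, lam0: (C, T) state-dependent rates; resp: list over t of (N_t, C).
    """
    T = lam1.shape[1]
    log1, log0 = np.log(phi), np.log(1.0 - phi)
    for t in range(T):
        log1 += (resp[t] * np.log(lam1[:, t])).sum() - lam1[:, t].sum()
        log0 += (resp[t] * np.log(lam0[:, t])).sum() - lam0[:, t].sum()
    m = max(log1, log0)                       # normalize in log space
    w1, w0 = np.exp(log1 - m), np.exp(log0 - m)
    return w1 / (w1 + w0)
```

When the two states are indistinguishable (λ_ct1 = λ_ct0) and the prior is flat, the posterior stays at 1/2, as expected.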
After fixing q(z) and q(y), the term in the ELBO that depends on ϕ, µ, and Σ can be written in closed form. Setting the gradient of the ELBO with respect to ϕ to zero yields its update; the updates for η_c are computed in a similar manner as described in Equations 36-39.

MoG initialization
We employ the following procedure to initialize the MoG model used in both the ADVI-based and CAVI-based algorithms: 1) As shown in Figure 1 (a), the spike feature distribution is highly multimodal. To determine the appropriate number of modes, we use isosplit (Magland et al. 2015) to cluster the spike features, splitting them into distinct clusters. 2) For each identified cluster, we compute the mean and covariance of the spike features belonging to that cluster, which serve as the parameters of the corresponding Gaussian component. This automatic selection of the number of MoG components and initialization of the means and covariance matrices initializes both the ADVI-based and CAVI-based algorithms.
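The initialization can be sketched as follows. Since isosplit is not a standard dependency, SciPy's k-means with a fixed number of components stands in for it here purely for illustration; `init_mog` is a hypothetical helper.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def init_mog(features, n_components):
    """Initialize MoG means/covariances from a clustering of spike features.

    The paper uses isosplit to pick the number of modes automatically; here
    k-means with a fixed n_components is an illustrative substitute.
    features: (N, D) spike features.
    """
    _, labels = kmeans2(features, n_components, minit='++', seed=0)
    mus, covs = [], []
    for c in range(n_components):
        pts = features[labels == c]
        mus.append(pts.mean(axis=0))
        # Small diagonal term regularizes near-degenerate clusters.
        covs.append(np.cov(pts.T) + 1e-6 * np.eye(features.shape[1]))
    return np.stack(mus), np.stack(covs)
```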

Data preprocessing
We provide a brief overview of our data preprocessing pipeline, which involves the following steps.
Destriping. During data collection, we encounter line noise due to voltage leakage on the probe, which manifests as large "stripes" of noise spanning the whole probe. To mitigate the impact of these noise artifacts, we apply a destriping procedure (Chapuis et al. 2022).

Behavior decoder
To decode binary behaviors, such as the mouse's left or right choices, we utilize L2-penalized logistic regression. For decoding dynamic behaviors, such as wheel speed, we employ a sliding-window algorithm to aggregate the entries of the weight matrix, W_kct, over time. Within the time window [t − δ, t + δ], where δ is the window size, we stack the 2δ weight matrix entries, {W_kct}_{c=1}^{C}, around time point t in trial k. This aggregated weight matrix is then used as input to ridge regression to predict the behavior y_tk at time t. The window size δ and the regularization strength are model hyperparameters, set through cross-validation to achieve optimal decoding performance.
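A minimal sketch of the sliding-window aggregation and the ridge fit. Here the window covers the bins in [t − δ, t + δ] with zero padding at trial boundaries; the function names and the closed-form ridge solution are illustrative, not the paper's implementation.

```python
import numpy as np

def sliding_window_features(W_k, delta):
    """Stack weight-matrix columns in [t - delta, t + delta] for each time bin.

    W_k: (C, T) weight matrix for one trial; returns a (T, C * (2*delta + 1))
    design matrix, padding with zeros at the trial boundaries.
    """
    C, T = W_k.shape
    padded = np.pad(W_k, ((0, 0), (delta, delta)))
    return np.stack([padded[:, t:t + 2 * delta + 1].ravel() for t in range(T)])

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression: (X^T X + alpha I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)
```

In practice δ and α would both be chosen by cross-validation, as described above.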

Model interpretation
In this section, we provide visualizations to gain insight into the effectiveness of our proposed decoder. We quantify the posterior entropy of each spike assignment in Figure 7 (a). Spike assignments with higher entropy correspond to posterior probability spread among multiple MoG components; in contrast, traditional spike sorting or thresholding methods produce deterministic spike assignments, leading to lower entropy and, empirically, reduced decoding performance. In Figure 7 (b), we compare the trial-averaged weight matrices (W) used for decoding by the spike-sorted, spike-thresholded, and proposed decoders.
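The posterior entropy in Figure 7 (a) is the standard Shannon entropy of each spike's assignment distribution; a minimal sketch (the helper name is illustrative):

```python
import numpy as np

def assignment_entropy(q):
    """Entropy of each spike's posterior assignment q(z_ikct) over MoG components.

    q: (N, C) rows of posterior probabilities; returns (N,) entropies in nats.
    Deterministic (one-hot) assignments give 0; a uniform spread gives log(C).
    """
    q = np.clip(q, 1e-12, 1.0)     # guard log(0) for deterministic assignments
    return -(q * np.log(q)).sum(axis=1)
```

Under this measure, the deterministic assignments produced by spike sorting or thresholding sit at zero entropy, while the soft MoG assignments of ambiguous low-amplitude spikes approach log(C).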

Decoding across brain regions
In addition to the five brain regions (PO, LP, DG, CA1, VISa) depicted in Figure 3, we expanded our analysis to include two additional brain regions situated in the cerebellum: the arbor vitae (ARB) and the ansiform cruciform lobule (ANCR). We specifically include the cerebellum in our analysis because spike sorting issues occur frequently in this area. In Figures 8 and 9, we present a comparison between our decoder and the spike-sorted decoder that utilizes all KS units across all the brain regions studied. According to Figure 8, our method consistently outperforms the decoder that relies on all KS units across all brain regions and the majority of IBL sessions. Furthermore, Figure 9 specifically demonstrates that our method consistently achieves superior decoding performance in the recorded regions of the cerebellum, where spike sorting quality issues are commonly encountered. This highlights the robustness and reliability of our method, particularly in challenging recording conditions.
Figure 9: Decoding comparisons across brain regions in the cerebellum. We evaluate our decoder against the spike-sorted decoder utilizing all KS units across various brain regions in the cerebellum. The scatter plot visualizes the results for each IBL session in the study, with all sessions depicted in purple indicating "bad" sorting quality. Note that the probes used in the cerebellum sessions were not uniformly implanted in the same set of brain regions; as a result, the number of available sessions for decoding in certain cerebellum regions is limited. Notably, the majority of sessions are positioned above the gray diagonal line, indicating that our method consistently achieves better decoding performance than the decoder relying on all KS units. This outcome highlights the robustness of our method, particularly in the context of the cerebellum, where spike sorting quality issues are prevalent.
Figure 10: Visualizations of spike features employed for decoding. Spike localization features (x, z) are the locations of spikes along the width and depth of the NP1 probe; the waveform feature a is the maximum peak-to-peak (max ptp) amplitude of each spike, measured in standard units (s.u.). u1 and u2 denote the first and second principal components (PCs) of the spike waveforms.
Effects of inclusion criteria for spike waveforms. To investigate whether the density-based decoder performs better by using additional spikes that a spike sorter would miss, we conducted an experiment using different inclusion criteria for spike waveforms. We fitted our model using only spikes detected by Kilosort 2.5 and compared its performance on choice and motion energy decoding to decoders using spike-sorted outputs and our subtraction-based spike detection. The results are summarized in Table 5. As shown in the table, our decoder achieves comparable or better decoding performance than the spike-sorted decoder when modeling the same spikes, suggesting that the gain in decoding performance can be attributed to the density-based approach rather than the spike inclusion criteria.
Application to tetrodes. We utilized the code provided in the GitHub repository * of the clusterless point process decoder (Denovellis et al. 2021) to generate simulated neural and behavioral data. This synthetic dataset was designed to mimic recordings from 5 tetrodes, each containing 4 channels. Spike amplitudes from each channel of these tetrodes were selected as the spike features for both decoding methods. The objective of the decoding was to estimate the animal's position from these simulated spike features. As the original simulated position was too straightforward to decode, we intentionally distorted it by blending it with real position data sourced from the GitHub repository † , and we introduced random Gaussian noise to the simulated position for added complexity.
Application to HD probes. We evaluated both decoders on decoding wheel speed and motion energy from spike features extracted from NP1 probes across three IBL datasets. To preprocess the data, we followed the pipeline outlined above, extracting spike localization features and maximum peak-to-peak amplitudes as common features for both decoders.
We focus only on continuous behaviors for decoding, since the clusterless point process decoder is designed exclusively for continuous behaviors. Notably, due to its continuous-time nature, the clusterless point process decoder directly decoded behaviors without time binning. In contrast, our density-based model required time binning of both behavioral and spike feature data into equal-sized time intervals; consequently, we used the time-binned behaviors for decoding with the density-based approach. We employed 5-fold cross-validation to assess the decoding performance of both decoders. We used a random walk as the state transition for the clusterless point process decoder, with its variance set from the estimated variance of the animal's behavior.
Simulations. We conducted simulations to illustrate the principles of our method. The simulation aimed to show that our encoding model can learn the relationship between spike features and behaviors. We performed two tasks: decoding a binary variable, y_k, simulated from a Bernoulli distribution, and decoding a continuous variable, y_k, simulated from a Gaussian process. To mimic the data-generating process, we selected Gaussian components with "templates" extracted from a real dataset. The encoding model parameters, b and β, were also taken from parameters learned on the same dataset. Given b, β, and y_k, we simulated the "firing rates" λ for each Gaussian component in the mixture, as described in the Methods section of our paper. Next, we generated spike features based on these simulated "firing rates" and applied the encoding model to infer the behavior-dependent λ.

Relationship to spike sorting
We conducted experiments to investigate the biological interpretation of our MoG units and the correspondence between single cells identified by KS and our MoG units. The agreement matrix between "hard" KS spike assignments and "soft" MoG assignments is shown in Figure 12. We calculated the conditional probability of spikes belonging to each MoG component, given that these spikes belong to the corresponding KS unit. Notably, KS units with large amplitudes are less likely to be split into multiple Gaussian components, showing a reasonable correspondence between the Gaussian components and the spike-sorted units.
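The agreement matrix of Figure 12 can be sketched as follows: for each KS unit, average the soft MoG assignment probabilities of its spikes, giving the conditional probability of each MoG component given that unit. The helper name is illustrative.

```python
import numpy as np

def agreement_matrix(ks_labels, soft_assign, n_units):
    """Conditional probability of MoG components given KS units.

    For each KS unit u, averages the soft MoG assignment probabilities of the
    spikes that KS assigned to u, giving row u of P(component | unit).
    ks_labels: (N,) integer KS unit per spike; soft_assign: (N, C).
    """
    C = soft_assign.shape[1]
    A = np.zeros((n_units, C))
    for u in range(n_units):
        mask = ks_labels == u
        if mask.any():                      # leave empty units as zero rows
            A[u] = soft_assign[mask].mean(axis=0)
    return A
```

Since each row of `soft_assign` sums to one, each populated row of the agreement matrix is itself a probability distribution over MoG components; a KS unit whose row concentrates on one component corresponds one-to-one with that component.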
Subtraction-based spike detection and denoising. We employ the iterative subtraction-based procedure for spike detection and collision-correction described in Boussard et al. 2023.
Spike localization. We employ the method of Boussard et al. 2021 to estimate the location of each denoised spike.

Figure 6 :
Figure 6: Motion drift in "good" and "bad" sorting recordings. (a) The motion-registered spike raster plot of a "good" sorting example that is less affected by drift. (b) The spike raster plot of a "bad" sorting example, which is still affected by drift even after registration.

Figure 7 :
Figure 7: Model interpretation. (a) The posterior entropy of the spike assignment is high when the posterior probability of spike assignment, q(z_ikct), is spread out among several MoG components instead of being concentrated at a single component. The scatter plot shows that low-amplitude spikes that are difficult to assign have higher posterior entropy than high-amplitude spikes. (b) Visualizations of the weight matrices W averaged across trials in an example IBL session. For "W (spike-sorted and spike-thresholded)", the W matrix has one-hot rows, and each entry W_kct is the spike count belonging to KS unit (channel) c and time bin t in trial k. The purple crosses indicate the "good" KS units. For "W (ours)", W_kct is the sum of posterior probabilities of spike assignments, and the MoG mixing proportion π depends on the behavior y and changes over time. The arrangement of KS units (channels or MoG components) on the heat maps is based on their depth along the NP probe, ensuring comparability across the displayed W matrices.

Figure 8 :
Figure 8: Decoding comparisons across brain regions in the thalamus, hippocampus, and visual cortex. We compare our decoder to the spike-sorted decoder using all KS units across various brain regions and behavioral tasks. Each point in the scatter plot represents one session from the 20 IBL sessions previously described in the experiments section. Sessions with "good" sorting quality are depicted in green, while sessions with "bad" sorting quality are shown in purple. The majority of the sessions lie above the gray diagonal line, indicating that our method consistently outperforms the decoder relying on all KS units.

Figure 11 :
Figure 11: Our encoding model recovers the relationship between the simulated spiking activity and the simulated behavioral correlate. Panel (a) compares the simulated firing rates conditioned on the binary behavior variable with the firing rates learned by our encoding model. Panel (b) compares the simulated firing rates conditioned on the continuous behavior variable with the learned firing rates from our encoding model.

Figure 12 :
Figure 12: Correspondence between Kilosort and MoG spike assignments. Units are ordered by their depth on the Neuropixels probe. The color bar shows the conditional probability of spikes belonging to each MoG component, given that these spikes belong to the corresponding KS unit. The mean amplitude of each KS unit is shown at the bottom.
Table 2 :
Table of notation (excerpt).
z_itk: MoG assignment of spike feature i at time bin t in trial k
η_c = (µ_c, Σ_c): mean and covariance matrix of MoG component c
λ_ctk: behavior-dependent firing rate of component c at time bin t in trial k
b_c: "baseline firing rate" of component c

Table 5 :
Effects of inclusion criteria for spike waveforms on decoding performance.

Comparison to a state-of-the-art clusterless decoder
In this section, we outline the specific experimental setup for evaluating our density-based decoder in comparison with the clusterless point process decoder (Denovellis et al. 2021) on both multiple tetrodes and high-density (HD) probes data.
The clusterless point process decoder uses a grid-based approximation of the inferred behavior, discretizing the behavior space into place bins; see Section "Decoding" in Denovellis et al. 2021. We determine the place bin size based on the square root of the observed behavior variance. Denovellis et al. 2021 use kernel density estimation (KDE) to estimate the distributions of both the behavior variable and the spike features used for decoding. We set the KDE bandwidth, which determines the amount of smoothing applied to the spike features, to 1.0, and the bandwidth for the behavior to the square root of the observed behavior variance; see Section "Encoding - clusterless" in Denovellis et al. 2021.
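The place-bin construction described above, with bin size equal to the square root of the observed behavior variance, can be sketched as follows (the helper name is illustrative, not part of the clusterless decoder's API):

```python
import numpy as np

def make_place_bins(position):
    """Discretize behavior into place bins for the grid-based decoder.

    Bin size is the square root of the observed behavior variance; returns
    bin edges covering the full range of the observed behavior.
    """
    bin_size = np.sqrt(position.var())
    lo, hi = position.min(), position.max()
    n_bins = max(int(np.ceil((hi - lo) / bin_size)), 1)
    return np.linspace(lo, lo + n_bins * bin_size, n_bins + 1)
```

Tying the bin size (and the behavior KDE bandwidth) to the behavior's standard deviation keeps the grid resolution comparable across sessions with different movement ranges.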