## Abstract

Neurobiological investigations of perceptual decision-making have furnished the first glimpse of a flexible cognitive process at the level of single neurons^{1,2}. Neurons in the parietal and prefrontal cortex^{3–6} are thought to represent the accumulation of noisy evidence, acquired over time, leading to a decision. Neural recordings averaged over many decisions have provided support for the deterministic rise in activity to a termination bound^{7}. Critically, it is the unobserved stochastic component that is thought to confer variability in both choice and decision time^{8}. Here, we elucidate this stochastic, diffusion-like signal on individual decisions by recording simultaneously from hundreds of neurons in the lateral intraparietal cortex (LIP). We show that a small subset of these neurons, previously studied singly, represent a combination of deterministic drift and stochastic diffusion—the integral of noisy evidence—during perceptual decision making, and we provide direct support for the hypothesis that this diffusion signal is the quantity responsible for the variability in choice and reaction times. Neuronal state space and decoding analyses, applied to the whole population, also identify the drift diffusion signal. However, we show that the signal relies on the subset of neurons with response fields that overlap the choice targets. This parsimonious observation would escape detection by these powerful methods, absent a clear hypothesis.

## Introduction

Difficult decisions require our brains to acquire many samples of evidence before making a choice. Evidence from psychophysics^{9–11} and neurophysiology^{8} supports the idea that perceptual decisions are explained by a quantity, termed a decision variable (DV), which evolves as the sum of accumulated signal and noise, that is, deterministic drift plus stochastic diffusion. Critically, this formulation predicts that each decision is explained by a unique *drift-diffusion* trajectory that determines the outcome of the decision and the time of its completion. Recordings from single neurons in parietal and prefrontal association cortex in monkeys provide indirect support for this quantity, based on second order statistical properties from many decisions^{12} or successful application of machine learning to predict decision outcome^{13}. However, to date, drift-diffusion has not been observable on single trials, making it impossible to assess whether individual decisions are governed by this process.

Here we provide the first direct evidence for a drift-diffusion process underlying single decisions. We recorded simultaneously from up to 203 neurons in the lateral intraparietal area (LIP) while monkeys made perceptual decisions about moving random dots. Using a variety of dimensionality reduction techniques, we show that a drift-diffusion signal can be detected in such populations on individual trials, and that this signal satisfies the criteria for a decision variable. Notably, the signal of interest is dominated by a subpopulation of neurons whose response fields overlap one of the choice targets, consistent with earlier single neuron studies^{1,7,8,12}.

## Results

Two monkeys made perceptual decisions, reported by an eye movement, about the net direction of dynamic random dot motion (RDM; Fig. 1a). We measured the speed and accuracy of their decisions as a function of motion strength (Fig. 1b). The choice probabilities and the distribution of reaction times (RT) are well described by a bounded drift diffusion model (Fig. 1b,c). The fits of this model also specify the evolving probability distributions of the decision variable. They will guide our interrogation of the neural data.

During the task, we used Neuropixels probes to record simultaneously from 54–203 neurons in area LIP, representing a random and unbiased sample of neurons within the targeted recording location (Fig. 2a). This location, deep in the intraparietal sulcus, was previously inaccessible using existing high density recording devices. We conducted eight sessions in two monkeys (1696–2894 trials per session). In what follows, we highlight one representative session from monkey M.

Our initial analyses focus on the subset of neurons with response fields that overlap the contralateral choice target (termed T_{in}) and show spatially selective persistent activity on an oculomotor delayed response task^{14}, the type of neuron previously selected for study in single-cell recording experiments. Fig. 2b shows the activity of 17 such neurons, averaged across neurons and trials of the same motion coherence in this representative session. The average firing rates from this ensemble replicate earlier findings: ramp-like trajectories during decision formation with buildup rates that depend on the strength and direction of motion, and stereotyped firing rates just preceding saccadic choices for T_{in}. The motion-dependent separation of responses is thought to mark the beginning of decision formation in LIP, ∼200 ms after onset of the RDM. While the buildup is thought to reflect the deterministic component of the accumulation of momentary evidence, the diffusion process is elusive. This is because noise is suppressed through averaging across trials, and for a single neuron, the spikes are too sparse to permit inferences of time-varying rate over the appropriate scale. With a pool of 17 neurons, however, it is now possible to measure the rate on single trials.

Each trace in Fig. 2c shows the average firing rate of the 17 T_{in} neurons,, on single trials. They are redolent of drift-diffusion. The traces contain many alternations in the sign of their derivative (even after smoothing), consistent with an accumulation of independent samples of positive and negative numbers. The traces spread apart by , because the variance of a running sum of independent identically distributed random samples is the sum of the individual variances (*σ*^{2} ∝*t*). This relationship holds for ∼100 ms for the weakest motion strength. As shown below, the attenuation of the spreading at later times is a sign of a termination bound that limits the range of the accumulation. It thus appears that the ramp-like averages in Fig. 2b belie drift-diffusion processes on single trials.
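The linear growth of variance with time can be checked with a short simulation. The sketch below is illustrative only: it sums unit-variance Gaussian samples with no termination bound, and the function names, trial count, and step count are assumptions chosen for the example, not values taken from the recordings.

```python
import random
import statistics


def simulate_paths(n_trials=2000, n_steps=100, drift=0.0, sigma=1.0, seed=0):
    """Return n_trials running sums of i.i.d. Gaussian samples (no bound)."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_trials):
        x, path = 0.0, []
        for _ in range(n_steps):
            x += rng.gauss(drift, sigma)  # one sample of momentary evidence
            path.append(x)
        paths.append(path)
    return paths


def variance_at(paths, step):
    """Across-trial variance of the accumulated value after `step` samples."""
    return statistics.pvariance([p[step - 1] for p in paths])


paths = simulate_paths()
v25, v50, v100 = (variance_at(paths, s) for s in (25, 50, 100))
# Doubling the number of accumulated samples should roughly double the variance.
print(v50 / v25, v100 / v25)
```

With a stopping bound in place, the variance would grow more slowly at later times, which is the attenuation described above.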

We hypothesized that the single trial average firing rate, , approximates the decision variable ^{8}. If so, the diffusion component should explain the variability of choice and reaction time for trials sharing the same direction and motion strength. Specifically, (*i*) a sample of should be predictive of choice and correlate inversely with the RT on trials that result in T_{in} choices; (*ii*) later samples ought to predict choice better and correlate more strongly (negatively) with RT than earlier samples, and (*iii*) later samples should contain most, if not all, of the information present in the earlier samples. As we will show, these predictions are borne out.

As shown in Fig. 2d, a sample of taken near the beginning of decision formation is negatively correlated with the RT on that trial (black line), even though we are only considering trials with long RTs (≥670 ms). The correlation is statistically significant for *t* ≥250 ms, which is only 50 ms after the responses begin to exhibit a dependency on motion strength and direction. The magnitude of the correlation is stronger with later samples, and critically, the later samples contain most of the information in the earlier sample, thereby rendering a reduced partial correlation, given the later sample (dashed line). For example, , whereas 200 ms later , and the partial correlation . The later sample mediates 66% of the earlier sample’s coefficient of determination (*R*^{2}). The same pattern holds for the choice (Fig. 2e). Early samples of exert positive leverage (*β*) on the probability of a T_{in} choice; later samples have more leverage and explain away the leverage of earlier samples from logistic regression (GLM). We compute a simple index of mediation using the change in *β* when the later time point is absent or included in the regression (see Methods). In the example in Fig. 2e, the mediation at *t* = 0.35 s is complete (across-session mean ± sem = 74 ±9%; *p <* 0.001, Fig. 2f, Fig. S1). Thus is an evolving representation that controls the choice and decision time on a single trial—the decision variable (DV), consistent with the hallmarks enumerated above.
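The logic of the mediation test for RT can be illustrated with synthetic data. The sketch below assumes one plausible reading of the index: the fraction of the early sample's *R*^{2} with RT that disappears once the later sample is partialled out. The exact definition is given in the Methods; the function names and the toy data here are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def mediation_r2(early, rt, late):
    """Fraction of the early-sample R^2 with RT that is explained away by `late`."""
    return 1.0 - partial_corr(early, rt, late) ** 2 / pearson(early, rt) ** 2


# Toy accumulation: the late sample contains the early sample plus new noise,
# and RT depends (negatively) on the late sample only.
rng = random.Random(0)
early = [rng.gauss(0, 1) for _ in range(5000)]
late = [e + rng.gauss(0, 1) for e in early]
rt = [-l + rng.gauss(0, 1) for l in late]

r_early, r_late = pearson(early, rt), pearson(late, rt)
med = mediation_r2(early, rt, late)
print(r_early, r_late, med)
```

In this toy case the later sample carries everything the early sample knows about RT, so mediation is nearly complete; partial mediation would arise if the measured signal were only a subset of the neurons representing the DV.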

### The representation of the DV is concentrated

Up to now, our focus has been on the same neurons we would have selected in single-neuron recording experiments, in this case screened *post hoc* for spatially selective persistent activity associated with the contralateral choice-target. We pursued several strategies to assess whether the representation is more broadly distributed among the population of simultaneously recorded neurons. The first is a targeted approach to find the weighting of all 191 neurons that best approximates the bounded drift-diffusion process inferred from behavior, while remaining agnostic about the role of the 17 T_{in} neurons.

We fit the choice and RTs on the session using a bounded drift-diffusion model, structured as a race between two partially anticorrelated drift-diffusion processes (Fig. 1c). The architecture approximates the neural organization: populations of neurons that accumulate evidence for right vs. left and populations that accumulate evidence for left vs. right. Each has an upper stopping bound^{8}. The fits furnish an estimate of the time-varying distribution of the DV associated with each motion strength and direction. The estimates incorporate the fact that trials terminate by reaching a positive bound in either of the competing processes. This gives rise to nonlinear functions of the expectation and variance as a function of time (Fig. 3a). We optimized the weights for each neuron such that the time-varying distribution of the weighted average firing rates best matched the model’s predicted distribution (see Methods). The vector of weights, **d**, defines the direction in the 191-dimensional neuronal state space that best captures the one-dimensional drift-diffusion decision variable. As shown in Fig. 3b, the projection of activity on **d**, *S*_{d}(*t*), exhibits time dependent mean and variance similar to the model predictions.

The theoretical means and variances are also approximated by the 17 T_{in} neurons (Fig. 3c). Indeed, the unweighted average firing rates, (*t*) are similar to *S*_{d}(*t*) on individual trials (Fig. 3d). The histograms in Fig. 3e show correlations for all trials from this session. They show a strong correlation between *S*_{d} and (median ), which is reduced significantly when the T_{in} neurons are excluded from **d** to construct the 174-dimensional **d**^{−} (median , KS-test; Table S1), indicating a large contribution of the T_{in} neurons to **d**. In fact, the population activity projected onto **d** renders spatially selective persistent activity to targets in the same region of the visual field as the T_{in} neural response fields (Fig. 3f; 7 of 8 sessions, Fig. S2). It is therefore unsurprising that early samples of *S*_{d}(*t*) exhibit correlations with choice and RT. Nor is it surprising that a sample of *S*_{d}(0.55) mediates the correlation of earlier samples of *S*_{d}(*t*) (dashed curve). Surprisingly, however, a sample of also mediates earlier samples of *S*_{d}(*t*) (copper curve). We observe this pattern of mediation across measures in all but one session for which a later sample of does not mediate earlier activity of *S*_{d} (Fig. S3).

Clearly, the 17 T_{in} neurons play an important role in rendering the decision. We assume they are but a subset of the larger pool of neurons with response fields that overlap the same choice target—hence the partial mediation. Yet, they do not stand out on the basis of their weighting coefficients (Fig. S7)—not even when the weights are penalised so as to be sparse (L1 regularization). Thus without prior knowledge, we might not have recognized the simple logic of the decision process represented by neurons that inform the saccadic report. Taken together these analyses support the hypothesis that the DV is a one dimensional process^{15} that is largely explained by the 17 T_{in} neurons.

### Data driven approaches

Multineuron recordings present an opportunity to discover unanticipated features of data, beyond the imagination of the scientist and existent hypotheses. Such efforts are aided by a growing set of tools, devised to facilitate discovery of neural computations realized in the neuronal state space. Here we ask whether three data driven methods would discover diffusion and the importance of the T_{in} neurons had we not known about the latter and looked for the former. The answers to these questions may be relevant to many types of experiments that seek to exploit large scale neural recordings.

We first trained a logistic decoder to predict the monkey’s choices using the full population of 191 neurons in the sample session. We used all coherences and allowed a unique set of decoding weights at each time step (similar to Ref. 13; see Methods). As shown in Fig. 4a, the cross-validated accuracy of this *optimal* decoder increased as a function of time from motion onset. To estimate the upper bound of decoding, we performed the same analysis on data simulated with the diffusion model fit to the behavior. The peak performance of this decoder is 72% (Fig. 4a). The population decoder rivals this performance, and remarkably, so does the application of a threshold to *S*_{d}(*t*) and to predict choice. Indeed, removal of the 17 T_{in} neurons from the population reduces decoding performance far in excess of what would be expected by removing a random 17 neurons (average reduction in decoding accuracy of 8% and 1 ± 1%, mean ± sd, respectively).

While the decoder assigns a new set of weights to each time point, Fig. 4b shows that weights derived as early as 300 ms after motion onset (100 ms from the beginning of decision-related activity) perform nearly as well when those same weights are applied at all other times^{16}. This observation is consistent with the characterization of the process as one-dimensional, supported by the mediation analyses (*e*.*g*., Fig. 3g,h). We applied decoder weights derived at *t* = 0.35 s to define a direction in the neuronal state space to produce time dependent signals on single trials. The across-trial means exhibit coherence-dependent drift and the single-trials bear semblance to diffusion. The latter have leverage on choice, by design, and they correlate inversely with RT. Both effects (at *t* = 0.35 s) are mediated by a later sample of the same signal (*self*, RT: 67.4%; choice: 62.8%) and also by (RT: 81.1%; choice: 66.9%). The top row of Fig. 4f shows the mean percent mediation for the eight sessions (see also Fig. S4).

We also applied Principal Components Analysis (PCA) to trial-averaged standardized firing rates, using the epoch from 200 ms to 500 ms from motion onset (as in Fig. 2b). The first three PCs (Fig. 4c) capture 57% of the total variance. We use just the first PC (39% of variance) to render single-trial signals. The trial averaged means show clear dependence on motion strength and direction and single-trial traces are redolent of diffusion. They are positively correlated with (Fig. 4e), and their correlation with behavior is mediated by on most sessions (Fig. 4f, Fig. S5). We obtain similar results using demixed PCA^{17} (Fig. S5).

We conclude that PCA has the potential to reveal drift-diffusion on single trials in most of our experiments. This might seem surprising initially, as the PCs are derived from trial-averaged firing rates, which suppress the diffusion component. However, the PCs assign weight to the neurons that represent the drift, and those averages comprise drift-diffusion on single trials. Nevertheless, like the choice decoder, which also mimics diffusion, PCA (and dPCA) does not expose the simplicity of the organization around the T_{in} neurons (Fig. S7). An inquisitive experimenter could discover this organization by noticing a visual response to the targets and a perisaccadic response associated with T_{in} choices.

Is there any feature of the data that might provide a clue to the data analyst about the importance of the T_{in} neurons? Although it is not obvious on single sessions, pooling the eight sessions reveals that the largest weights of **d**, the choice-decoder and PC1 are assigned to T_{in} neurons (Fig. S7). However, this does not reveal the underlying logic. Inspired by Kiani *et al*.^{18}, we conducted a spectral analysis of the pairwise correlation matrix using spike counts in a short epoch before the onset of RDM (Fig. 4d). Clusters of positively correlated neurons are apparent along the main diagonal. We define Cluster-1 (arrow) as the cluster with the strongest single-trial correlations with . Thus, it is unsurprising that the single-trial traces from Cluster-1 correlate strongly with (Fig. 4e). In the example session, this cluster also stands out as the one with the largest average correlation , but this was not a reliable indicator across sessions. That said, 9 of the 16 neurons in this cluster are T_{in} neurons, and across all sessions, 43 ± 11% of T_{in} neurons are in Cluster-1 (Table S2). Thus, without prior knowledge of the T_{in} neurons, spectral clustering might bring them to our attention. Many are in the clusters with the stronger .

Taken together, we conclude that even without a hypothesis, it is possible to discover drift-diffusion and its status as a decision variable. Absent knowledge of T_{in}, one could recognize that it is represented by a low (one) dimensional subspace of the neuronal state space. While accurate, the characterization would obfuscate the simple biological organization.

## Discussion

We have directly observed the neural representation of a stochastic decision variable, the accumulation of noisy evidence that determines choice and RT on single decisions. The signal is evident in the mean firing rate of small pools of neurons with response fields that overlap a choice target (T_{in} neurons) and also in the weighted sum of the full population of 54–203 sampled neurons, the neuronal state space (NSS), which includes these T_{in} neurons. It has the hallmarks of a decision variable: (*i*) it is correlated with both choice and RT, (*ii*) the magnitude of the correlations increases over time, as the decision is forming, (*iii*) later samples of the signal mediate correlations of earlier samples, (*iv*) its time-varying mean and variance (for each motion strength) are consistent with a drift-diffusion process, and (*v*) although not shown here, the same signals lead to termination of the decision process^{19}. We focused on the early epoch of decision formation because this is where the drift-diffusion signal is least distorted by the stopping bound, trial dropout, and response features associated with saccade initiation^{19}.

The finding validates the conclusions from single-neuron studies, which inferred one-dimensional diffusion dynamics using indirect methods^{12,15}. It also complements the choice-decoding strategy applied in Peixoto et al. (Ref. 13) to population recordings from the prefrontal and motor cortex. Choice decoders, like ours and Peixoto’s, are machine learning tools designed to predict the choice, but they are not the same as the actual DV. Unlike a choice-decoder, a real DV does not represent the probability of a choice but the signal that *determines* the choice and RT. The choice and RT are stochastic because the drift-diffusion signal is the accumulation of noisy evidence. While the stochastic drift-diffusion cannot be detected in trial-averaged firing rates, we can now recognize that the coding of a stochastic quantity conforms to the same principles as the encoding of simpler, deterministic quantities, such as direction selectivity—that is, pooling of spikes from many weakly correlated neurons with similar selectivity^{20,21} (but see Ref. 22). This explains why the sample of T_{in} neurons mediates strongly and also why it does not mediate completely, as these are only a sample of the full population.

A decision variable is just one of many neural representations that underlies a single instance of a mental process. The capacity to record simultaneously from an ever increasing number of neurons opens the possibility of studying the mechanisms underlying such instances. In this spirit we are encouraged by the capacity of purely data driven methods to “discover” diffusion. However, none of the methods we tried would identify the important T_{in} neurons. That is, they do not reveal the spatial organization of the response fields. Based on the experience here, one could imagine a path to their discovery by testing neurons identified by correlation-based clustering or large weights in PC1, say. For example one might notice that the responses of such neurons are correlated with the single trial diffusion signal rendered by PC1, and in addition they respond to both the onset of the choice targets and just preceding contraversive saccades. It would be natural to map the neural response fields and recapitulate the single neuron studies that discovered spatially selective persistent activity^{14}. Thus, while they do not guarantee discovery of the underlying neural organization, data driven approaches may point to observations that lead to discovery through hypothesis-driven inquiry and experiments.

These hypothesis-driven steps are essential, in our view, if we desire an explanation of brain function in the language of biology, in addition to mathematics. The present findings identify the DV with a particular functional cell type in LIP. The ability to observe the idiosyncratic, time-varying brain signal that gives rise to one decision at a moment in time opens a new lens to view neural computations that are only revealed on single instances. For example, the capacity to observe a decision variable on a single trial opens the possibility of elucidating interactions between LIP and other nodes of the network involved in decision making. As we show in the companion paper^{19}, single-trial firing rates in functionally connected T_{in} neurons in LIP and the superior colliculus perform distinct computations, which form and terminate the decision, respectively. One wonders what biological parsimony might hide in other neuronal state spaces.

## Methods

### Ethical approval declarations

Two adult male rhesus monkeys (*Macaca mulatta*) were used in the experiments. All training, surgery, and experimental procedures complied with guidelines from the National Institutes of Health and were approved by the Institutional Animal Care and Use Committee at Columbia University. A head post and recording chamber were implanted using aseptic surgical procedures.

### Behavioral tasks

The monkeys were trained to interact with visual stimuli presented on a CRT video monitor (Vision Master 1451, iiyama; viewing distance 57 cm; frame rate 75 Hz). They were trained to control their gaze and make saccadic eye movements to peripheral targets to receive a liquid reward (juice). The direction of gaze was monitored by an infrared camera (EyeLink 1000; SR Research, Ottawa, Canada; 1 kHz sampling rate). The tasks involve stages separated by random delays, distributed as truncated exponential distributions

*f*(*t*) = (*α*/*λ*) exp(−(*t* − *t*_{min})/*λ*) for *t*_{min} ≤ *t* ≤ *t*_{max}, and *f*(*t*) = 0 otherwise,

where *t*_{min} and *t*_{max} define the range, *λ* is the time constant (in seconds), and *α* is chosen to ensure the total probability is unity. Below, we report the range and the exponential parameter *λ*. Note that because of truncation, the expectation 𝔼(*t*) < *t*_{min} + *λ*.
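To make the delay distribution concrete, the sketch below draws truncated-exponential delays by inverting the cumulative distribution and compares the sample mean with the analytic expectation. It assumes the density is an exponential with time constant *λ* restricted to [*t*_{min}, *t*_{max}] and renormalized; the function names are hypothetical and this is not the task-control code.

```python
import math
import random


def sample_truncated_exponential(t_min, t_max, lam, rng):
    """Draw one delay by inverse-CDF sampling of the truncated exponential."""
    u = rng.random()  # uniform in [0, 1)
    span = 1.0 - math.exp(-(t_max - t_min) / lam)  # mass kept after truncation
    return t_min - lam * math.log(1.0 - u * span)


def truncated_exponential_mean(t_min, t_max, lam):
    """Analytic expectation; always smaller than t_min + lam."""
    width = t_max - t_min
    return t_min + lam - width / math.expm1(width / lam)


# Example: the wait from target onset to RDM onset (0.25-0.7 s, lambda = 0.15).
rng = random.Random(0)
delays = [sample_truncated_exponential(0.25, 0.7, 0.15, rng) for _ in range(20000)]
sample_mean = sum(delays) / len(delays)
analytic_mean = truncated_exponential_mean(0.25, 0.7, 0.15)
print(sample_mean, analytic_mean)
```

The analytic mean falls below *t*_{min} + *λ* because the truncation removes the long tail of the exponential.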

In the *random dot motion* (RDM) task, monkeys are trained to make decisions about the net direction of motion in a dynamic random dot motion display. The RDM is confined to a circular aperture (diameter 5 dva; degrees visual angle) centered on the fixation point (dot density 16.7 dots·dva^{−2}·s^{−1}). The task flow is shown in Fig. 1a. The random wait from onset of the choice targets to onset of the RDM is (0.25–0.7 s, *λ* =0.15). The direction (bottom left or top right) and strength of motion are determined randomly from ± {0, 3.2, 6.4, 12.6, 25.6, 51.2} % coherence (coh). The sign of the coherence indicates direction (positive for leftward). The values control the probability that a dot plotted on frame *n* will be displaced by Δ*x* on frame *n* + 3 (Δ*t* =40 ms), as opposed to randomly replaced. The displacement is consistent with velocity 5 dva s^{−1} (see Ref. 7 for additional details). The monkey is rewarded for choosing the correct target (trials with 0% coh were rewarded randomly). Errors are punished by extending the intertrial interval by up to 3 s (see Ref. 19 for additional details). On approximately half of the trials, a 100 ms pulse of weak motion (± 4% coh) was added to the RDM stimulus at a random time (0.1–0.8 s, *λ* =0.4) relative to RDM onset (similar to Ref. 23). Monkey M performed 9684 trials (5 sessions); monkey J performed 8142 trials (3 sessions). The data are also analyzed in a companion paper^{19}.

In the *oculomotor delayed response* task (ODR;^{14,24}), one target was flashed briefly (200 ms) at a pseudorandom location in the visual field. After a variable delay (0.4–1.1 s, *λ* =0.3 for monkey M, 0.5–1.5 s, *λ* =0.2 for monkey J), the fixation point was extinguished and the monkey made a saccade to the remembered location of the target ±2.5 dva to receive a reward. This task was conducted to provide a rough characterization of the neural response fields during the visual, perisaccadic and delay epochs. It was also used to assess the stability of the recording over the session. Neurons were designated T_{in} if they exhibited spatially selective persistent activity at the location of the response target in the visual hemifield contralateral to the recorded hemisphere. The example session contained 17 T_{in} neurons, and we refer to the unweighted mean firing rate as . These analyses were conducted *post hoc*, after spike sorting.

### Behavioral analyses

We fit a variant of the drift-diffusion model (Fig. 1c) to the choice-RT data from each session. Details of the model and the fitting method are described in Ref. 25. The model constructs the decision process as a race between two accumulators: one accumulating evidence for the T_{in} choice and against the T_{out} choice (*e*.*g*., left minus right) and one accumulating evidence for a T_{out} choice and against a T_{in} choice (*e*.*g*., right minus left). The decision (T_{in} or T_{out}) is determined by the accumulator that first exceeds its positive decision bound, at which point the decision is terminated. The races are negatively correlated with one another, owing to the common source of noisy evidence. We assume they share half the variance, , but the results are robust to a wide range of reasonable values. The decision bounds are allowed to collapse linearly as a function of time, such that
*B*(*t*) = *B*_{0} − *α* *t*, where *B*_{0} is the initial bound height and *α* is the rate of collapse.
We used the method of images to compute the probability density of the accumulated evidence (*x*) for each accumulator as a function of time (*t*) using a time-step of 1 ms. We assumed that *x* = 0 at *t* = 0. The decision time distributions rendered by the model were convolved with a Gaussian non-decision time distribution, which summarizes sensory and motor delays, to generate the predicted RT distributions. The model has 6 parameters in total: *κ, B*_{0}, *α, µ*_{nd}, *σ*_{nd}, and *C*_{0}, where *κ* determines the scaling of motion strength to drift rate, *C*_{0} implements bias in units of signed coherence^{26}, *µ*_{nd} is the mean non-decision time and *σ*_{nd} is its standard deviation.
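The structure of the race can be illustrated with a Monte Carlo simulation. This is a sketch under stated assumptions, not the fitting procedure, which uses the method of images: the parameter values are arbitrary, the noise correlation `rho` is a placeholder, the bound is assumed to collapse as *B*_{0} − *α* *t* (floored at zero), and coherence is signed so that positive values favor T_{in}.

```python
import math
import random


def simulate_race_trial(coh, rng, kappa=10.0, b0=1.0, alpha=0.3, rho=-0.7,
                        mu_nd=0.3, sigma_nd=0.05, c0=0.0, dt=0.001, t_max=5.0):
    """Simulate one trial; return (chose_t_in, reaction_time_s)."""
    drift = kappa * (coh + c0)  # drift rate scales with (bias-shifted) coherence
    x_in = x_out = 0.0          # both accumulators start at zero
    sd = math.sqrt(dt)          # unit-variance diffusion per second
    t = 0.0
    while t < t_max:
        t += dt
        bound = max(b0 - alpha * t, 0.0)  # assumed linear collapse
        e1 = rng.gauss(0.0, 1.0)
        # Second noise source, correlated with the first by rho.
        e2 = rho * e1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        x_in += drift * dt + sd * e1
        x_out += -drift * dt + sd * e2
        if x_in >= bound or x_out >= bound:
            break
    # The accumulator with the larger value is the one that crossed (or leads).
    chose_t_in = x_in >= x_out
    rt = t + rng.gauss(mu_nd, sigma_nd)  # add Gaussian non-decision time
    return chose_t_in, rt


def simulate_condition(coh, n_trials=300, seed=0):
    """Return (proportion of T_in choices, mean RT) for one coherence."""
    rng = random.Random(seed)
    trials = [simulate_race_trial(coh, rng) for _ in range(n_trials)]
    p_in = sum(c for c, _ in trials) / n_trials
    mean_rt = sum(r for _, r in trials) / n_trials
    return p_in, mean_rt


p_strong, rt_strong = simulate_condition(0.256, seed=1)
p_zero, rt_zero = simulate_condition(0.0, seed=2)
print(p_strong, rt_strong, p_zero, rt_zero)
```

Stronger motion should yield more T_{in} choices and shorter RTs, the qualitative pattern the model is fit to reproduce.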

For analytic tractability using the method of images, the model has no lower bound. This leads to a distortion of the mean and variance for the strongly negative coherences (given that spike rates cannot go negative). Therefore, for plotting purposes only (Fig. 3a), we implemented a simple version of a lower bound (at *− B*_{0}). That is, having obtained the distribution of DV for all time points, any density below the lower bound was placed at the lower bound before we calculated the means and variances across time.

### Neurophysiology

We used prototype “alpha” version Neuropixels1.0-NHP45 probes (IMEC/HHMI-Janelia) to record the activity of multiple isolated single-units from the ventral subdivision of area LIP (LIP_{v} ^{27}). We used anatomical MRI to identify LIP_{v} and confirmed its physiological hallmarks with single-neuron recordings (Thomas Recording GmbH) before proceeding to multi-neuron recordings. Neuropixels1.0-NHP45 enable recording from 384 out of 4416 total electrical contacts distributed along the 45 mm long shank. All data presented here were recorded using the 384 contacts closest to the tip of the probe (Bank 0), spanning 3.84 mm. Reference and ground signals were directly connected to each other, and connected to the monkey’s headpost. A total of 1084 neurons were recorded over eight sessions (54–203 neurons per session; Table 1).

The Neuropixels1.0-NHP45 probe is connected to a standard commercially-available headstage for the Neuropixels1.0 probes, connected via the standard Neuropixels1.0 5m cable to the PCI eXtensions for Instrumentation (PXIe) hardware (PXIe-1071 chassis and PXI-6141 and PXIe-8381 I/O modules, National Instruments). Raw data were acquired using the SpikeGLX software (http://billkarsh.github.io/SpikeGLX/), and single-units were identified offline using the Kilosort 2.0 algorithm^{28,29}, followed by manual curation using Phy (https://github.com/cortexlab/phy). The spike times were then synchronized with task events acquired by the experimental control system (Rex^{30}) and OmniPlex (Plexon Inc.).

### Neural data analysis

The spike times from each neuron are represented as delta functions of discrete time (1 kHz) and convolved with a 50 ms boxcar filter to achieve a coarse representation of firing rate as a function of time, *s*_{ij}(*t*), on each trial *i* and each neuron *j*. We used an 80 ms boxcar filter to calculate correlation between neuron firing rate and behavior (Fig. 2d–f) because it led to more robust estimates of summary statistics (*e*.*g*., %-mediation for different choice of early and late time points).
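As an illustration of the smoothing step, the sketch below converts a list of spike times into a 1 kHz spike train and applies a boxcar filter to obtain a firing rate in spikes per second. It assumes a window roughly centered on each time point and zero padding at the edges; neither detail is specified above, and the function name is hypothetical.

```python
def boxcar_rate(spike_times_ms, n_ms, width_ms=50):
    """Firing rate (spikes/s) at 1 ms resolution using a boxcar of width_ms."""
    # Represent spikes as counts in 1 ms bins (delta functions at 1 kHz).
    train = [0] * n_ms
    for t in spike_times_ms:
        if 0 <= t < n_ms:
            train[int(t)] += 1
    # A cumulative sum lets each window be evaluated in constant time.
    cumsum = [0]
    for v in train:
        cumsum.append(cumsum[-1] + v)
    half = width_ms // 2
    rate = []
    for i in range(n_ms):
        lo = max(0, i - half)
        hi = min(n_ms, i - half + width_ms)
        count = cumsum[hi] - cumsum[lo]
        rate.append(count * 1000.0 / width_ms)  # spikes per window -> spikes/s
    return rate


rate = boxcar_rate([100, 110, 300], n_ms=500, width_ms=50)
print(rate[105], rate[200], rate[300])
```

Passing `width_ms=80` gives the wider filter used for the correlations with behavior.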

For analyses that use standardized signals (derivation of **d**, PCA, and choice-decoding), we expressed the rates as z-scores based on the mean and standard deviation of the firing rate in a window centered 100 ms after motion onset (75–125 ms). In some sessions, this led to the exclusion of a small number of very low firing-rate neurons that did not show any activity on any trial in the normalization window (Table 1). Those neurons were assigned zero weight and do not contribute to the degrees of freedom in statistical tests. In the example session, this eliminated 5 neurons, thereby reducing the effective number of neurons from 191 to 186. The same uniformly-weighted 50 ms filter was used for data visualization with the exception of examples of single-trial activity (Fig. 2c, Fig. 3d, Fig. 4b–d) for which the raw data were convolved with a truncated Gaussian filter generated using the MATLAB function, *gausswin* (width = 80 ms, width-factor = 1.5, *σ ≈* 26 ms).

The expected mean and variance of the decision variable as a function of time and coherence (*µ*_{c}(*t*) and *σ*^{2}_{c}(*t*), respectively) were furnished by the fits of the behavioral model (see above). The model parameters govern the probability density of the DV for each coherence and time, and *µ*_{c}(*t*) and *σ*^{2}_{c}(*t*) were computed using the density unabsorbed by the decision bounds. We then found the direction in neural state space (defined by **d**) for which the projections onto it, *S*_{d}(*t*), best fit *µ*_{c}(*t*) and *σ*^{2}_{c}(*t*). Goodness of fit was quantified using the K-L divergence between the predicted and fitted distributions of the DV. For simplicity, we calculated K-L divergence under the assumption that these distributions are Gaussian:

$$D_{\mathrm{KL}}(p \,\|\, q) = \log\frac{\sigma_q}{\sigma_p} + \frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{2\sigma_q^2} - \frac{1}{2}$$

where *p* is the DV distribution given by the behavioral model fit and *q* is the DV distribution given by *S*_{d}. The weights comprising **d** were optimized using a quasi-Newton search algorithm (*fminunc*, MATLAB). The weights were applied to the normalized activity of each neuron and the fits were evaluated within the epoch 200–400 ms after motion onset. Because it is always the case that *µ*_{p}(0) = 0 and *σ*^{2}_{p}(0) = 0, we enforced that the same be true for *µ*_{q}(0) and *σ*^{2}_{q}(0) by subtracting *S*_{d}(0) from *S*_{d}(*t*) on each trial. We initially encouraged sparsity in the weight vector by adding a penalty term scaled by *λ* to the cost function (similar to L1 regularization), where *λ* was optimized through cross-validation. As results were robust to changes in *λ*, final weight vectors were fit without a regularization term. To determine to what degree the representation of this diffusion process was driven by T_{in} neuron activity, we removed all T_{in} neurons from the population and then refit the weight vector. We refer to the projection of the data onto this NSS direction as *S*_{d}^{−}. We refer to the projection of the population response on **d** as *S*_{d} and use *S*_{d}(*t*) to render single trial firing rates (in arbitrary units, owing to standardization).
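The K-L divergence between two univariate Gaussians has a standard closed form, which makes the goodness-of-fit term cheap to evaluate inside an optimizer. The sketch below assumes the direction D_KL(*p* ‖ *q*), with *p* the model and *q* the neural projection, and assumes that the cost is the sum of the divergence over coherences and time points; neither detail is stated in the text, and the function names are hypothetical.

```python
import math


def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """D_KL(p || q) for univariate Gaussians, parameterized by mean and variance."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q
                  - 1.0)


def fit_cost(model_moments, neural_moments):
    """Sum of K-L divergences over (coherence, time) samples.

    Each argument is a list of (mean, variance) pairs in matching order.
    Assumption: summation is the aggregation rule. Variances must be
    positive, which holds within the 200-400 ms fitting epoch but not at
    t = 0, where both variances are zero by construction.
    """
    return sum(gaussian_kl(mp, vp, mq, vq)
               for (mp, vp), (mq, vq) in zip(model_moments, neural_moments))


# Identical distributions cost nothing; a unit mean shift at unit variance costs 0.5.
same = gaussian_kl(0.0, 1.0, 0.0, 1.0)
shifted = gaussian_kl(0.0, 1.0, 1.0, 1.0)
```

Because the divergence is asymmetric, swapping *p* and *q* gives a different cost, so the assumed direction affects the fitted weights.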

#### Spatial selectivity of **d**

To create the heat map in Fig. 3f we used the mean of *S*_{d}(*t*) during the memory-delay period of the ODR task (300–600 ms after target onset). Equivalent maps for all sessions capturing the visual response to the target (100–300 ms after target onset), the delay period activity and activity just before the saccade (250 to 50 ms before the saccade) are shown in Fig. S2.

#### Choice decoder

For each experimental session, we trained a linear choice decoder on 50% of trials in sliding 50 ms windows between 100 and 500 ms after motion onset. Decoder accuracy was cross-validated using the activity of held-out trials at the same time point. We additionally computed a logistic regression of and activity against choice at each time point and compared their cross-validated choice prediction accuracy against the performance of the choice decoder and the theoretical peak decoding performance determined through simulations (Fig. 4a). To evaluate whether the optimal choice decoder represented a one-dimensional signal, we tested whether decoding accuracy was stable using the same weights over time^{16}. Decoder weights generated for one time point were therefore applied to the held-out neural data at all other time points to recompute the decoding accuracy. With such time-independence established (Fig. 4b, left), a decoder dimension was generated by applying the decoder weights fit to the activity at *t* = 0.35 s after motion onset to all trials and time points.

#### Principal Components Analysis

We performed PCA on the normalized within-coherence mean activity from 200 ms after motion onset to the earlier of two events: 500 ms after motion onset or median RT minus 100 ms. Visual inspection of the principal components (PCs) indicated that this typically yielded one or a small number of PCs that represented an evidence-dependent rise in activity after motion-stimulus onset. To determine which PC was most likely to reflect decision-related activity on single trials, we computed the correlations of projections of the neural data onto the first ten PCs with . In seven of the eight experimental sessions, the first PC yielded the greatest correlation coefficient, highlighting that not only the average but also trial-to-trial variations in resemble the main signal embedded in LIP activity in this task. Neural data projected onto this PC were then evaluated against our criteria for a decision variable (see below).

#### Spectral clustering

We used spectral clustering to group neurons into four clusters, using the correlation matrix between neuron pairs to construct the similarity matrix. For each neuron and trial, we counted the number of spikes in a time window between *−*200 and 0 ms relative to motion onset. We then calculated the Pearson correlation coefficient (*ρ*_{ij}), using these spike counts, for every pair of neurons *i* and *j*. The similarity (affinity) matrix was defined as *ρ*_{ij} *−* min(*ρ*), where min(*ρ*) is the smallest element of the correlation matrix. *k*-means clustering with 4 components was used to cluster the eigenvectors of the (random-walk normalized) Laplacian matrix^{31}. We again computed the correlations between the mean activity in each cluster with to determine which cluster was the most likely to yield diffusion dynamics. The mean activity of the most highly correlated cluster was used to generate a spectral clustering dimension (Cluster-1).

#### Demixed principal component analysis (dPCA)

We use dPCA^{17} to reduce the dimensionality of the data while attempting to separate the motion-strength dependent components. For each neuron, we calculated the mean spike-counts across trials within coherence (in 10 ms bins) in the epoch 200–500 ms from motion onset. These signals were z-scored within neuron before we derived 5 dPCs using Matlab code provided by Ref. 17 (http://github.com/machenslab/dPCA). We identified the component (dPC1) that rendered coherence-dependent mean signals (Fig. S8a).

#### Correlation between single-trial activity and behavior

We used measures of correlation between the value of single-trial signals and both the RT and choice on that trial, restricting the analysis to trials with reaction times between 0.67 and 2 s. *Single-trial signal* refers to the projections of the neural population activity onto vectors of weights (*e.g.*, *S*_{d}(*t*)). For T_{in} neurons and Cluster-1 (Fig. 4d) it is the average firing rate of the neural responses as a function of time. We computed the Pearson correlation, *ρ*(*t*), between reaction time and residual neural activity (mean-subtracted within motion coherence in a sliding 80-ms time window) on trials that resulted in a T_{in} choice. Negative correlations indicate that greater neural activity at a given time point was predictive of earlier T_{in} choices. We also computed the partial correlation conditional on the residual at *t* = 0.55 s. The reduction in R^{2} induced by the later sample is the %-mediation, which we limit to the range 0–100%.
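The %-mediation for reaction time can be sketched with the textbook first-order partial correlation. The example assumes that "reduction in R²" means the fractional drop from the squared correlation to the squared partial correlation; the function names and the toy vectors are illustrative only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after conditioning on z (first-order formula)."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def percent_mediation(early, rt, late):
    """Percentage of the early-sample R^2 with RT removed by the later sample.

    Assumption: mediation = 100 * (1 - r_partial^2 / r^2), clipped to 0-100.
    """
    r2 = pearson(early, rt) ** 2
    r2_partial = partial_corr(early, rt, late) ** 2
    return min(100.0, max(0.0, 100.0 * (1.0 - r2_partial / r2)))


# Toy case: early and rt are related only through the later sample.
late = [1, -1, -1, 1]
early = [2, 0, -2, 0]   # late plus a component orthogonal to late
rt = [2, -2, 0, 0]      # late plus a different orthogonal component
full = percent_mediation(early, rt, late)
```

In the toy case the raw correlation between `early` and `rt` is 0.5, yet the partial correlation vanishes once `late` is accounted for, which is the signature of complete mediation.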

Correlation with a T_{in} choice was established using the coefficient, *β*_{1}, in logistic regression (same restriction on RT):
Analogous to partial correlation, we included the later time point and fit
and defined mediation as
As represented here, the mediation of the earlier *S*(*t*) is by a later sample of the same signal (self). We prefer the simplicity of this definition of mediation to others based on *pseudo-R*^{2}. For both types of mediation, we also test whether the earlier *S*(*t*) is mediated by the T_{in} neurons by substituting for *S*(0.55) in Eq. 6.

The box-whisker plots (Figs. 2 to 4) summarize the distribution of %-mediation statistics for *t* = 0.35 s across experimental sessions. To determine statistical significance, we compare the %-mediation of data pooled across sessions against a null distribution generated by permuting, within session and signed coherence, the trials used for the later time point (*N* = 1000).
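The permutation scheme can be sketched as a shuffle of the later-sample values among trials that share a session and signed coherence. Here `statistic` is a hypothetical callable (for example, a %-mediation function with the reaction times bound in), and the group labels are assumed to be hashable keys such as `(session, coherence)` tuples.

```python
import random
from collections import defaultdict


def permute_within_groups(values, groups, rng):
    """Shuffle values among trials sharing a group label (e.g. session, coherence)."""
    index_by_group = defaultdict(list)
    for i, g in enumerate(groups):
        index_by_group[g].append(i)
    out = list(values)
    for idx in index_by_group.values():
        shuffled = idx[:]
        rng.shuffle(shuffled)
        # Each trial receives the value of another trial from the same group.
        for dst, src in zip(idx, shuffled):
            out[dst] = values[src]
    return out


def null_distribution(statistic, early, late, groups, n_perm=1000, seed=0):
    """Statistic recomputed n_perm times with the later sample permuted."""
    rng = random.Random(seed)
    return [statistic(early, permute_within_groups(late, groups, rng))
            for _ in range(n_perm)]


# Usage with a stand-in statistic (a simple dot product).
early = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
late = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
groups = [("s1", 0.1)] * 3 + [("s1", -0.1)] * 3
null = null_distribution(lambda e, l: sum(a * b for a, b in zip(e, l)),
                         early, late, groups, n_perm=100)
```

Shuffling within groups preserves the coherence-dependent mean of the later sample while breaking its trial-by-trial relationship to the earlier one.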

#### Similarity to

To quantify the similarity of the stochastic (diffusion) component of to other single-trial signals, we computed the correlation coefficient between the pair of detrended signals produced on the same trial, from *t* = 0.2 s to *t* = *RT −* 0.1 s. Detrending removes the mean response of trials of the same signed motion coherence. The r-values from each trial are summarized by histograms in Figs. 3 and 4. We used a one-tailed Kolmogorov-Smirnov (KS) test to compare histograms.
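The detrend-then-correlate procedure can be sketched as follows. The example assumes that each trace has already been cut to the interval from 0.2 s to RT − 0.1 s (so traces differ in length), and that the mean at each time point is taken over the trials still running at that time; the latter is an assumption, as are the function names.

```python
import math
from collections import defaultdict


def detrend(traces, coherences):
    """Subtract, at each time point, the mean of trials with the same signed coherence."""
    by_coh = defaultdict(list)
    for trace, c in zip(traces, coherences):
        by_coh[c].append(trace)
    means = {}
    for c, group in by_coh.items():
        longest = max(len(tr) for tr in group)
        # Average only over trials that have not yet ended at sample k.
        means[c] = [sum(tr[k] for tr in group if len(tr) > k)
                    / sum(1 for tr in group if len(tr) > k)
                    for k in range(longest)]
    return [[v - means[c][k] for k, v in enumerate(trace)]
            for trace, c in zip(traces, coherences)]


def pearson(x, y):
    """Pearson correlation; NaN when either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def per_trial_similarity(signal_a, signal_b, coherences):
    """One r-value per trial between the detrended versions of two signals."""
    res_a = detrend(signal_a, coherences)
    res_b = detrend(signal_b, coherences)
    return [pearson(a, b) for a, b in zip(res_a, res_b)]


# A signal and a scaled, offset copy of it share their residuals exactly.
a = [[0.0, 1.0, 2.0, 3.0], [0.0, 2.0, 1.0, 5.0]]
b = [[2 * v + 1 for v in tr] for tr in a]
r_values = per_trial_similarity(a, b, [0.1, 0.1])
```

The resulting list of r-values is what a histogram across trials would summarize.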

## Author contributions

G.M.S. and M.N.S. designed the experiment. N.A.S., G.M.S. and E.M.T. implemented the experimental set-up. G.M.S. collected the data. N.A.S., G.M.S., A.Z., D.M.W. and M.N.S. analyzed the data. All authors wrote the manuscript.

## Additional information

Supplementary Information is available for this paper. Correspondence and requests for materials should be addressed to shadlen{at}columbia.edu.

## Acknowledgments

We thank Shushruth for comments on the manuscript, Cornel Duhaney and Brian Madeira for their assistance in the planning and execution of surgeries, animal training and general support, and we thank Columbia University’s ICM for the quality of care they provide for our animals, especially during the pandemic and lockdown. We would further like to thank Tanya Tabachnik and her team at the Zuckerman Institute Advanced Instrumentation Core and Tim Harris, Wei-lung Sun, Jennifer Colonell, and Bill Karsh at HHMI Janelia for their continued support with Neuropixels1.0-NHP45 probe development and testing. This research was supported by the Howard Hughes Medical Institute; an R01 grant from the NIH Brain Initiative (M.N.S., R01NS113113); T32 and F31 grants from the National Eye Institute (G.M.S., T32 EY013933, F31 EY032791); the Grossman Center; and the Brain and Behavior Research Foundation.
