A universal probabilistic spike count model reveals ongoing modulation of neural variability

Neural responses are variable: even under identical experimental conditions, single neuron and population responses typically differ from trial to trial and across time. Recent work has demonstrated that this variability has predictable structure, can be modulated by sensory input and behaviour, and bears critical signatures of the underlying network dynamics and computations. However, current methods for characterising neural variability are primarily geared towards sensory coding in the laboratory: they require trials with repeatable experimental stimuli and behavioural covariates. In addition, they make strong assumptions about the parametric form of variability, rely on assumption-free but data-inefficient histogram-based approaches, or are altogether ill-suited for capturing variability modulation by covariates. Here we present a universal probabilistic spike count model that eliminates these shortcomings. Our method builds on sparse Gaussian processes and can model arbitrary spike count distributions (SCDs) with flexible dependence on observed as well as latent covariates, using scalable variational inference to jointly infer the covariate-to-SCD mappings and latent trajectories in a data efficient way. Without requiring repeatable trials, it can flexibly capture covariate-dependent joint SCDs, and provide interpretable latent causes underlying the statistical dependencies between neurons. We apply the model to recordings from a canonical non-sensory neural population: head direction cells in the mouse. We find that variability in these cells defies a simple parametric relationship with mean spike count as assumed in standard models, its modulation by external covariates can be comparably strong to that of the mean firing rate, and slow low-dimensional latent factors explain away neural correlations. Our approach paves the way to understanding the mechanisms and computations underlying neural variability under naturalistic conditions, beyond the realm of sensory coding with repeatable stimuli.


Introduction
Classical analyses of neural coding are based on mean spike counts or neural firing rates. Indeed, some of the most paradigmatic examples of the neural code were discovered by regressing neural firing rates to particular sensory stimuli [1,2] or behavioural covariates [3,4,5,6] to characterize their tuning properties. However, neural spiking is generally not regular. Recordings from many cortical areas show significantly different activity patterns within and across identical trials [7], despite fixing experimentally controlled variables. This irregularity is also seen in continual neural recordings without trial structure [8]. The resulting variability has classically been characterised as 'Poisson', with a Fano factor (variance to mean ratio) of one [9], but experimental data also often exhibits significantly more [10,8,11,12] and sometimes less [13,14] variability, respectively referred to as over-or underdispersion. Moreover, experimental studies have revealed that neural variability generally depends on stimulus input and behaviour [15,16,17,18], and exhibits structured shared variability ('noise correlations') across neurons even after conditioning on such covariates. Such correlations can have important consequences for decoding information from neural population activity [19,20,21] and reveal key properties of the underlying circuit dynamics [22]. Moreover, theories of neural representations of uncertainty have assigned computational significance to variability as a signature of Bayesian inference [23,24,25,26]. Thus, just as classical tuning curves for firing rates have been crucial for understanding some of the fundamental properties of the neural code, a principled statistical characterisation of neural variability, and its dependence on stimulus and behavioral covariates, is a key step towards understanding the dynamics of neural circuits and the computations they subserve.
The traditional approach to characterising neural variability has been pioneered in sensory areas, and relies on repeatable trial structure with a sufficiently large number of trials using identical stimulus and behavioral correlates [27,15,28]. Variability in this case can be quantified by simple summary statistics of spike counts across trials of the same condition. However, this approach does not readily generalise to more naturalistic conditions where covariates cannot be precisely controlled and repeated in an experiment. This more general setting requires statistical methods that take into account temporal variation of covariates for predicting neural count activity. Generalised Linear Models are a popular choice [29], but they only model the dependence of firing rates on covariates -with changes in variability directly coupled to changes in the rate inherent to Poisson spiking. More complex methods for inferring neural tuning [30,31] and latent structure [32,33,34] similarly use restrictive parametric families for spike count distributions, and thus also cannot model changes in variability that are not 'just' a consequence of changes in mean counts or firing rates. Conversely, statistical models capable of capturing arbitrary single neuron count statistics, such as histogram-based approaches or copulas [35], do not incorporate dependencies on covariates.
Here we unify these separate approaches, resulting in a single framework for jointly inferring neural tuning, single neuron count statistics, neural correlations, and latent structure. Our semi-parametric approach leads to a universal count model for counts ranging from 0 to K, in the sense that we can model arbitrary distributions over the joint count space of size (K + 1) N of N neurons. The trade-off between computational overhead and model expressivity is controlled by hyperparameters, with expressivity upper bounded by the true universal model. Our approach extends the idea of a universal binary count model [36] to a finite range of integer counts, while allowing flexible dependence on observed and latent covariates to model non-stationary neural activity and correlations. The flexibility reduces biases from restrictive assumptions in any of the model components. Scalability is maintained by leveraging sparse Gaussian processes [37] with mini-batching [38,39] to handle the size of modern neural recordings.
We first introduce the universal count model, and then describe how to interpret as well as evaluate model fits. As our model is able to capture arbitrary single neuron statistics, we build on the Kolmogorov-Smirnov test to construct more absolute goodness-of-fit measures. After validating our method on synthetic data that cannot be captured by currently used methods, we apply the model to electrophysiological recordings from two distinct brain regions in mice that show significant tuning to the head direction of the animal [40,41]. We find that (1) neural activity tends to be more regular than common Poisson-like models at higher firing rates, and more irregular at low rates; (2) mean and variance of counts defy a simple parametric relationship imposed by parametric count distribution families; (3) variability modulation by behaviour can be comparable or even exceed that of the mean count or firing rate; (4) a two-dimensional latent trajectory varying on timescales of ∼ 1 s is sufficient to explain away neural correlations but not the non-Poisson nature of single neuron variability. Finally, we discuss related work, limitations and proposed extensions of our model.

Universal count model
Notation Spike count activity of a population of N neurons recorded into T time bins is formally represented as an N -dimensional time series of non-negative integers. Due to biological constraints, the possible spike counts have some finite upper bound K, taken as the highest observed count. We denote probabilities of a spike count distribution (SCD) by a vector π of length K + 1. Additionally, we denote the count activity over neurons n and time steps t by a matrix Y ∈ [0, K] T ×N with elements y tn . The observed X ∈ R T ×Dx and latent Z ∈ R T ×Dz covariates (range depends on topology [42]) similarly consist of elements x td and z tq , respectively. Generally, we use capital versions of quantities to denote multidimensional concatenations. Examples show firing rate and Fano factor tuning curves over observed x and latent z covariates, either jointly (heatmaps) or marginalized (grey curves). The depth of modulation in marginalized tuning curves is used to extract a tuning index (TI) for the chosen subsets of covariates, see Equation 6.

Generative model
The big picture is to model counts Y with dependence on X. For each neuron, our model consists of C Gaussian process (GP) priors, a basis expansion φ : R C → RC, and a linear-softmax mapping where W ∈ R (K+1)×C , b ∈ R K+1 , and the GP covariance functions k cn have hyperparameters θ GP cn . The use of non-parametric GP mappings with point estimates for W and b leads to a semi-parametric model with parameters θ, see details in Appendix D. The overall generative model P θ (Y |X) is depicted schematically in Figure 1. Note the model specifies a prior p(Π|X) over joint SCDs, conceptually similar to Dirichlet priors [36] but allowing non-parametric dependence on X. With latent input Z, our model can flexibly describe multivariate dependencies in joint SCDs as conditional independence across neurons no longer holds when marginalizing over Z. In addition, p(Z) models temporal correlations in the latent states. We use Markovian priors (details in Appendix D) denoting θ pr with generative model parameters θ for compactness. This allows the model to flexibly capture both neural and temporal correlations in Y . To attain scalability, we use sparse GPs [37].
Depending on C and basis functions φ(·), we obtain an approximation to the true universal prior on joint SCDs, with 'universal' referring to the ability to capture any joint SCD over all neurons. Arbitrary single neuron statistics can be captured when C = K with φ(f ) = f , but is computationally expensive when N ×C 1. For capturing all correlations, the model also requires a sufficiently large latent space. One controls the trade-off between model expressiveness and computational overhead through C and φ. Larger expansions φ allow one to model count distributions more expressively with small C, e.g. the element-wise linear-exponential φ(f ) = (f 1 , e f1 , . . . , f C , e f C ) covers a range of distributions including the truncated Poisson with only C = 1 (see subsection B.3).

Stochastic variational inference and learning
For the joint model distribution p θ (Y, Π, Z|X) = P (Y |Π) p θ (Π|X, Z) p θ (Z), with count distributions P (Y |Π), we approximate the posterior by q θ,χ,ϕ (Π, Z|X) of the form with ϕ and χ the variational parameters for latent states and the sparse Gaussian process posterior (Appendix D), respectively. Note that we use a factorized normal q(Z) for Euclidean Z, and a wrapped normal for circular Z based on the framework of reparameterized Lie groups [43,42]. The posterior over count probabilities q θ,χ (Π|X, Z) is defined as mapping the sparse Gaussian process posterior q θ,χ (F |X, Z) through Π(F ) (Equation 1), a deterministic mapping. This is analytically intractable, so in practice it is represented by Monte Carlo samples. An upper bound on the negative log marginal likelihood can be minimized using stochastic variational inference [44] with D denoting observed data. This objective leads to tractable terms (subsection D.1), allowing us to infer the approximate posterior as well as a lower bound of the log marginal likelihood [45,39]. We use Adam [46] for optimization, see details of implementation and model fitting in Appendix D.

Obtaining interpretable spike count statistics from the model
Characterizing spike count distributions From the posterior q(Π|X) 1 , we can compute samples of the posterior of any statistic of spike counts as a function of covariates. Single neuron statistics in particular can be characterized by tuning curves for both mean firing rates and Fano factors (FF) with time bin length ∆. The model also quantifies private neuron variability that cannot be explained away by regressing to shared input (both observed and latent) through P n (y tn |x t ).
To quantify the sensitivity of a some aspect of neuron activity to a set of covariates x * , we define a tuning index (TI) with respect to a count statistic T y (x * ) that is evaluated under the mean posterior count distribution marginalized over all other covariate dimensions complementary to x * . These marginalized distributions are estimated using the input time series, see Appendix E. Resulting marginalized tuning curves used for TIs are depicted in Figure 1.
Generalized Z-scores and noise correlations The deviation of activity from the predicted statistics is commonly quantified through Z-scores [8,47,17], which are computed as (y − y )/ y with y being the mean count in the time bin. If neural activity followed a Poisson distribution, the distribution of Z asymptotically tends to a unit normal when N 1 (Appendix C). We generalize the Z-score using the probability integral transform and the inverse normal CDF Φ −1 (q) which removes the bias away from Gaussianity at low counts and also generalizes to arbitrary count distributions. The dequantization noise leads to continuous q and Z.
With the Z-score, one can completely describe single neuron statistics with respect to the model. Correlations in the neural activity however will cause Z-scores to be correlated. We define generalized lagged correlations r ij (∆) ∈ [−1, 1] and Fisher Z ∈ R that is more convenient for statistical testing which describes spatio-temporal correlations not captured by the model. Noise correlations [48] refer to the case of ∆ = 0, when r ij becomes symmetric. 1 For notational convenience, X denotes both observed and latent covariates here.

Assessing model fit
Our model depends on a hyperparameter C ≤ K that trades off flexibility with computational burden. In practice, one likely captures the neural activity accurately with C well below K and a simple basis expansion as the linear-exponential above or quadratic-exponential φ(f ) = (f 1 , f 2 1 , e f1 , . . . , f 1 f 2 , . . .). This can be quantified by the statistical measures provided below, and allows us to select appropriate hyperparameters to capture the data sufficiently well.
To assess the model fit to neural spike count data, a conventional machine learning approach is to evaluate the expected log likelihood of the posterior predictive distribution on held-out data Y , leading to the cross-validated log-likelihood where we cross-validate over the neuron dimension by using the majority of neurons to infer the latent states q(Z) in the held-out segment of the data, and then evaluate Equation 9 over the remaining neurons. Without latent variables, we simply take the expectation with respect to q(Π|X). However, the cvLL does not reveal how well the data is described by the model in an absolute sense. Likelihood bootstrap methods are possible [28], but become cumbersome for large datasets. To assess whether the neural data is statistically distinguishable from the single neuron statistics predicted by the model, we use the Kolmogorov-Smirnov framework [49] along with and q and Z-scores from Equation 7 with empirical distribution function F (q), for details see Appendix C. This scalar number is positive and does not indicate whether the data is less or more regular than predicted by the model. A useful measure of dispersion is the variance of Z, in particular its logarithm which provides a real number indicating over-and underdispersion for positive and negative signs, respectively. T refers to the number of time steps or Z-score values. Its sampling distribution under Z ∼ N (0, 1) is asymptotically normal, centered around 0 with a variance depending on T (subsection C.3). This extends the notion of over-and underdispersion beyond Poisson reference models [50]. To quantify whether the model has captured noise correlations in the data, we can then compute Z-scores with respect to the mean posterior predictive distribution Correlations that are caused by co-modulation of neurons by low-dimensional factors can be captured with latent states Z inferred from the same data. Intuitively, this can be seen as treating latent states Z as if it was part of observed input or behaviour. Computing Equation 8 should then show a decrease in correlations r, as the Z-scores are whitened under the posterior predictive distribution.

Results
In the following results, we use C = 3 with an elementwise linear-exponential basis expansion as described in subsection 2.1. This empirically provided sufficient model flexibility to capture both the synthetic and real data as can be seen in goodness-of-fit metrics. We use an RBF kernel with Euclidean and cosine distances for Euclidean and angular input dimensions respectively (Appendix D).

Synthetic data
Animals maintain an internal estimate of their head direction in particular circuits of their brain [4,41,51]. Here, we extend simple statistical models of head direction circuits [52] for validating the ability of the universal model to capture complex count statistics, as well as neural correlations through latent structure. The task is to jointly recover the ground truth count likelihoods, their tuning to covariates, and latent trajectories if relevant from activity generated using two synthetic populations. The first population was generated with a parametric heteroscedastic Conway-Maxwell-Poisson (CMP) model [53], which has decoupled mean and variance modulation as well as simultaneously over-and underdispersed activity (Fano factors above and below 1). The second population consists of Poisson neurons tuned to head direction and an additional hidden signal, which gives rise to apparent overdispersion [28] as well as noise correlations when only regressing to observed covariates. For mathematical details, see Appendix E (synthetic populations) and Appendix B (count distributions).
We compare our universal model to the log Cox Gaussian process or Poisson GP model [33] and the heteroscedastic negative binomial GP (NBh) model which places GP priors on both the rate and shape parameter, a non-parametric extension of [53]. The more flexible CMPh model, analogous to NBh, has difficulty in scaling to large datasets due to the series approximation of the partition function (Appendix B). To show the power of GP based approaches, we also compare to a universal model with an artificial neural network (ANN) mapping replacing the GP. For details of the baseline models, see Appendix E. For cross-validation we split the data into 10 roughly equal non-overlapping segments, and validated on 3 chosen segments that were evenly spread out across the data. When a latent space was present, we used 90% of the neurons to infer the latent signal while validating on the remaining neurons, and repeated this for non-overlapping subsets. We rescale the log likelihoods by the ratio of total neurons to neurons in subset and then take the average over all subsets to obtain comparable cross-validation runs to regression. Figure 2A shows that the universal model successfully captures nontrivial count statistics of the heteroscedastic CMP population. Baseline models cannot capture cases where the Fano factor drops below 1, and indeed are outperformed. In addition, we observe that using a Bayesian GP over an ANN mapping in the model leads to a reduction in overfitting, especially in the latent setting where the ANN model fails to recover the ground truth latent signal. Figure 2B shows that the modulated Poisson population activity is seen by a Poisson regression model as overdispersed, indicated by T DS . Our universal model flexibly captures the overdispersed single neuron statistics, independent from noise correlations r ij that are captured when we introduce a Euclidean latent dimension. As expected, the Z-score scatter plots shows whitening under the posterior predictive distribution when the correlations are captured.

Mouse head direction cells
We apply our universal model to a recording of 33 head direction cells in the anterior nucleus of the thalamus (ANT) and the postsubiculum (PoS) of freely moving mice [40,41]. Neural data was binned into 40 ms intervals, see details in Appendix E. Note that observed count statistics differ with bin sizes (Appendix A), which is expected as consecutive bins are not independent. Regression was performed against head direction (HD), angular head velocity (AHV), animal speed and position, and absolute time, which collectively form X D in this model. We used 64 inducing points for regression, and added 8 for every latent dimension added (Appendix D). Cross-validation was performed similarly to the validation experiments, except that we used subsets with 85% of the neurons to infer Z. Figure 3A shows that for regression NBh performs worst, likely due to overfitting, despite containing Poisson as a special case (only approximately reachable in practice though, see Appendix B). Only the universal model captures the training data satisfactorily with respect to confidence bounds for T KS and T DS , although the data remained slightly underdispersed to the model with T DS values slightly skewed to negative. Compared to the Poisson model, the cvLL is only slightly higher for the universal model as the data deviates from Poisson statistics in subtle ways. We see both FF above and below 1 (over-and underdispersed) across the neural firing range in Figure 3B, with quite some neurons crossing 1. Correspondingly, FF-mean correlations coefficients are often negative. Their spread away from ±1 indicates firing rate and FF do not generally satisfy a simple relationship, especially for examples such as cell 27. Furthermore, ANT neurons seem to deviate less from Poisson statistics. From Figure 3C, we note in particular that FFs tend to decrease at the preferred head direction, but rise transiently as the head direction approaches the preferred value. We also see that tuning to speed and time primarily modulates variability rather than firing rates. All of this is impossible to pick up with baseline models, which constrain FF ≥ 1 as well as FF increasing with firing rate (Appendix B). Finally, we see more tuning of the firing rate to position in PoS cells.
When adding latent dimensions, Figure 3D shows a peak in the cvLL at two dimensions, where correspondingly the Fisher Z samples are well described by a unit normal for the first time. Kernel length scales however did not indicate redundant latent subspaces for higher dimensions as expected for automatic relevance determination, possibly due to mixing of latent dimensions. Notice the noise correlation patterns in Figure 3E tend to show positive correlations for similarly tuned neurons roughly around the diagonal of blocks, as expected from ring attractor models [22]. Intrinsic neuron variability, roughly quantified by the average FF, further decreased and thus become even more underdispersed when considering additional tuning to latents, in particular for ANT. In addition, latent signals primarily modulate firing rate as seen from TIs in Figure 3F. When looking at time scales of covariates in Figure 3G (computed as the decay time constant of the autocorrelogram (Appendix E), the latent processes vary on time scales right in the gap of behavioural time scales.
Beside our main contribution of characterising the fine structure of neural variability, our results have another novel element. Using GP-based non-parametric methods, we successfully estimated  the tuning of cells to as many as 8 different covariates (6 observed + 2 latent) in a statistically sound fashion (even previous GP-based approaches considered a maximum of 4 covariates). Specifically, one of our covariates was absolute experimental time to capture non-stationarities in neural tuning. . We also applied the model in a purely latent setting similar to the example in Figure 2A, with the universal model uncovering a latent signal more closely correlated to the head direction compared to baseline models. These additional results are presented in Appendix A.

Discussion
Related work Neural encoding model provide a statistical description of neural count activity, and typically rely on a parametric count likelihood such as the Poisson [33], negative binomial [30,31] or Conway-Maxwell-Poisson distribution [53]. This choice is independent of the empirical count statistics and is often mismatched to the data. Heteroscedastic count models, characterized by input-dependent noise [58], additionally regress the dispersion parameter of count distributions to covariates [59,60]. This has shown improvements in stimulus decoding and more calibrated posterior uncertainties [53]. Copula-based models [35,61] separate marginal distributions of single neurons from the multivariate dependency structure of the population parameterized by the copula family, and thus do not place parametric constraints on single neuron count statistics. The idea of a universal model that can capture arbitrary joint distributions has been explored for binary spike trains [36], using Bayesian non-parametric models to provide regularization and flexibility [36,62]. However, neither approach naturally incorporates modulation of spike count distributions by input covariates.
Our model deals with discrete spike counts ranging from 0 to K in a manner similar to categorical output variables often considered in machine learning. Similar models have been proposed mainly in the context of Gaussian process classification [63, 64], which directly pass Gaussian process function points through a softmax nonlinearity. Our approach instead passes separate Gaussian processes through a linear-softmax mapping to compute count probabilities. Introducing unobserved input variables in a Gaussian process model leads to Gaussian process latent variable models [65, 66]. Such models have recently been applied to neural data to performs dimensionality reduction [33], with extensions to non-Euclidean latent spaces and non-reversible temporal priors [42,67]. Another avenue for future work could consider going completely non-parametric and adding a count dimension to the input space, which is evaluated at counts 0 to K for every time bin. This however increases the number of evaluation points by a factor K + 1. In addition, extending our model with more powerful priors for latent covariates, such as Gaussian process priors [33,67], can improve latent variable analysis, especially at smaller time bins where the temporal prior influence becomes more important. Regularization methods may help to decorrelate inferred trajectories [71,72].

Conclusion and impact
We introduced a universal probabilistic encoding model for neural spike count data. Our model flexibly capture both single neuron count statistics and their modulation by covariates. By adding latent variables, one can additionally capture neural correlations with potentially interpretable unobserved signals underlying the neural activity. We applied our model to mouse head direction cells and found count statistics that cannot be captured with current methods. Neural activity tends to be less variable at higher firing rates, with many cells showing both over-and underdispersion. Fano factors and mean counts generally do not show a simple relation and can even be decoupled, with Fano factor modulation comparable or in some cases even exceeding that of the rate. Finally, we found that a 2D latent trajectory with a timescale of around a second explained away noise correlations in these cells.
Neural variability is usually not considered on the same footing as mean firing rates, with models assigning most computational relevance to rates [73,74]. However, recent work on V1 has started to explore variability as playing a computationally well-defined useful role in the representation of uncertainty [24,25,22,26]. The framework introduced in this paper provides a principled tool for empirically characterising neural variability and its modulations -without the biases inherent in traditional approaches, which would likely miss potentially meaningful patterns in neural activities beyond mean rates. Our model has the potential to reveal new aspects of neural coding, and may find practical applications in designing and improving algorithms for brain-machine interfaces. As progress is made in scaling and applying such technology beyond research environments [75], it becomes increasingly more important to maintain transparency, e.g. through open source code, and to raise awareness of potential ethical issues [76].  ). The universal model outperforms baseline models, and interestingly its RMSE with respect to behavioural head direction is smallest. In fact, comparing to head direction shifted in time with respect to the spike train reveals the latent signal to be most correlated to behaviour ∼ −100 ms in the past. Errors bars in RMSE and cvLL are s.e.m. over cross-validation runs.

A Additional analysis of head direction cells A.1 Temporal bin sizes
In Figure 4B, we can see the sensitivity of the count analysis to bin length. Note when the bin size becomes very small leading to low spike counts, differences in count distributions matter less. In the limit of binary spikes, the count distribution will always be a Bernoulli distribution. In these cases, the latent trajectory dynamics becomes more important as it captures more of the correlations in time that were previously captured by non-Poisson count distributions in larger bins. The different results for different bin lengths is a consequence of consecutive bins being temporally correlated, leading to more extreme fluctuations in activity and thus higher or lower variability. For stationary renewal processes, an analytical treatment of Fano factor dependence on bin length is available [77].
Spike count based approaches are inherently more coarse grained than methods dealing directly with individual spike times based on point processes [1,2,49]. In particular, when studying phenomena at time scales comparable to the interspike intervals, such as theta precession [4,5], spike count binning may average away such effects if the bin size is too large. However, binning does reduce the number of total time points and may be more practical for studying large population activities recorded over long time periods. Point process models [6] are more suited for describing neural data at shorter time scales. However, dependencies of consecutive spike intervals are complicated to model, and often models based on Markovian dependencies called renewal processes are used [2,49]. Recent work with recurrent neural networks [7,8,9] and triangular mappings [10] have improved expressivity of such models.

A.2 Drifting and ATIs
Joint tuning curves can reveal neural representations that are not factorized over a set of covariates. The Bayesian nature of Gaussian processes takes care of undersampled regions that are rife in high dimensional input spaces, which is the setting in this work for studying joint tuning to behaviour.
In addition to behaviour, one can pick up representational drift [54, 55] by regressing against absolute time of the recording. The time regressor needs to be considered carefully as it may confound at time scales of latent trajectories. As long as the time scale in the kernel is much larger than the time scale over which the latent variables vary, we can interpret the temporal drift as a separate process from the latent trajectories. Indeed, by initializing at time scales equal to half the total recording time, these time scales of the Gaussian process kernel remain significantly higher than any behavioural time scale ( Figure 3G). We find most cells cluster at a drift of ≈ 20 • /hr.
The joint AHV-HD plot reveals anticipatory tuning: when animals turn their head, the head direction tuning curves shift in response to head rotations such that cells expected to spike appear to fire earlier than expected. Theoretical studies have shown that this improves temporal decoding, in the sense that the bias-variance trade-off for decoding downstream can be improved with anticipatory tuning [52]. It appears that the head direction population anticipates the future head direction based on current movement statistics, which allows one to reduce the bias introduced with causal decoding. However, the ATI values in Figure 4C are negative, while in the literature they are postive and differ per region. The neural data description files [40] did mention that the zero time frame of behaviour was randomly misaligned to neural spiking data up to 60 ms. Behaviour may be shifted with respect to the neural spike train, indeed in preliminary analyses with shifted spike trains we found values consistent with literature for ATIs when shifiting ≈ 60 ms [52].

A.3 Latent variable analysis of head direction data
In Figure 4D, we see the inferred angular latent signal is closely correlated to the head direction. The linear drift 3.1 ± 1.2 • /hr is smaller but in the same direction as the drift found in the tuning curves in panel C, which cluster around 20 • /hr. The universal model again shows improvement over baseline models, and in particular the latent signal is more correlated to the behavioural head direction. and tentatively identify a delay in the signal represented compared to measured behaviour. Latent trajectory RMSE was computed with 3-fold cross-validation. We align the latent trajectory Equation 61 to the behaviour in the fitting segment, and compute the geodesic RMSE on the held-out validation segment (Appendix E).
At a bin size of 40 ms, the continuity of the trajectory breaks down more often. This shows up as more spread out off-diagonal points in the scatter plot in panel D of Figure 4. Weak prior, Gaussian process priors are more powerful. However, in this work they were not explored due to scalability issues of sampling functions from Gaussian process posteriors. However, recent work [15] has addressed this issue, and specialized techniques for regularly spaced input such as time points [16,17] can be applied when using temporal Gaussian process priors for latent trajectories.

B Parametric count distributions B.1 Poisson distribution
The Poisson count distribution is defined with a mean count λ P Poiss (n|λ) = λ n n! e −λ .
where n ∈ N 0 . It describes a process where discrete events arriving in a time window are all independent of each other. Mathematically, this is consistent with Equation 13 being the limit of a binomial distribution P Bin (n, p) with n → ∞ and p → 0, such that N p = λ.
The Poisson distribution is characterized by the equality of its mean and variance, leading to a Fano factor V[n]/E[n] = 1. Another convenient property is that the sum of two independent Poisson processes is itself a Poisson process with λ = λ 1 + λ 2 . This can be shown directly by considering P (n) = n 0 P (k) P (n − k) or casting it as a limit of the sum of two Bernoulli processes, and follows intuitively from the fact that spike times are independent of each other. As a consequence, for the inhomogeneous case where we have a time-dependent rate λ(t) the count distribution over a longer interval is still Poisson with average T 0 λ(t) dt. Note that this property does generally not hold for non-Poisson distributions, where the count distribution of a sum of counts in separate time windows is not related to the original count distribution in a simple way.

B.2 Non-Poisson count distributions
To account for over-and underdispersed neural activity in real data, i.e. Fano factors above and below 1, other distributions than the Poisson count distribution have been used, and we present common families below.

B.2.1 Zero-inflated Poisson
A common way to introduce overdispersion is to model excess zero counts, which in this context leads to the zero-inflated Poisson (ZIP) process [12]. The count distribution is given by The parameterization leads to E[n] = λ(1 − α) and V[n] = λ(1 − α) + λ 2 α(1 − α) using the law of total variance.

B.2.2 Modulated Poisson distributions
One perspective of non-Poisson distributions is that they arise from noise in the rate parameters λ. Such count processes are referred to as modulated Poisson processes. From a probabilistic point of view, the resulting count distribution is a marginalization with noise parameters θ. A recently proposed flexible spike count model that can give rise to different mean-variance relationships, including decreasing Fano factors at high firing rates similar to what is observed in Figure 4B, builds on this framework [50]. However, the modulated Poisson process can only account for overdispersion with respect to the base Poisson process. Adding noise cannot lead less variability here, and this implies that Fano factors are bounded from below by 1.

B.2.3 Negative binomial
The negative binomial distribution is based on independent Bernoulli trials like the binomial distribution. However, now we count the number of successes before r failures are observed. If we have Bernoulli trials with success probability p, one can obtain the negative binomial distribution with parameterization p = λ r+λ P NB (n|λ, r) = λ n n!
Note this distribution is a specific instance of a modulated Poisson process (Equation 15), with λ ∼ f Gamma (λ; r, λ r ). The parameterization is such that E[n] = λ holds, but V[n] = λ(1 + λ r ) making it overdispersed with respect to a Poisson distribution. In practice, numerical evaluation of the Poisson limit when r = 0 is only approximate due to the numerical precision of the relevant function implementations.

B.2.4 Conway-Maxwell-Poisson
A distribution that handles both over-and underdispersed count distributions is the Conway-Maxwell-Poisson distribution [20] The normalization constant has no closed form expression and must be evaluated numerically It contains the Bernoulli (ν → ∞), Poisson (ν = 1) and geometric (ν → 0) distributions as limiting cases. The notably property is that the CMP distribution provides a smooth transition between these well-known distributions. At integer ν, the The moments of this distribution do not have a closed form expression in general, but can be computed using the partition function through the cumulant generating function K(t) = log E[e tn ] = log Z(λe t , ν) − log Z(λ, ν). The expression for the mean and variance follow to be with approximate expressions [20] which hold well for ν ≈ 1 and λ > 10 ν .

B.3 Linear-softmax count distributions
The count distributions used in this work rely on a linear mapping of the input a combined with a softmax P (n|a; To illustrate the connection of this softmax count distribution used in Equation 1 to Poisson models, consider the distribution specified by the softmax mapping for C = 1 and the element-wise linearexponential described in subsection 2.1, which in this case simply is φ(f ) = (f 1 , e f1 ). This choice contains the truncated Poisson distribution with f as the logarithm of the mean count, corresponding to W j0 = j, W j1 = −1 and b j = 0 with a = φ(f ). Hence for C > 1, our model is a generalization of rate-based models that implicitly assume neurons can be described by a single scalar rate parameter.
The variability in such models is determined by a simple parametric relationship to the rate set by the count distribution, as can be seen for the count distribution families above.

C Neural dispersion and goodness-of-fit quantification C.1 Common variability measures
The traditional Z-score [8,47,23] and Fano factor [28,50] have been used widely in the literature to quantify the variability in neural responses. The two measures are directly related with y denoting spike counts and · the average over the relevant set of trials or time segments of experimental data. Under Poisson spiking statistics, the Fano factor is 1 and the Z-score is distributed as a unit normal variable. Neural data with more or less variability will lead to deviations from this reference for these dispersion measures. Activity more variable than Poisson is called overdispersed, and vice versa for underdispersed activity. These measures are mostly applied to trial-based data [15], but they can also be applied across separate time windows within a given trial or run in continual recordings. In continuous tasks as free animal navigation, the Z-score is often used to quantify variability or dispersion [8,23,12].
Note the normality of Z under Poisson data is only asymptotically true, in the sense that we require the predicted average count y 1. The generalized Z-score in Equation 7 are Gaussian under the true model by design, independent of the spike count magnitudes. However, segments with low expected spike counts around 1 are affected significantly by the dequantization noise, hence the normality in those cases is due to the dequantization rather than model fit.

C.2 Goodness-of-fit measures in the Kolmogorov-Smirnov framework
The measures T KS and T DS introduced in subsection 2.4 provide a statistical goodness-of-fit measures of the model to single neuron count statistics, and is evaluated per neuron. The predictive count distribution of the model is the reference distribution for evaluating Z-scores (Equation 7), and thus allows one to quantify dispersion T DS (Equation 11) and goodness-of-fit T KS (Equation 10) of the data with respect to our predictive model. By using a full predictive model, the Kolmogorov-Smirnov framework is applicable to data beyond repeatable trial structure in the inputs, such as continual recordings of freely moving animals.
For finite N , T KS as defined in Equation 10 has an asymptotic sampling distribution from the Brownian bridge [25]. This statistic can be interpreted as an out-of-distribution score for the observed sample, with significant misfit when T KS is above significance value. Conventional statistics uses hypothesis testing to assess the model fit, with the null hypothesis being that the data is statistically indistinguishable from the predictive model. We can obtain model acceptance regions based on some cutoff significance value of the test statistic under its sampling distribution, often taken to be 5%. An alternative is to assess how close the empirical distribution of the test statistic is to the sampling distribution, which is the expected distribution of the statistic under the predictive model. This can be done with another Kolmogorov-Smirnov test. In this paper, we plot the acceptance regions of T KS and show them compared to baseline models to highlight the model fit improvement on the data it was fit on. T DS was treated similarly as a test statistic for measuring dispersion of the data with respect to the model. We present the asymptotic sampling distribution of T DS below in subsection C.3.

From Equation 22, we can see that our definition of a dispersion measure T DS in Equation 11
is mathematically almost identical to the log Fano factor with generalized Z-scores. By using the generalized Z-scores, we can evaluate dispersion with respect to an arbitrary reference count distribution. However, the role of the two quantities are different. Fano factors are used to provide a measure of the spike count variability, with value 1 placing a reference point at Poisson statistics. On the other hand, T DS is used to quantify whether the observed spike count dispersion is statistically significant compared to variability predicted by the model.
has N s 2 distributed as a χ 2 -distribution with N degrees of freedom.
The moment generating function defined as M (t) = e −tX X is a useful quantity for computing the moments of a distribution p(X). Note that M (n) (0), indicating the n-th derivative with respect to time, gives us (−1) n X n X . When we consider the asymptotic convergence to a normal distribution of the χ 2 -distribution, the distribution of log s 2 has more favourable convergence property as it is less skewed due to the logarithmic transformation [26]. Its moment generating function is which gives rise to the cumulant function From here we can compute the first two cumulants as κ n = K (n) (0) similar to the moment generating function, which are equivalent to the mean and variance of the distribution with ψ(x) = Γ (x) i.e. the first derivative of the Gamma function, and the notation f (x) = df (x)/dx. For values N 20, the following asymptotic expression hold well [26] These properties lead to the construction of the T DS metric in Equation 11. It has an asymptotically normal sampling distribution with mean 0 and variance 2/N − 1, convenient for statistical testing and confidence intervals.  [27]. This is unfavourable for scaling to large or massive datasets. In addition, non-Gaussian likelihoods lead to intractable marginal likelihoods and hence one needs an approximate optimization objective. Stochastic variational inference [44] provides a framework for applying Gaussian process methods using non-Gaussian likelihoods and approximations for scalability. Let us denote the exact GP prior by p(f ), with the vector f the random function points at the input locations X, and the approximate posterior by q(f ). For inference, one needs to be able to (1) efficiently and differentiably sample from q(f ), and (2) efficiently evaluate and differentiate the Kullback-Leibler (KL) divergence between q(f ) and p(f ).
Sparse approximations [37] reduce this to O(M T 2 + M 3 ) computations and O(M 2 ) storage with M inducing points, which effectively aim to summarize the input data with a smaller set of points. Such methods are scalable for large T as long as M T provides sufficient modelling flexibility. The key idea is to extend the function values f with additional function values f u at inducing points. Let us denote inducing points with function values f u at inducing point locations U , which we jointly learn with other variational parameters. The GP kernel evaluated at function point locations is denoted by K XX , and at inducing point locations by K U U . Cross-covariances are denoted by K XU and K U X . The joint variational distribution to the augmented posterior p(f , f u |y) is defined as where the variational distribution q(f u ) = N (m, S). p(f |f u ) is the conditional Gaussian distribution from the generative model. The variational distribution over GP function values q(f ) is simply obtained by marginalizing out f u , which leads to a Gaussian with with sampling for a large input set X problematic. This can be worked around by either diagonalizing q(f u ) or using a combination of basis functions [15]. The reason for the choice in Equation 28 becomes clear when looking at the KL divergence which can be evaluated as long as M is not too large. Due to the cancellation of p(f |f u ), we do not have to deal with the large K XX matrices.
Unlike purely variational approaches, the approximate posterior in Equation 28 amortizes the inference through the learned inducing points, and allows one to obtain a predictive distribution using the approximate posterior evaluated at a new set of inputs X * This property also allows one to apply mini-batching or subsampling to Gaussian processes [38,39].
To accelerate convergence, whitening was used. One performs a change of variables v = L −1 U U f u with L U U L T U U = K U U from the Cholesky decomposition. This transforms p(f u ) into p(v) = N (0, I), and we now directly optimize q(v) = N (m v , S v ) [39]. In practice matrix products with K −1 U U are evaluated using (L U U L T U U ) −1 , which splits up into two triangular matrices that are readily inverted. The whitened representation simplifies Equation 29 as we no longer explicitly compute L −1 U U m and The KL divergence also simplifies as we now have unit normal p(v). To increase the expressivity of multi-output GPs, a separate set of inducing points locations is used for each output dimension (neuron in this work), along with separate kernel hyperparameters as lengthscales for each input and output dimension. This is equivalent to modelling each output dimension by a separate GP, and leads to an overall computational complexity of O(N CT M 2 ) for our model (see section 2 for notation of quantities). A thorough description of a scalable multi-output VSGP framework is given in [33]. We denote the multi-output variational posterior by q(F |X, Z) with inputs X and Z. This variational posterior is used in the variational inference objective Equation 4, more precisely as q(F |X, Z) appearing in Equation 35. One learns the variational parameters consisting of the inducing point locations U as well as the means and covariances of q(F u ).

D.1.2 Generative model and variational inference
The overall generative model Equation 1 as depicted in Figure 1 is with the product of individual count distributions P (Y |Π). The model parameters θ include the GP θ GP and the prior θ pr (hyper)parameters, as well as the softmax mapping weights W n and biases b n . Note that the distribution over count probabilities contains the Gaussian process prior over F . The mapping from F to Π denoted by Π(F ) (Equation 1) is deterministic, and therefore p(Π|F ) is a delta distribution δ(Π − Π(F )).
The exact Bayesian posterior over Π and Z is intractable, hence we use an approximate posterior as defined in Equation 3. The variational parameters ϕ specify the latent variational posterior, while χ consists of inducing point locations X u and the means and covariance matrices of q(U ) for the sparse Gaussian process posterior q(F |X, Z) (Equation 28). The wrapped normal distribution used for circular dimensions in q(Z), i.e. dimensions with z ∈ [0, 2π), takes the form [43] N wrap (z|µ, and was evaluated with a finite cutoff at k = ±5 of the infinite sum. This is an accurate approximation as long as σ 2π. When plotting the standard deviations of the approximate posterior q(Z), we plot σ for both Euclidean as well as circular variables. This is similarly an accurate approximation in the circular case when σ 2π, which was true in practice.

The marginal likelihood in Equation 32
is intractable. Instead, we minimize the negative ELBO or variational free energy loss objective using our approximate posterior (35) which is an upper bound to the negative log marginal likelihood [45,39]. The objective decomposes into a log likelihood expectation term F lik and some regularization terms arising from the model priors F reg . These terms are amenable to Monte Carlo evaluation or quadrature approximation as we show next, and in some cases are even available in closed form.
The variational expectation of the log likelihood can be evaluated using Monte Carlo sampling to obtain unbiased estimates in the general case. As an alternative method, Gauss-Hermite quadratures can provide a deterministic approximation to the expectation with respect to q(F |X, Z) [39] where Π(F ) denotes the transformation from F to count probabilities as in Equation 1. Here we used analogous to Equation 33 for the generative model. This corresponds to a zero variance estimator with a small bias for sufficiently many quadrature points. To allow fast MC sampling, the variational posterior q(F ) is approximated with its diagonal covariance matrix. This removes the correlations between posterior function points at different input values, but allows sampling from very high dimensional distributions (i.e. many time points, large batch sizes). The issue of efficiently sampling from the full posterior has been considered in [15].
The regularization terms can be written as Kullback-Leibler divergences with the ratio of q θ (Π|X D , Z) and p θ,χ (Π|X D , Z) in the first KL divergence equivalent to where we have made the mapping Π(F ) explicit. In practice, the choice of C < K implies this mapping is underparameterized and generally injective for matrices W of rank ≥ C. When we are in the universal limit C = K and W is rank K, the mapping will be bijective. As long as the mapping from F to Π is not many-to-one, we have the continuous random variable transform and this leads to Equation 40 becoming D KL (q(F )||p(F )). Hence, the regularization terms in the loss objective Equation 35 consist of KL divergences that can be computed with analytical expressions in the case when all distributions are Gaussian.

D.2 Latent space priors
We use the Markovian priors as specified in Equation 2. These priors can be specified on different manifolds, in particular we use for Euclidean spaces the linear dynamical system prior In particular, we use diagonal Σ and A to learn factorized latent states. We constrain A ii = a i ∈ (−1, 1) for stability, and we fix Σ ii = σ 2 i = 1/(1 − a 2 i ) to obtain a prior process with stationary variance 1 while optimizing for a i . On the toroidal manifold, we use p(z t+1 |z t ) = N (z t + c, Σ) (44) as due to rotational symmetry A = I. Again, we use diagonal Σ. Both c and Σ ii = σ 2 i are learned as part of the generative model.
When temporally batching input, one has to be careful to retain the continuity in the prior p(Z) with the previous batch (beyond the first batch at the start). This is done by ensuring that the first z t in the batch is the last step in the previous batch, and this will correctly subsample the prior p(Z) defined over the entire input time series. When performing cross-validation with validation segments within the overall input time series, we treat the gap as a discontinuity in the latent trajectory and do not include the latent state right before the validation segment.

D.3 Gaussian process kernel functions
In this work, we used the RBF kernel defined on Euclidean and toroidal manifolds [42]. In particular, this kernel function is given by with rescaled distances for Euclidean and toroidal spaces R and T, respectively. To cover different input dimensions of different topologies, we use product kernels with suitable distances d per input dimension, resulting in sums over dimensions in Equation 45. These distance functions can be used to extend other kernels such as Matérn kernels to non-Euclidean spaces [42].

D.4 Overall algorithm and code
Algorithm 1 Joint latent-regression inference scheme Input spike counts Y D , observed covariates (e.g. behaviour) X D Define neurons N , maximum spike count K, channels C 1: Batch data over temporal dimension while taking into account continuity in the prior p(Z) (see subsection D.2) 2: while not converged and iterations below bound do 3: Adapt learning rates (if annealing) 4: for each batch do Evaluate the basis expansion a = φ(f ) 10: Compute probabilities P (y|a) = softmax(W a + b) for all neurons 11: Compute the loss from the mean of m × k MC samples of the cross-entropy − log P (y D |a) rescaled by ratio of time points in data over batch size 12: Perform the backward pass using automatic differentiation to compute gradients for parameters θ, χ and ϕ 13: Take a gradient step with some optimizer 14: end for 15: end while Additionally, instead of drawing Monte Carlo samples for the Gaussian variational posterior q(F ), we provide the option to compute the Gaussian expectation using Gauss-Hermite quadratures []. This was used to estimate the cvLLs (Equation 9) for models after training, which reduced stochasticity in the cvLL estimate with a negligible bias using 100 quadrature points.
MC sample or quadrature point dimensions are parallelized over in addition to other dimensions like neurons or time, using extra tensor dimensions in modern automatic differentiation libraries. We use PyTorch [36] to implement the algorithm for inference of our model. We use Adam [46] as our optimizer, with no weight decay and default optimizer hyperparameters in PyTorch.
The code provided contains a library with implementations of Gaussian process and GLM based models with different likelihoods as used for baseline models in this paper. In addition to count likelihoods, it contains an implementation of spike-spike and spike-history couplings [29,39] and modulated renewal processes [2,49] to deal with data at the individual spike time level. All models can be run with both observed and latent inputs on Euclidean and toroidal manifolds [42]. The first input dimension had its inducing points uniformly spaced between 0 and 2π for circular dimensions, and −1 to 1 for Euclidean latent dimensions. Observed dimensions had natural intervals defined by the behavioural statistics (e.g. 0 to the mean animal speed), and we placed inducing points uniformly throughout this interval. For the other dimensions, we initialized random inducing point locations based on the topology of the input variable. We place Euclidean variables as a random uniform distribution in its corresponding interval as described previously, while circular variables took on random uniform values in [0, 2π].
The number of inducing points has been shown to scale favourably as O((log T ) D ) for standard Gaussian process regression models [40]. In this work, we used O(D log T ) which captured rich tuning and satisfactory model fits combined with the flexible count distributions. The suggested O((log T ) D ) does become computationally expensive for high dimensional input, and was not tried with the high-dimensional regression models.

D.5.2 Fitting details
We select the model with the lowest loss from 3 separate model fits, initialized with randomized inducing points as described above. The maximum number of training epochs was 3000, but we stopped training before if the loss did not decrease more than ≈ 10 −3 percent over 100 steps. The learning rate was set to 10 −2 , and we also anneal the learning rate every 100 steps by a factor 0.9. In the case of latent spaces, we used a learning rate of 10 −3 for standard deviations of the variational distribution q(Z). All cases lead to satisfactory convergence of the model.
For latent variable models with a single angular latent, we initialize the lengthscale at large values. This avoided the model to overfit and fold the latent space as seen in panel A of Figure 3 for the ANN model. For these models, the best fits were achieved with an initial learning rate of 3 · 10 −2 and 5 · 10 −3 for the kernel lengthscale and the standard deviations of the variational distribution q(Z).

D.5.3 Hardware and fitting time
Synthetic data was analyzed with GeForce RTX 2070 (8 GB of memory). Real data was analyzed with Nvidia GeForce RTX 2080Ti GPUs (with 11 GB of memory). Fitting 33 neurons with ∼ 6 · 10 4 time points with the regression model in Figure 3 takes around 20 minutes, while fitting with a four dimensional latent spaces added takes around 50 minutes. These numbers can fluctuate depending on the flexible stopping criterion above. Generally, there is a trade-off between memory usage and speed by setting the batch size, with larger batch sizes being generally faster but taking more memory.

E Analysis details E.1 Synthetic data
We construct a synthetic head direction cell population inspired by bump attractor models [4,42,42]. Firing rate tuning curves to head direction θ are parameterized as von Mises bump functions with some constant offset f (θ; b, A, β, θ 0 ) = A e β cos (θ−θ0) + b (47) with b > 0 and A > 0. This results in f ≥ 0 for all valid inputs and parameters. For modelling firing rates, we additionally restrict ourselves to β > 0 to avoid inverted bumps at the preferred head direction θ 0 .
For the modulation by a hidden Euclidean signal in the modulated Poisson population, we additionally place Gaussian tuning curves on the latent dimensions with varying standard deviations and means. The Gaussian tuning curves tile the latent space that was traversed, which allows the model to infer the full trajectory. Note that tuning is factorized across the two dimensions (head direction x and latent signal z). Parameters were chosen from random distributions that led to firing rates and variability within the physiological regime.
In the Conway-Maxwell-Poisson (CMP) synthetic population, we place the tuning curves from Equation 47 on parameters ν and the approximate mean µ y = E[y] in Equation 20. Note both parameters have to non-negative to be in the valid range. Furthermore, the tuning curves of ν had potentially negative β ∈ R and different parameter statistics than for µ y . Again, these were chosen such that firing rates and variability were within the physiological regime. From the approximate relation Equation 20 of the mean, we obtain to match roughly the mean counts with von Mises bump patterns. To sample from the CMP distribution once we specified λ and ν, we use the fast rejection sampling method [43].

E.2 Neural data
Data was taken from Mouse 28, session 140313, during the wake phase [40]. The spiking data was recorded at a resolution of 20000 Hz, whereas behaviour was extracted from video recordings of animal body tracking at a resolution of 39.06 Hz. Note the time of the first video frame was randomly misaligned by 0-60 ms to the neural spike trains. We removed invalid behavioural segments in the data and performed linear interpolation across those segments. For circular variables, interpolation was taken in the shortest geodesic distance. We binned spiking data at 1 ms, and interpolated behavioural data to reach the same sampling frequency that is higher than the behavioural recording frequency. At a binning of 40 ms used in our analysis, we had K = 11 as the maximum count value.
We selected head direction cells based on a sparsity criterion, after trying several criteria as mutual information typically used for place cells [44]. First, we binned the head direction variable into 60 equal bins over the range [0, 2π]. For each bin, we now compute the average spike counts y i for head directions within bin i, and the relative occupancy P i . Note i P i = 1 is a probability distribution. Sparsity is defined as and with a selection criterion of sparsity ≥ 0.2 we obtained 33 head direction cells, of which 15 are in postsubiculum. Alternatively, although more computationally intensive, we could directly regress a Gaussian process model (e.g. Poisson baseline model Equation 50) and look at the kernel lengthscales on the angular input dimension. These will be appreciably larger than 2π for cells that are not tuned much to head direction.
Note that quite a few head direction units, which are supposed to represent single cells, show bimodal tuning curves or more to head direction. This is likely due to multiple neurons as signals can pollute in electrophysiological recordings and spike sorting can fail to distinguish between them [45,46].

E.3 Baseline models
The log Cox Gaussian process model puts a GP prior on the rate function of an inhomogeneous Poisson process (Equation 13) with an inverse link function f (x) = e x that is exponential h(x) ∼ GP(µ x , k xx ) λ(x) = f (h(x)) y ∼ P Poiss (y|λ · ∆, θ) where the time bin length is ∆, which turns λ into a proper rate quantity.
The heteroscedastic negative binomial model builds on this encoding model, More precisely, two GPs with an exponential inverse link function are used to model tuning to covariates of the rate λ and inverse shape 1/r of the negative binomial likelihood (Equation 16), leading to the model h(x) ∼ GP(µ x , k xx ), g(x) ∼ GP(µ x , k xx ) λ(x) = f (h(x)), 1 r = f (g(x)) y ∼ P NB (y|λ · ∆, r)

E.4 Generalized Z-scores
The generalized Z-scores in Equation 7 provide a normalized quantification of neural activity under the predictive model. The count distribution P (y) is taken to be the mean posterior count distribution of the posterior q(Π|X, Z). In the case of baseline models, the reference P (y) is given by the parametric distribution (Poisson in Equation 13, negative binomial in Equation 16) evaluated at the mean posterior values of the count distribution parameters given by the Gaussian process mapping (see Equation 50 and Equation 51). This is strictly speaking different from the mean posterior count distribution, as the parametric distribution depends non-linearly on these parameters. However, the difference is insignificant when the variational uncertainties are small, which was often the case in practice.

E.5 Marginal and conditional tuning curves
Due to the high dimensional input space, we can either visualize slices of the tuning curve over the relevant input variables x * or instead marginalize over other input variables. The conditional tuning curves are based on the count distributions P (y|x * , x c ), where x c are fixed and cover the dimensions complementary to x * (these are plotted in Figure 3C). On the other hand, marginalizing over x c is equivalent to an experimenter only looking at neural tuning to x * , which automatically marginalizes over all other behaviour not included. Mathematically, this can be interpreted as considering the x c -dimensions of a Markov Chain Monte Carlo path sampled from the joint density p D (x) which defines the marginalization through the computation done in practice (summing over the time series of observed x c while keeping x * fixed). From this marginalized distribution, we can compute similar quantities as before.
For the tuning indices, we evaluate the the count statistic T y (x * ) with respect to the posterior mean distribution P (y|x * ) after marginalizing (order does not matter as both are sums) to compute the tuning indices as described in Equation 6. Optimization over x * of T y (x * ) is done by grid search, as x * is low-dimensional and we compute its values over a grid anyway for plotting tuning curves of mean, Fano factor or any other count statistic.
We used 300 Monte Carlo samples from q(Π|X, Z) to compute the conditional tuning curves plotted in this paper. For marginalized tuning curves, we use 100 MC samples and temporally subsampled the observed input X D to retain the first time step for every 10 time steps, and used this to evaluate Equation 52. As behaviour shows strong temporal correlations at short time scales ( Figure 3G), this allows us to estimate the marginal tuning curves more efficiently. The mean of these samples was used to compute the mean posterior tuning curves for evaluating the TIs. When evaluating the average mean count and Fano factor at every time step ( Figure 3B and Figure 4B), we used 10 MC samples from q(Π|X, Z). When latent variables were present ( Figure 3E), the 10 MC samples were drawn from q(Z), corresponding to m = 10 and k = 1 in Algorithm 1.

E.6 Temporal cross-correlations of covariates
We use the cross-correlation between time series x t and y t r xy (∆) = (x t+∆ − x t+∆ )(y t − y t ) σ x σ y which includes the auto-correlation as a special case, e.g. r xx (∆). When one of the variables is a circular variable θ t , we use the linear-circular correlation coefficient in [47] s t = sin θ t , c t = cos θ t R xs = r xs (∆), R xc = r xs (∆), R cs = r cs (∆) and for the case when both are circular, we use the circular correlation coefficient proposed by [48] s θ = sin (θ t − Arg E[e iθt ]), same for φ Time scales are estimated from the auto-correlations of covariates. The time scale τ is then chosen as the time step at which the value of the auto-correlation dropped by a factor e from 1 at ∆ = 0.

E.7 Preferred head direction
To compute the preferred head direction θ pref , we use the centre-of-mass of the firing rate profile r(θ) of head direction θ θ pref = Arg[r(θ)e iθ ] which is more robust to noise than taking the angle at which r(θ) is at a maximum. We can evaluate θ pref as a function of angular head velocity (AHV) and absolute time to compute the ATIs and the neural drift as described in Appendix A.

E.8 Circular-linear regression
We computed the circular-linear regression [49] using a measure of the correlation between circular variables θ 1 and θ 2 R = |E[e i(θ1−θ2) ]| By computing R between a circular-linear function φ(t) φ(t) = 2πat + b (58) and the circular data time series θ t , we can perform the regression by maximizing R through optimizing the parameter a with gradient descent. The offset b is obtained analytically From the values a after fitting, one can compute the linear drift values and ATIs as described in Appendix A. In addition, not all cells are well-described by the linear drift or ATIs, so we discarded cells which had an optimized value of R < 0.999. This cutoff was chosen as it retains cells that are visually in agreement with linear relations as seen in Figure 4, while discarding a few outlier cells.

E.9 Latent alignments
To align 1D circular latent trajectories z c to a target trajectory, we minimize their mean geodesic distance under a constant shift µ and potential sign flip s = ±1 z c = s · z c + µ We add a linear drift ∆z c = s · z c + µ + t · ∆ to find potential drifting of the inferred trajectory as done in panel D of Figure 4. This is similar to the circular-linear regression above [49], but with the geodesic distance on the ring instead. This is consistent with root-mean-square errors in the latent signal from behaviour that are computed with the geodesic distances. For 1D Euclidean latent trajectories, we align by fitting a translation and scaling parameter.
In all cases, the root mean squared error (RMSE) of the alignment is evaluated in a cross-validated manner. For circular variables, we use the geodesic distance for computing the squared error just as in aligning. In more detail, we fit the trajectory transformation parameters such that we minimize the errors on the validation segment, and then use these fitted parameters to compute the transformed latent trajectory in the held-out segment. This is then used to compute the RMSE for the alignment of the cross-validation fold.