## Abstract

A stimulus can be encoded in a population of spiking neurons through any change in the statistics of their spike patterns. Thus, the baseline spike statistics in the absence of a stimulus can impact the population’s encoding capacity. Some neurons maintain a baseline firing pattern and can *decrease* their spike rate in response to a stimulus. Not only do baseline firing rates vary widely among different types of neurons, but so do their higher-order statistics, like the degree to which they tend to group their spikes together into bursts and how those bursts are grouped across the population. We investigated how higher-order statistics of baseline spike patterns impact how much information a neural population can transmit about a stimulus that drives a gap in firing using a novel information-theoretic decoding mechanism which we call an “information train.” We discover that there is an optimal level of burstiness for gap detection that is robust to other parameters of the neural population like its size, mean firing rate, and level of correlation. We consider this theoretical result in the context of experimental data from different types of retinal ganglion cells with different baseline spike statistics and determine that the spike statistics of bursty suppressed-by-contrast (bSbC) retinal ganglion cells support nearly optimal detection of both the onset and strength of a contrast step.

## Introduction

Most neurons communicate using sequences of action potentials called spike trains. Decoding information from spike trains involves extracting a signal in the face of noise, but which features of the spike train represent signal versus noise is not always clear [1, 2]. Neurons may differ both in the way they respond to a stimulus and in their baseline spike patterns in the absence of a stimulus, and spike trains among a population of neurons can vary in their patterns of correlation. Many papers have examined the roles of first-order statistics (mean firing rate) and correlations in the efficiency of information transmission by spike trains [3–7]. Higher-order statistics beyond the mean and variance of the spike rate, such as temporal coding and burst coding, have also been studied, especially in the auditory cortex [8, 9], but not in the context of suppression of firing. Rather than focus on the effect of higher-order statistics as a stimulus *response* feature, we are interested in how suppression of firing codes for a stimulus under different higher-order baseline spike statistic conditions. Here, we study how the burstiness of spike trains affects transmission of information about the timing and duration of a gap in firing rate among populations of neurons.

Any change in the statistics of a spike train could theoretically provide information; for sensory neurons we often measure the information with respect to a known sensory stimulus. Many neurons have low or zero spontaneous spike rates [10], so increases in rate in the presence of a stimulus are the most common and well-studied type of information transmission [11–13]. However, the central nervous system (CNS) also contains a wide variety of tonically firing neurons, which are well suited to transmit information by *decreasing* their tonic spike rate. Examples include Purkinje cells [14, 15], lateral geniculate relay neurons [16], spinal cord neurons [17], hippocampal CA1 [18] and CA3 [19] cells, and cortical pyramidal cells [20]. Tonic firing in neurons results from a combination of the influence of synaptic networks and intrinsic electrical properties [21, 22]. Both factors can affect the higher-order statistics of tonic spike trains. Different neurons at the same mean rate can have spike trains which vary along the spectrum from the perfectly periodic clock-like transmission of single spikes to highly irregular bouts of bursting and silence [23, 24].

Retinal ganglion cells (RGCs), the projection neurons of the retina, offer a tractable system for both experimental and theoretical studies of information transmission in spike trains. As the sole retinal outputs, RGCs must carry all the visual information used by an animal to their downstream targets in the brain. RGC classification is more complete, particularly in mice [25], than the classification of virtually any other class of neurons in the CNS. Finally, RGCs of each type form mosaics that evenly sample visual space, creating a clear relationship between the size of a visual stimulus and the number of neurons used to represent it [26].

Inspired by the recent discoveries of how different types of tonically firing RGCs encode contrast [27, 28], here we develop a general theoretical framework to measure the role of burstiness in a neural population’s ability to encode the time and duration of a gap in firing. We propose a continuous readout of a spike train into an “information train” which tracks Shannon information [29] over time. Critically, we show that information trains can be summed in a population in a way that captures the raw spiking activity more faithfully and completely and is more robust to population heterogeneity than combining the underlying spike trains by standard methods such as population peri-stimulus-time-histograms. We first develop a model to generate spike trains with varying levels of burstiness and correlation, then establish metrics to decode the time and duration of a firing gap from the spike trains, and finally measure decoding performance when we vary burstiness, population correlation, firing rate, and population size.

Our principal result is a theoretical justification for why burstiness exists in some cell types; there is an optimal level of burstiness for identifying the time of stimulus presentation and stimulus strength. This result could help explain the variability in burstiness between different types of neurons. We compare our theoretical results to spike trains from recorded mouse RGCs to measure how close each type lies to optimality for each decoding task.

## Results

### A parameterized simulation of populations of tonic and bursty neurons

To study the role of higher-than-first-order statistics in the encoding of spike gaps by populations of neurons, first we needed a parameterized method to generate spike trains with systematic variation in these statistics which could accurately model experimental data. Two RGC types that our lab has studied extensively represent cases near the edges of the burstiness range. Both types have fairly similar mean tonic firing rates (between 40 and 80 Hz). However, while bursty suppressed-by-contrast (bSbC) RGCs fire in bursts of rapid activity interspersed with periods of silence, OFF sustained alpha (OFFsA) RGCs spike in more regular patterns (**Fig 1A**) [28].

A nested renewal process is a doubly stochastic point process that can simulate both bursty and tonic firing patterns with only four parameters (see **Methods**). This process builds upon one of the most commonly used methods to generate spike trains: the Poisson process. The Poisson model of spike generation assumes that the generation of spikes depends only on the instantaneous firing rate [30]. Using a Poisson process to generate spikes leads to exponentially distributed interspike intervals (ISIs; **Fig 1B** left). One obvious limitation of Poisson spike generation, however, is its inability to model refractoriness – the period of time after a neuron spikes during which it cannot spike again. A renewal process extends the Poisson process to account for the refractory period by allowing the probability of spiking to depend on the time since the last spike as well as the instantaneous firing rate [30, 31]. The resulting spike train has gamma-distributed ISIs (**Fig 1B** middle) since the refractory period does not allow for extremely short intervals between spikes.
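As a concrete sketch (not the code used in this study), a gamma renewal process can be simulated by accumulating gamma-distributed ISIs whose mean is the reciprocal of the target rate; the function and parameter names here are illustrative:

```python
import numpy as np

def renewal_spike_train(rate, shape, duration, rng=None):
    """Simulate a renewal process with gamma-distributed ISIs.

    shape = 1 recovers a Poisson process; shape > 1 suppresses very
    short ISIs, mimicking a refractory period.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Gamma ISIs with mean 1/rate: scale = 1 / (shape * rate).
    n = int(2 * rate * duration) + 20  # generous bound on spike count
    isis = rng.gamma(shape, 1.0 / (shape * rate), size=n)
    times = np.cumsum(isis)
    return times[times < duration]

spikes = renewal_spike_train(rate=60.0, shape=4.0, duration=10.0,
                             rng=np.random.default_rng(0))
```

Setting `shape=1` gives exponential (Poisson) ISIs, while larger `shape` values thin out very short intervals without changing the mean rate.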

Even though a renewal process is more physiologically accurate than a Poisson process, it still fails to model burstiness. Therefore we extended the renewal process again by nesting one renewal process inside another [32], similarly to how others have previously constructed doubly stochastic (although non-renewal) processes to model spikes [33]. The outside renewal process parameterized by *κ*_{1} (a shape parameter) and *λ*_{1} (a rate parameter) determines the number and placement of burst events, while the inside renewal process with parameters *κ*_{2}, *λ*_{2} determines the number and placement of spikes within each burst (see **Methods**). Therefore, a nested renewal process can flexibly simulate both regular (1 spike/burst, as in a standard renewal process) and bursty (many spikes per burst) firing patterns. The spike trains generated by a nested renewal process have ISIs which are well fit by a sum of gammas distribution (**Fig 1B** right and **Fig S1**), where the tall, narrow left mode of the distribution corresponds to the intervals between spikes within bursts, and the short, wide right mode of the distribution corresponds to the intervals between bursts. Finally, a nested renewal process models activity of bSbCs and OFFsAs well; tuning its parameters results in good approximations to ISI distributions of experimentally measured bSbC (**Fig 1C**) and OFFsA RGCs (**Fig 1D**), since bSbCs and OFFsAs have ISI distributions which are well fit by sum of gammas as well (**Fig S2**).
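The nesting can be sketched as follows. Because the exact construction is specified in the paper's Methods, this illustration makes one simplifying assumption: the number of spikes per burst is drawn from a Poisson distribution with an assumed mean (`mean_per_burst`) rather than derived from the four process parameters.

```python
import numpy as np

def nested_renewal_train(k1, l1, k2, l2, duration, mean_per_burst=10,
                         rng=None):
    """Illustrative nested renewal process (not the paper's exact code).

    An outer gamma renewal process (k1, l1) places burst onsets; an
    inner gamma renewal process (k2, l2) spaces the spikes within
    each burst.  The per-burst spike count is a Poisson draw here,
    an assumption made to keep the sketch short.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_out = int(2 * l1 * duration) + 20
    onsets = np.cumsum(rng.gamma(k1, 1.0 / (k1 * l1), size=n_out))
    onsets = onsets[onsets < duration]
    spikes = []
    for t0 in onsets:
        n_spk = rng.poisson(mean_per_burst)  # can be 0: an "empty" burst
        isis = rng.gamma(k2, 1.0 / (k2 * l2), size=n_spk)
        spikes.extend(t0 + np.cumsum(isis))
    spikes = np.sort(np.asarray(spikes))
    return onsets, spikes[spikes < duration]

onsets, spikes = nested_renewal_train(k1=4, l1=5, k2=2, l2=200,
                                      duration=10.0,
                                      rng=np.random.default_rng(1))
```

With roughly 5 bursts/s and 10 spikes per burst, the overall rate lands near 50 Hz while the ISI histogram shows the two modes described above.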

In order to quantify burstiness by a single measure, we defined a burstiness factor, *β*, as the average number of spikes per burst normalized by the firing rate (measured in seconds/burst). **Fig 1E** illustrates the effect of different levels of burstiness on the ISI distribution for a fixed firing rate. Both the number of bursts and the number of spikes are randomly generated within our model, so it is important to note that a burst can contain as few as zero spikes by chance; in other words, a “burst” as defined by the outside renewal process may not always correspond to what we would label a “burst” by eye. This goes against the common view of a burst as needing to contain at least a certain number of spikes within a certain period of time. However, bursts determined by the outside renewal process will contain, on average, the number of spikes dictated by the inside renewal process, and we prefer our quantification of a burstiness factor since it eliminates the need to threshold, thus reducing both the number of parameters and the arbitrariness. Our quantification of burstiness depends on the parameters of the nested renewal process and cannot be applied directly to experimental data; therefore, we matched ISI distributions from experimental data with simulated data (**Fig 1C,D**) via a Kolmogorov-Smirnov test in order to measure burstiness in spike trains from recorded RGCs (**Fig 1F**). In addition to variable burstiness, the nested renewal process allowed us to independently modulate the firing rate and to simulate neural populations of any desired size with different levels of correlation (**Fig 2A**; see **Methods**).
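The matching step can be sketched with a two-sample Kolmogorov-Smirnov test: among candidate simulated parameter sets, pick the one whose ISI sample lies closest to the recorded ISIs. The "recorded" data below are a synthetic stand-in and the single-gamma candidates and grid values are illustrative (the full model compares nested renewal ISI distributions):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
# stand-in for recorded ISIs: gamma with shape 2, mean 20 ms (50 Hz)
recorded_isis = rng.gamma(2.0, 0.01, size=2000)

def simulate_isis(shape, rate, n=2000):
    """Candidate ISI sample: gamma with given shape and mean 1/rate."""
    return rng.gamma(shape, 1.0 / (shape * rate), size=n)

# small illustrative grid of (shape, rate) candidates
grid = [(k, l) for k in (1.0, 2.0, 4.0) for l in (25.0, 50.0, 100.0)]
best = min(grid, key=lambda p: ks_2samp(recorded_isis,
                                        simulate_isis(*p)).statistic)
```

The candidate minimizing the KS statistic recovers the generating parameters of the stand-in data, which is the sense in which simulated and recorded ISI distributions are "matched."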

We modeled a stimulus-dependent drop in firing as an instantaneous drop to a rate of zero followed by an exponential rise back to the baseline rate, controlled by a variable time constant of recovery (**Fig 2B**). We also introduced heterogeneity into the population of neurons by making a subset of them unresponsive to the stimulus, continuing to spike with baseline statistics after stimulus onset (**Fig 2C**). This stimulus model was chosen because it is consistent with how both bSbC and OFFsA RGCs respond to positive contrast; longer time constants of firing recovery correspond to higher contrast stimuli [28, 34]. However, the stimulus model generalizes to other types of neurons as well. Analogous to models for detection of the onset and duration of an increase in firing, this stimulus allowed us to measure performance in decoding the onset and duration of the firing gap for each simulated population of neurons.
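The stimulus model itself is simple to write down: an instantaneous drop of the firing rate to zero at stimulus onset, followed by an exponential recovery to baseline with time constant `tau` (names illustrative):

```python
import numpy as np

def gap_rate(t, baseline, onset, tau):
    """Firing rate with a stimulus-driven gap: instantaneous drop to
    zero at `onset`, exponential recovery to baseline with time
    constant `tau` (longer tau = stronger stimulus)."""
    r = np.full_like(t, float(baseline))
    after = t >= onset
    r[after] = baseline * (1.0 - np.exp(-(t[after] - onset) / tau))
    return r

t = np.linspace(0.0, 2.0, 2001)
r = gap_rate(t, baseline=60.0, onset=0.5, tau=0.2)
```

This rate profile would then drive the (nested renewal) spike generator in place of a constant baseline rate.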

### The information train enables aggregation of heterogeneous spike trains

By aggregating the spike trains of the population of neurons into a one-dimensional signal over time, decoding the onset and duration of a gap in firing can be formulated as a threshold detection task; the choice of threshold determines when the decoder reports that the population is in the baseline state versus the stimulus (gap detected) state (**Fig 3A**). The choice of threshold naturally implies a tradeoff between reaction time, the delay from stimulus onset until the threshold is crossed, and false detection rate, the rate at which the threshold is crossed in the absence of the stimulus. To simplify our analysis, we chose a decoder threshold for each population that achieved a fixed, low error rate of 0.1 false detections per second. Thus, for the detection task, reaction time, *δ*, was the sole performance metric for the decoder. The performance metric for the duration task was more complex and is considered in the subsequent section.

Next, we considered the choice of the aggregation method used to combine spike trains across cells. The simplest method, which is commonly used in such situations, is to compute the population peri-stimulus-time-histogram (pPSTH) by simply collecting all spikes in the population and smoothing to create a continuous rate signal (**Fig 3F** left). However, there are many properties of population activity that can carry information besides the average firing rate, which is the most commonly used first-order statistic [35, 36]. The pPSTH loses some of the information about the higher-order statistics of each individual spike train that could, in principle, be useful to the decoder. Therefore, we developed a new aggregation method based on the information content of the spike trains. We called the resulting signal the “information train.”
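For reference, a pPSTH is straightforward to compute; the bin width and Gaussian smoothing kernel below are illustrative choices:

```python
import numpy as np

def ppsth(spike_trains, duration, bin_width=0.005, sigma_bins=2):
    """Population PSTH: pool all spikes, bin, and smooth with a
    Gaussian kernel to get a continuous population rate signal (Hz,
    per cell)."""
    edges = np.arange(0.0, duration + bin_width, bin_width)
    counts, _ = np.histogram(np.concatenate(spike_trains), bins=edges)
    rate = counts / (bin_width * len(spike_trains))
    # normalized Gaussian smoothing kernel
    k = np.arange(-4 * sigma_bins, 4 * sigma_bins + 1)
    kernel = np.exp(-0.5 * (k / sigma_bins) ** 2)
    kernel /= kernel.sum()
    return np.convolve(rate, kernel, mode="same")

rng = np.random.default_rng(4)
# two cells firing ~50 Hz for 10 s (uniform spike times as a stand-in)
trains = [np.sort(rng.uniform(0.0, 10.0, 500)) for _ in range(2)]
pop_rate = ppsth(trains, duration=10.0)
```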

The information train method was inspired by neural self-information analysis [37]. In order to build a continuous signal of the information content in a single cell over time, we started by computing a self-information (SI) curve from the ISI distribution using Shannon’s definition [29]: *SI* = −*log*_{2}(*p*), where *p* is the ISI probability. The SI curve gives the information contained in every observed ISI (**Fig 3B**). SI can be thought of as a measure of surprise: very probable ISIs correspond to small SI, while very improbable ISIs correspond to large SI. A spike train can be equally well described by its ISIs as by its spike times, so for each ISI observed in the spike train, we may use the SI curve to find the amount of information it contains. Neural self-information analysis then replaces each ISI in a spike train with the corresponding SI value [37], but this creates a discrete signal. Instead, our threshold detection paradigm requires a continuous signal that can be aggregated across cells.
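In code, an SI curve can be sketched by fitting the baseline ISI distribution and taking −log2 of the fitted density (a single gamma here for brevity, where a sum-of-gammas fit would be used for bursty cells; the density is the continuous stand-in for the ISI probability):

```python
import numpy as np
from scipy.stats import gamma

def si_curve(isis, grid):
    """Self-information of each ISI length, SI = -log2 p(ISI), using
    a gamma fit to the baseline ISIs."""
    shape, _, scale = gamma.fit(isis, floc=0)
    p = gamma.pdf(grid, shape, scale=scale)
    return -np.log2(np.maximum(p, 1e-300))  # clip to avoid log(0)

rng = np.random.default_rng(5)
isis = rng.gamma(4.0, 0.005, size=3000)           # mean ISI 20 ms
grid = np.linspace(1e-4, 0.1, 500)
si = si_curve(isis, grid)
most_probable = grid[np.argmin(si)]               # near the gamma mode
```

The minimum of the curve falls at the most probable ISI, and SI grows for intervals that are either unusually short or unusually long.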

Thus, we developed a new analysis to translate a spike train into a continuous self-information train (**Fig 3C** top). We separate ISIs into three cases. In the first case, the current ISI is of the same length as the most probable, or most commonly observed, ISI for the cell. Therefore, this ISI contains the lowest possible SI, which we call the “baseline information.” The SI train always begins at baseline information, and in this case, remains there for the duration of the ISI. In the second case, the current ISI is longer than the most probable one. However, we do not know that it is longer until the point in time when it has passed the length of the most probable ISI and the cell has not yet spiked. The SI train reflects this by beginning at baseline information and staying constant for the duration of the most probable ISI, then rising according to the SI curve and stopping at the SI value indicated by the total current ISI (blue sections in **Fig 3C** bottom). Finally, in the third case, the current ISI is shorter than the most probable one. In this case, we know that it is shorter as soon as it ends, so the SI train stays at baseline information until the very end of the ISI, then rises instantaneously to the value indicated by the SI curve (green sections in **Fig 3C** bottom). At the start of each ISI, the SI train resets to baseline information, meaning that it has no memory of its history and considers subsequent ISIs independently.
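The three cases can be sketched on a time grid. The single-gamma SI curve used here is an illustrative stand-in, the recording start is treated as a pseudo-spike, and the instantaneous jump of the third case is applied at the final sample of each short ISI:

```python
import numpy as np
from scipy.stats import gamma

def si_train(spike_times, si_func, t_star, dt, duration):
    """Continuous self-information train.

    Within each ISI the train sits at the baseline SI (the SI of the
    most probable ISI, t_star) until t_star elapses without a spike,
    then rises along the SI curve (case 2); an ISI shorter than
    t_star jumps to its SI only at the sample where it ends (case 3).
    """
    t = np.arange(0.0, duration, dt)
    baseline = float(si_func(t_star))
    out = np.full(t.shape, baseline)
    bounds = np.concatenate(([0.0], np.sort(spike_times), [duration]))
    for s, e in zip(bounds[:-1], bounds[1:]):
        seg = np.flatnonzero((t >= s) & (t < e))
        elapsed = t[seg] - s
        out[seg] = np.where(elapsed > t_star, si_func(elapsed), baseline)
        if e - s < t_star and seg.size:          # case 3: jump at ISI end
            out[seg[-1]] = si_func(e - s)
    return t, out

# illustrative SI curve from a gamma density with mode at 15 ms
si = lambda x: -np.log2(np.maximum(gamma.pdf(x, 4.0, scale=0.005), 1e-300))
t, train = si_train(np.array([0.01, 0.02, 0.2]), si,
                    t_star=0.015, dt=0.001, duration=0.3)
```

During the long silent stretch after the spike at 0.2 s, the train climbs steadily, which is exactly the signature the gap decoder thresholds on.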

For a single cell, the ISI distribution contains all the information in the spike train, so the self-information train is a lossless representation. However, for multiple cells, there are joint ISI distributions to consider. Assuming independence between cells, the information contained in the population is the sum of the information trains of each cell (see **Methods**). Independence is usually not a reasonable assumption, but experimental data from different RGC types show extremely low pairwise noise correlations (**Fig S3**). Furthermore, several studies [38–40], including in the retina [41–43], have shown that only a small amount of information is lost when neural responses are decoded using algorithms that ignore noise correlations – on the order of 10% of the total information. Therefore, we constructed population information trains by summing the SI trains of each cell in the population (**Fig 3F** right) while recognizing that this sum is a lower bound on the total information present in the full population of spike trains. Correlations can also be induced by a stimulus, and stimulus correlations in populations of neurons are typically larger than noise correlations [42, 44–46]. Our goal, however, was to study the optimal baseline spike statistics for a task in which all responsive neurons were suppressed by the stimulus simultaneously. Thus, non-independence in our population before stimulus onset and after stimulus offset (in other words, when the population was active) was only due to noise correlations.

A critical benefit of the population information train over the pPSTH is that it should automatically amplify the contribution of responsive cells in a population (which will have large SI) relative to unresponsive cells (which will have small SI) without the need to define a cell selection criterion or weighting strategy. To test this intuition, we simulated populations of 30 neurons and varied their responsivity to the stimulus. We decoded the gap onset time for each of these populations using both the pPSTH and the population information train. The pPSTH decoder was extremely sensitive to responsivity, usually failing to reach the error threshold when fewer than 80% of cells were responsive (**Fig 3G** left). The population information train decoder approach was robust to very large fractions of unresponsive cells (**Fig 3G** right). Subsequent analyses used the population information train decoder and 100% responsivity, but our conclusions are robust to lower responsivity (**Fig S4**).

### There exists optimal burstiness for minimizing reaction time

Having developed a readout mechanism – the population information train – we then used it to decode the onset of a gap in firing by measuring reaction time. We were interested in the effects of multiple parameters on reaction time: burstiness, firing rate, population size, and correlation. For three of these parameters, we had an intuition for how they should affect performance. When trying to decode the onset of a gap, temporal precision is key to getting accurate estimates. Increasing the firing rate of a population increases temporal precision, so we expected that increasing the firing rate would improve performance (i.e. decrease reaction time). Another way to gain temporal precision is to increase the size of a population – therefore we also expected population size to have a positive effect on performance. A population which has very low cell correlations is unlikely to have many overlapping spikes from different cells, while a population with high correlation is likely to have major overlap. This implies that temporal precision, and thus performance, should decrease with correlation. In contrast to the other parameters, the intuition for how burstiness affects performance is not simple and was our central question in this part of the study.

We isolated the effect of burstiness on reaction time by holding firing rate, population size, and correlation constant and found a non-monotonic relationship (**Fig 4A**), suggesting that for each combination of parameters there may be a different level of burstiness that minimizes reaction time. We could continue in this way to isolate the effects of each parameter by holding the other parameters constant at different values, but the number of parameters, and their ranges, makes this impractical. A more elegant approach is to try to find a unifying principle which can collapse the data. Dimensional analysis gives us the tools to identify such a unifying principle.

Dimensional analysis uncovers the relationships between variables by tracking their physical dimensions through calculations. Since responsivity is held constant at 100%, there are only five relevant quantities altogether: reaction time *δ*, burstiness *β*, firing rate *λ*, population size *N*, and correlation *α*. These are all either dimensionless or some transformation of the physical dimension “time,” so there is only one relevant dimension in this problem. By the Buckingham Pi Theorem [47], it is possible to construct 5 − 1 = 4 independent dimensionless groups which are related by some unknown function. We chose to make reaction time and burstiness dimensionless by multiplying by firing rate (brackets denote the dimension of the quantity inside): since [*δ*] = [*β*] = time and [*λ*] = 1/time, the four dimensionless groups are *δλ*, *βλ*, *N*, and *α*.
We may write any one of these dimensionless quantities as a function of the rest, but it is difficult to fit functions of three variables, so we fix population size and correlation so that the function no longer depends on them, leaving *δλ* = *f*(*βλ*).
This immediately reveals that both reaction time and burstiness are inversely proportional to the firing rate. While burstiness was defined in such a way that it must be inversely proportional to firing rate (see **Methods**), it is illuminating that reaction time is inversely proportional as well. This basic result of dimensional analysis gives us much more information than our intuition, which simply told us that reaction time should decrease with firing rate.

The functional form of *f* can now be found by fitting (**Fig 4B**). Dimensional analysis collapses the data for all firing rates onto a single curve, with no residual dependence on firing rate, so we can analyze them together. There is a clear minimum, demonstrating that there is some level of burstiness which is optimal for minimizing reaction time across all firing rates. Now we may separately vary population size and correlation (**Fig 4C,D**), illustrating that the existence of optimal burstiness for minimizing reaction time is robust to both parameters. The x and y values of the minima of these curves completely describe how optimal burstiness and minimum reaction time depend on population size and correlation: simply dividing the x and y values of the minimum by the firing rate of the population, we obtain the exact (dimensionful) optimal burstiness and minimum reaction time. Minimum reaction time decreases with population size and increases with correlation (as predicted by our intuition), while optimal burstiness is relatively constant with both parameters, indicating that there is one level of burstiness optimal for detecting stimulus onset at any population size and correlation.

### There exists optimal burstiness for discriminating stimulus strength

Besides “when,” another fundamental question to ask about a stimulus is: how strong is it? In our case, the strength of the stimulus corresponds to the duration of the gap in spiking. We varied stimulus strength by changing the recovery time of the firing rate (**Fig 2B**) and measured the gap length in the information train (**Fig 3E** bottom right), the duration of time between stimulus onset detection and offset detection, in order to determine how the length of the gap in spiking depends on the recovery of the suppressed firing rate (**Fig 5A**). The performance metric here – the measure of how well a population can discriminate the stimulus strength – is essentially the accuracy with which the time constant of recovery is captured by the gap length measurement. Therefore the performance metric should actually be an error metric: a measure of how much error there is in the relationship between gap length and the time constant. However, it is not immediately obvious what the relationship between these two variables actually is, much less how much scatter there is around it. Dimensional analysis is again a useful tool here. There are six relevant quantities in this problem: gap length *γ*, recovery time constant *τ*, burstiness *β*, firing rate *λ*, population size *N*, and correlation *α*. Applying the Buckingham Pi Theorem [47] and setting all but two of the dimensionless groups (see **Methods**) to be constant (so that we obtain a function of one variable which relates the gap length and time constant of recovery), we have *γλ* = *f*(*τλ*).

By fitting, it is clear that there is an exponential relationship between these variables (**Fig 5B**). Now, for each combination of parameters, for which we have several trials of gap length measurements, we chose the performance metric to be the scatter in the data around the exponential function suggested by dimensional analysis (**Fig 5C**), which we quantified with normalized root mean square error (NRMSE; see **Methods**).
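One plausible rendering of this metric, with the caveat that the exact exponential form and normalization are defined in the paper's Methods: fit an exponential to the dimensionless gap-length data and normalize the residual RMSE by the range of the data.

```python
import numpy as np
from scipy.optimize import curve_fit

def nrmse_exponential(x, y):
    """Fit y = a*exp(b*x) + c and return the RMSE of the residuals
    normalized by the range of y (one common NRMSE convention)."""
    model = lambda x, a, b, c: a * np.exp(b * x) + c
    popt, _ = curve_fit(model, x, y, p0=(1.0, 1.0, 0.0), maxfev=10000)
    resid = y - model(x, *popt)
    return np.sqrt(np.mean(resid ** 2)) / np.ptp(y)

rng = np.random.default_rng(6)
x = np.linspace(0.0, 2.0, 200)                   # stands in for tau*lambda
y = 2.0 * np.exp(0.8 * x) + rng.normal(0.0, 0.1, x.size)  # gap*lambda
eps = nrmse_exponential(x, y)
```

Low scatter around the fitted exponential (small `eps`) means gap length tracks the recovery time constant reliably, i.e. stimulus strength is well discriminated.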

Now that we have a performance metric, we wanted to see how it depended on burstiness in particular, as well as firing rate, population size, and correlation. Once again, we could isolate the effects of each of these parameters by holding all the others constant (**Fig 6A**), but using dimensional analysis simplifies the problem by collapsing data. The relevant quantities are the performance metric NRMSE *ϵ*, burstiness *β*, firing rate *λ*, population size *N*, and correlation *α*, so applying the Buckingham Pi Theorem [47] (see **Methods**) and setting population size and correlation to be constant, we have *ϵ* = *f*(*βλ*).

While burstiness is still inversely proportional to the firing rate, this reveals that NRMSE is constant with firing rate. In other words, the ability to decode the duration of a gap in firing does not depend on the spontaneous firing rate. Fitting reveals that *f* is a function which decays monotonically to a nonzero constant (**Fig 6B**; see **Methods**). Interestingly, there is a large range of burstiness that optimizes NRMSE, in contrast to the single optimal value of burstiness that minimized reaction time.

Next we separately varied population size and correlation (**Fig 6C,D**), which showed that the existence of an asymptotic NRMSE, and of a large range of burstiness achieving this optimal performance, was robust to both parameters. We chose to study the “elbows” of these curves because if we can describe the lowest level of burstiness that optimizes NRMSE, we know that any burstiness larger than that will also be optimal. Asymptotic NRMSE decreases with population size and increases with correlation, while optimal burstiness is relatively constant with both parameters, implying that the level(s) of burstiness optimal for discriminating stimulus strength is not affected by population size or correlation.

### bSbC RGCs have close to optimal burstiness for gap detection

Having established that optimal burstiness exists for decoding the onset and duration of a gap in firing, our next question was how the baseline spike statistics of the RGC types we studied relate to this optimum. Recall that we fit functions that describe how reaction time (**Fig 4**) and NRMSE (**Fig 6**) depend on dimensionless burstiness. Dimensionless burstiness is simply burstiness multiplied by firing rate, so another way to represent the information in **Figs 4, 6** is as a three-dimensional graph of reaction time (or NRMSE) vs. burstiness and firing rate (**Fig 7**). We compared performance of different types of experimentally recorded cells by using their firing rate and burstiness to predict how quickly they would detect stimulus onset (**Fig 7A**) and how accurately they would discriminate stimulus strength (**Fig 7B**) according to our model. Thus the experimental data necessarily lie on the surfaces. We set the population size at 5 cells and the correlation between 0 and 0.2, but as we saw earlier, optimal burstiness is negligibly affected by both the population size and correlation, so these results apply to any values of those parameters. Both bSbCs and OFFsAs are quite good at detecting a stimulus quickly, although bSbCs are closer to optimal reaction time, while both types of sustained suppressed-by-contrast RGCs are much worse at this task. However, **Fig 7B** suggests that bSbCs are by far the best at discriminating stimulus strength, since their burstiness puts them right on the lower bound of optimal burstiness; OFFsA RGCs do not perform nearly as well. Therefore, bSbC RGCs have spiking patterns which enable them to both react to a stimulus coming on as quickly as possible and detect the strength of that stimulus with great accuracy.

## Discussion

To investigate the role of burstiness in population decoding of firing gaps, we simulated spike trains for populations of neurons using a nested renewal process. This strategy allowed us to capture the statistics of recorded spike trains from RGCs and also to vary burstiness parametrically (**Figs 1, 2**). We then developed a new analysis to combine spike trains across a population into an information train and demonstrated that this method is more robust to unresponsive cells than a population PSTH (**Fig 3**). Using the information trains of different simulated populations, we discovered that there is an optimal burst rate for detecting the onset of a firing gap that is inversely proportional to firing rate and relatively independent of population size and correlation (**Fig 4**). There is also an optimal range of burstiness for detecting gap duration that is relatively independent of all of these other parameters (**Figs 5, 6**). Finally, we considered the baseline spike statistics of four RGC types in the context of these theoretical results and revealed that the burst patterns of bSbC RGCs make them nearly optimal for detecting both the onset and the strength of a contrast step (**Fig 7**).

### Baseline spike statistics can be optimized for different decoding tasks

The fact that some burstiness higher than minimum is optimal for identifying the time of stimulus presentation and stimulus strength leads us to the same overarching principle as [48–50]: different spiking patterns are optimal for encoding different stimulus features. We extended their results by demonstrating two concrete examples of how optimal encoding of different stimulus features depends on spiking patterns. The theoretical implications of this are that different neurons may display different spiking statistics depending on the stimulus features they encode.

Earlier we provided intuition about the effects of firing rate, population size, and correlation on reaction time. The same intuition holds for performance on decoding the duration of a gap in firing, since it is also dependent on temporal precision. Now we will lay out our intuition for why there exists optimal burstiness for both tasks, which will explain why it makes sense that specific spiking patterns are optimal for encoding different stimulus features. We again attribute increased performance on these tasks to increased temporal precision. As explained earlier, a population could theoretically increase its temporal precision by increasing its firing rate or its size, or decreasing its cell correlations; however, there are physiological limits to a population’s firing rate and correlation, and the size of the population activated by a stimulus is related to the size of the stimulus [26]. In contrast, rearranging the pattern of spikes is something that a cell can reasonably do, and if it rearranges its spiking pattern so that it has long periods of silence and short periods of rapid activity, or bursts, then it has great temporal precision within those bursts. If that cell is part of a population with low pairwise cell correlations (**Fig S3**), then the bursts from different cells would be staggered in time, allowing the population as a whole to have excellent temporal precision. Therefore it is clear that increasing burstiness can improve performance. Following this line of thinking, it is also easy to see why too much burstiness would be harmful: if the population is trying to detect stimulus onset, the periods of silence between bursts could become so long that even with many cells it is very possible that the population would entirely miss the stimulus simply because no cell was in the middle of a burst when the stimulus was presented. 
In addition, the fact that firing rate, population size, and correlation had effects on performance that were predictable from this intuition lends further credibility to our result that there is an optimal level of burstiness for identifying the timing and strength of a stimulus.

We found that the baseline firing statistics of bSbC RGCs lie close to the optimum for performance on the gap detection task (**Fig 7**), but what about the other three high-baseline-rate RGC types we analyzed and the many lower-baseline-rate ones we did not analyze? It is important to consider (1) that gap detection as we defined it is only one of the many decoding tasks that must be performed simultaneously across the population of RGCs, and (2) that biological constraints, including the energy expenditure of maintaining a high spike rate, also likely drove the evolution of the different RGC types. Like bSbC RGCs, sSbC RGCs also signal exclusively with gaps in firing, but they do so on a substantially longer timescale [27, 51]. Perhaps for these RGCs, the energetic benefit of maintaining a lower baseline firing rate is worth the cost of lower temporal resolution in gap detection because they represent time more coarsely than bSbC RGCs. The OFF alpha RGCs (OFFsA and OFFtA) respond to different stimuli with either increases or decreases in baseline firing [52, 53], and OFFtA RGCs have been implicated in the detection of several specific visual features [54–56], so for these RGC types, performance in representing spike gaps needs to be balanced against metrics for their other functions.

### The information train as a way to track population-level changes in spike patterns

Our work has practical as well as theoretical implications: we proposed the information train readout, which tracks the information in a population over time and is more comprehensive and robust than a standard pPSTH readout. For example, if one uses a pPSTH to analyze a population of direction-selective cells without first finding the preferred direction of every cell in the population, the responses from cells preferring opposite directions will cancel each other out, yielding a pPSTH that fails to reflect the stimulus-driven change. The same effect occurs if a light-dark edge is presented to a population of ON or OFF cells. Of course, this cancellation can be removed by classifying each cell’s response to the stimulus first, but that can be time consuming for large populations and requires an (often arbitrary) supervised classification step. The information train reflects a stimulus change in both of these cases even when the whole population is analyzed together, because any out-of-the-ordinary ISI length (i.e., one different from the most probable) causes a positive deflection in the information train; whether a cell’s firing rate increases or decreases in response to a stimulus, the population information rises. The information train is therefore a convenient readout mechanism because it requires no assumptions to cluster cells, even when the population recording is heterogeneous. It is also easy to implement: one simply fits the ISI distributions observed in the absence of a stimulus with a gamma distribution (many neural circuits have ISI patterns well fit by a gamma [57]) or a sum of gammas, depending on whether the cells in question are bursty.

In fact, we believe that the information train nicely bridges a middle ground between the pPSTH and another standard readout mechanism: manifold analysis [58, 59]. The information train is more complete than a pPSTH but less complete than manifold analysis; only slightly more difficult to implement than a pPSTH and far easier to implement than manifold analysis; and as interpretable as a pPSTH and far more interpretable than manifold analysis. We suggest that the information train be used to determine whether a population-level change has occurred, and then manifold analysis can be used to investigate that change more thoroughly.

### Limitations and future directions

The information train is not a complete description of the full mutual information in the population, since it assumes independence. The natural extension of SI to a population is pointwise mutual information (PMI) [29], which measures the association of single events. Much as entropy is the expected value of SI, mutual information is the expected value of PMI. In the future, a more accurate way to construct the population information train would be to use PMI: the value of the information train at time *t* would be given by measuring the time since the last spike for each cell in the population and then calculating the PMI of the coincidence of those events. This is difficult to implement because it requires describing the joint ISI distributions. Not only is it necessary to estimate the joint ISI probabilities during the prestimulus time – when we can generate a lot of data but it is still difficult to estimate the joint probabilities – it is also necessary to accurately extrapolate those joint ISI distributions in order to deduce the stimulus response. Namely, we need to accurately estimate the very tails of the joint ISI distributions. We were able to do this in our study by fitting our observed ISI distributions with a sum of gammas distribution (**Fig S2**), but it could be dangerous to first estimate the joint ISI distributions and then estimate their tail values. Even small errors in the density estimation would lead to drastically different results, since PMI (much like SI) amplifies the significance of extremely improbable events. In other words, small differences in *p* lead to large differences in *log*_{2}(*p*) for low *p*. Calculating the true PMI of the population over time is an important future direction, but it must be done carefully.

Another limitation of this study is that we have not formally tested how sensitive the information train readout is to more complex tasks and stimuli. We posit that the information train should flexibly reflect any change in spiking patterns at the population level, since every observed ISI that differs from the most common one registers a change. However, we only compared the information train to the standard pPSTH for one task; we do not know whether the information train gives a more accurate or robust readout than a pPSTH for all stimuli.

There is potential follow-up research stemming directly from this study, such as constructing the population information train using PMI and exploring more complicated stimuli; beyond that, we hope the information train will be used to investigate more complex spiking statistics and other variables that affect information transmission. Moreover, the information train can be used to study other change detection tasks besides changes in spiking activity. For example, it could be applied to the problem of detecting auditory frequency change, where gaps are similarly important. This theoretical framework is general enough to be applied in a variety of research directions.

## Methods

### Model of spike generation

In the Poisson model of spike generation, the instantaneous firing rate is the only force that generates spikes. Assuming a constant firing rate *λ* over time, the number of spikes within an interval of time Δ*t* is a Poisson distributed random variable with parameter *λ*Δ*t*. Using the Poisson probability density function,

$$P(n \text{ spikes in } \Delta t) = \frac{(\lambda \Delta t)^{n} e^{-\lambda \Delta t}}{n!},$$

the probability of exactly one spike in a small bin of width Δ*t* is approximately *λ*Δ*t*. One can therefore generate a sequence of uniformly distributed random numbers {*x*_{n}} in order to determine when to generate a spike: if *x*_{i} < *λ*Δ*t*, generate a spike in time bin *i*. Since the number of spikes within any interval of time is a Poisson random variable, the exponential distribution *Exp*(*λ*) describes the time between spikes, i.e., the interspike interval distribution.
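The bin-by-bin rule above can be sketched as follows (a minimal illustration, not the authors' code; the 0.1 ms bin width and function name are our choices):

```python
import numpy as np

def poisson_spike_train(rate_hz, duration_s, dt=1e-4, rng=None):
    """Bernoulli-bin approximation of a Poisson process:
    spike in bin i iff x_i < rate * dt."""
    rng = np.random.default_rng(rng)
    x = rng.random(int(duration_s / dt))  # uniform sequence {x_n}
    return x < rate_hz * dt               # boolean spike train

spikes = poisson_spike_train(rate_hz=50, duration_s=10, rng=0)
print(spikes.sum() / 10)  # empirical rate, close to 50 Hz
```

With Δ*t* much smaller than the mean ISI, the chance of two spikes landing in one bin is negligible, so the Bernoulli approximation is accurate.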

A renewal process extends the Poisson process so that spiking depends on both the instantaneous firing rate and the time since the previous spike. One way to model this is to generate a Poisson spike train with rate parameter *λ* and delete all but every *κ*^{th} spike. A gamma distribution describes the wait time until the *κ*^{th} Poisson process event, so the ISI distribution of a renewal process generated in this way is *Gamma*(*κ*, *λ*). This is a natural extension of the Poisson process, since if *κ* = 1, the ISI distribution is *Gamma*(1, *λ*) = *Exp*(*λ*). The average ISI length is *κ*/*λ*, and therefore the average firing rate is *λ*/*κ*.
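The thinning construction can be sketched directly (our illustration; parameter values are arbitrary examples):

```python
import numpy as np

def gamma_renewal_spikes(kappa, lam, duration_s, rng=None):
    """Keep every kappa-th event of a rate-lam Poisson process;
    the resulting ISIs are Gamma(kappa, lam) distributed."""
    rng = np.random.default_rng(rng)
    # exponential ISIs of the underlying Poisson process (generous count)
    n = int(2 * lam * duration_s) + 100
    times = np.cumsum(rng.exponential(1 / lam, n))
    times = times[times < duration_s]
    return times[kappa - 1::kappa]  # every kappa-th event survives

t = gamma_renewal_spikes(kappa=4, lam=400, duration_s=20, rng=1)
isis = np.diff(t)
print(isis.mean())  # close to kappa/lam = 0.01 s
```

The mean ISI of the thinned train is *κ*/*λ*, matching the average firing rate *λ*/*κ* stated above.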

We extended the renewal process again, nesting one renewal process inside another. The outside renewal process with parameters *κ*_{1} (a shape) and *λ*_{1} (a rate) determines the number and placement of bursts, with an average of *λ*_{1}/*κ*_{1} bursts/second. We allowed *λ*_{1} to vary between 30 and 600, and *κ*_{1} to vary between 3 and 6, in order to obtain a range of 10-100 bursts/second. The inside renewal process with parameters *κ*_{2}, *λ*_{2} determines the number and placement of spikes within each burst, with an average of *λ*_{2}/*κ*_{2} spikes/second within each burst. We let *λ*_{2} range from 300 to 6,000 and *κ*_{2} range from 3 to 6, so the average spiking rate within a burst ranged from 100-1000 spikes/second. We inserted spikes generated by the inside process into bursts by setting a burst length *τ*^{b}, which we chose to be 10 ms. This results in an average of (*λ*_{2}/*κ*_{2})*τ*^{b} spikes/burst, with a range of 1-10 spikes/burst, and therefore an average firing rate

$$\lambda = \frac{\lambda_1}{\kappa_1} \cdot \frac{\lambda_2}{\kappa_2} \cdot \tau^{b} \tag{6}$$

with a range of 10-100 Hz.
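A minimal sketch of the nested construction, assuming both renewal processes start fresh at each burst (the function name and example parameters, chosen within the stated ranges, are ours):

```python
import numpy as np

def nested_renewal_train(k1, l1, k2, l2, tau_b, duration_s, rng=None):
    """Outer Gamma(k1, l1) renewal process places burst onsets;
    an inner Gamma(k2, l2) renewal process fills each burst of length tau_b."""
    rng = np.random.default_rng(rng)
    onsets = np.cumsum(rng.gamma(k1, 1 / l1, int(2 * l1 * duration_s / k1) + 100))
    onsets = onsets[onsets < duration_s]
    spikes = []
    for onset in onsets:
        # inner ISIs, cumulated from the burst onset, truncated at tau_b
        t = onset + np.cumsum(rng.gamma(k2, 1 / l2, int(4 * l2 * tau_b / k2) + 20))
        spikes.append(t[t < onset + tau_b])
    return np.concatenate(spikes)

# ~30 bursts/s, ~300 spikes/s within bursts -> ~3 spikes/burst, ~90 Hz overall
train = nested_renewal_train(k1=3, l1=90, k2=3, l2=900, tau_b=0.01, duration_s=30, rng=2)
print(len(train) / 30)  # empirical firing rate, in rough agreement with Eq (6)
```

The empirical rate sits slightly below the (*λ*_{1}/*κ*_{1})(*λ*_{2}/*κ*_{2})*τ*^{b} prediction because each inner process restarts at the burst onset rather than running at equilibrium.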

We varied pairwise correlations between nested renewal process spike trains using a shared-versus-random seed strategy. Just as for a Poisson process, we needed two sequences of uniformly distributed random numbers (for the inside and outside renewal processes), {*x*_{n}} and {*y*_{n}}, to determine when to generate spikes (see above). We first generated two shared sequences of uniform random numbers, {*x*^{s}_{n}} and {*y*^{s}_{n}}, then for each cell in the population generated new independent sequences of uniformly distributed numbers, {*x*^{r}_{n}} and {*y*^{r}_{n}}. For each cell, {*x*_{n}} was constructed by drawing each element from {*x*^{s}_{n}} with probability *α*_{x} and from {*x*^{r}_{n}} with probability 1 − *α*_{x}. The other sequence, {*y*_{n}}, was constructed similarly with weight *α*_{y}. The choices of *α*_{x} and *α*_{y} control the pairwise correlations of the inside and outside renewal processes. We measured cross correlations between all pairs of spike trains in the population and averaged.
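The mixing step for one of the two sequences can be sketched as follows (our illustration; under this scheme two cells share an element with probability *α*^{2}, so the correlation between their uniform sequences is *α*^{2}):

```python
import numpy as np

def mixed_uniforms(n_cells, n_samples, alpha, rng=None):
    """Each cell's uniform sequence draws from a shared sequence with
    probability alpha and from its own private sequence otherwise."""
    rng = np.random.default_rng(rng)
    shared = rng.random(n_samples)
    out = np.empty((n_cells, n_samples))
    for c in range(n_cells):
        private = rng.random(n_samples)
        use_shared = rng.random(n_samples) < alpha
        out[c] = np.where(use_shared, shared, private)
    return out

u = mixed_uniforms(n_cells=2, n_samples=200_000, alpha=0.5, rng=3)
print(np.corrcoef(u[0], u[1])[0, 1])  # close to alpha**2 = 0.25
```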

### Burstiness

**Definition**. *Burstiness β* is the average number of spikes per burst normalized by firing rate: *β* = (*λ*_{2}/*κ*_{2})*τ*^{b} / *λ*.

**Remark**. Using Eq (6), the formula for burstiness simplifies to *β* = *κ*_{1}/*λ*_{1}, the mean interval between bursts. The range of burstiness is 0.01-0.1 seconds/burst.

### Population readout mechanisms

Self information is defined as *SI* = −*log*_{2}(*p*) [29], where *p* is the ISI probability. We constructed a population information train by summing individual SI trains. This implicitly assumes that cells are independent, since if independence holds, then

$$-\log_2\left(\prod_i p_i\right) = \sum_i -\log_2(p_i) = \sum_i SI_i.$$
In order to measure reaction time and gap length in the information train, we set a threshold on the information train such that the error rate was 0.1 false crossings per second.
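A minimal sketch of a per-cell SI computation under a fitted gamma ISI model; the discretization of the density into probability mass over a small bin width is our assumption, not a detail specified in the text:

```python
import numpy as np
from scipy.stats import gamma as gamma_dist

def isi_self_information(isis, kappa, lam, dt=1e-3):
    """SI = -log2 p for each ISI, with p taken as the probability mass of
    the fitted Gamma(kappa, lam) ISI density in a dt-wide bin."""
    p = gamma_dist.pdf(isis, a=kappa, scale=1 / lam) * dt
    return -np.log2(np.clip(p, 1e-300, None))  # clip avoids log2(0)

rng = np.random.default_rng(4)
baseline_isis = rng.gamma(4, 1 / 400, 5000)            # baseline: Gamma(4, 400) ISIs
base_si = isi_self_information(baseline_isis, 4, 400)
gap_si = isi_self_information(np.array([0.5]), 4, 400)  # one long gap in firing
print(gap_si[0] > base_si.mean())  # a rare long ISI yields a large SI deflection
```

Summing such SI trains across cells (the independence assumption above) and thresholding the sum gives the reaction-time and gap-length readout described in the text.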

We set the threshold for the pPSTH readout mechanism at 0, and set the filter length such that the pPSTH reached 0 with a rate of 0.1 errors/second during the prestimulus time.

Population information train options B and C considered in **Fig S5** were constructed as follows:

Option B: The population information train is the SI train constructed from the ensemble spike train, or the spike train obtained by overlaying all the spike trains in the population.

Option C: The population information train was constructed by summing the SI trains for each cell in the population, but we let ISIs shorter than the most probable one have negative deflections in each SI train, while longer ISIs still had positive deflections.

The analogous threshold (resulting in 0.1 errors/s) was placed on the information trains in options B and C.

### Dimensional analysis

We used dimensional analysis to find the relationship between gap length and the time constant of recovery (**Fig 5A,B**). There are six relevant quantities to this problem: gap length *γ*, recovery time constant *τ*, burstiness *β*, firing rate *λ*, population size *N*, and correlation *α*. These are all either dimensionless or some transformation of “time”, so by the Buckingham Pi Theorem [47], we may construct 6 − 1 = 5 independent dimensionless groups related by an unknown function. We chose to make gap length, the recovery time constant, and burstiness dimensionless by multiplying by firing rate, so we have

$$\gamma\lambda = f(\tau\lambda,\ \beta\lambda,\ N,\ \alpha).$$
In order to obtain a function of one variable which relates gap length and the time constant, we set dimensionless burstiness, population size, and correlation to be constant and obtained Eq (3).

Once we defined a performance metric (NRMSE) for the gap length decoding task (**Fig 5C**), we used dimensional analysis again to find its dependence on burstiness and the other model parameters (**Fig 6A,B**). The relevant quantities are the performance metric NRMSE *ϵ*, burstiness *β*, firing rate *λ*, population size *N*, and correlation *α*, so by the Buckingham Pi Theorem [47], we may construct 5 − 1 = 4 independent dimensionless groups related by a function. We again chose to make burstiness dimensionless by multiplying by firing rate, so we have

$$\epsilon = f(\beta\lambda,\ N,\ \alpha).$$
Fixing population size and correlation so that the function no longer depends on them, we obtain Eq (4).

### Fitting routines

From dimensional analysis, we obtained Eq (4). Plotting *λβ* against *ϵ* revealed that *f* is an asymptotically decaying function (**Fig 6B**). We therefore let *f* be of the form

$$\epsilon = \frac{a}{(\lambda\beta)^{n}} + c. \tag{10}$$
The range of *n* obtained from this fitting routine was 0.5-4 with no systematic trends, so *n* was taken as 2, which results in essentially the same good fits and reduces the number of fitting parameters to two. We also considered an exponential fit, but ultimately rejected it because the fit was not as good as the fit given by Eq (10) with fixed *n*, and it requires three fitting parameters instead of two. Goodness of fit of Eq (10) with variable *n* as well as fixed *n*, and the exponential fit, were all assessed with root mean square error (RMSE) returned by the fitting routine.
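A sketch of the fixed-*n* fitting step on synthetic data standing in for the measured (*λβ*, *ϵ*) pairs; the power-law-plus-offset form with *n* = 2 and two free parameters follows the description above, and all numbers here are illustrative:

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(x, a, c):
    """Assumed decaying form with the exponent fixed at n = 2."""
    return a / x**2 + c

# synthetic (lambda*beta, epsilon) pairs with small noise, ground truth a=3, c=0.1
x = np.linspace(0.5, 10, 50)
y = 3.0 / x**2 + 0.1 + np.random.default_rng(5).normal(0, 0.01, x.size)
(a, c), _ = curve_fit(decay, x, y)
print(round(a, 1), round(c, 2))  # recovers roughly (3.0, 0.1)
```

With *n* fixed, the model is linear in (*a*, *c*), so the fit is fast and well conditioned; goodness of fit can be assessed with the RMSE of the residuals as in the text.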

We fit the exponential relationship between the dimensionless time constant and the dimensionless gap length (**Fig 5B,C**) by fitting a linear relationship of their logarithms. Goodness of fit was assessed with normalized root mean square error (NRMSE):

**Definition**. The NRMSE of an estimator *ŷ* with respect to observed values of a parameter *y* is the square root of the mean squared error, normalized by the mean of the measured data,

$$NRMSE = \frac{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}}{\bar{y}}.$$
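The definition translates directly into code (a one-line helper of our own naming):

```python
import numpy as np

def nrmse(y, y_hat):
    """Root mean squared error normalized by the mean of the observed data."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sqrt(np.mean((y - y_hat) ** 2)) / np.mean(y)

print(nrmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0 (perfect estimator)
print(nrmse([2.0, 2.0], [1.0, 3.0]))            # 0.5 (RMSE 1, mean 2)
```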

We fit ISI distributions of simulated (**Fig S1**) and experimentally recorded data (**Fig S2**) by minimizing the mean square error of the log of the data and log of a sum of gammas distribution, since there was a large dynamic range in the data. This effectively minimizes the percentage difference between the data and the fitting function. Goodness of fit was assessed with RMSE returned by the fitting routine.
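The log-space criterion can be sketched as follows, shown for a single gamma on synthetic ISIs for brevity (the paper fits a sum of gammas for bursty cells; binning, initial guess, and optimizer choice here are ours):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma as gamma_dist

def log_mse_gamma_fit(isis, bin_edges, init):
    """Fit Gamma(k, lam) to an ISI histogram by minimizing the MSE of the
    logs, which weights the low-probability tail like a percentage error."""
    counts, _ = np.histogram(isis, bins=bin_edges, density=True)
    centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    keep = counts > 0  # log of empty bins is undefined

    def loss(params):
        k, lam = params
        model = gamma_dist.pdf(centers[keep], a=k, scale=1.0 / lam)
        return np.mean((np.log(counts[keep]) - np.log(np.clip(model, 1e-300, None))) ** 2)

    return minimize(loss, init, method="Nelder-Mead").x

rng = np.random.default_rng(6)
isis = rng.gamma(4, 1.0 / 400, 20000)  # synthetic Gamma(4, 400) ISIs
k, lam = log_mse_gamma_fit(isis, np.linspace(1e-4, 0.04, 50), init=(3.0, 300.0))
print(k, lam)  # close to the generating parameters (4, 400)
```

Minimizing the squared log difference rather than the raw difference keeps the rare, long ISIs in the tail from being swamped by the high-probability bins, which matters because the SI readout depends precisely on those tail probabilities.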

### RGC spike recordings

RGCs were recorded in cell-attached configuration in whole-mount preparations of mouse retina as described previously [28, 34]. Cell typology was determined using the methods described in [25]. Baseline spiking was recorded at a mean luminance of 0 and 1000 isomerizations (R*) per rod per second.

## Data and code availability

All simulated data reported in this paper will be available from the GIN database upon manuscript acceptance. Experimental data reported will be shared by the lead contact, Greg Schwartz (greg.schwartz{at}northwestern.edu), upon request. All code written in support of this publication will be available from the GitHub repository upon manuscript acceptance.

## Supporting information

## Acknowledgements

We are grateful to all members of the Schwartz Lab for their feedback and support on this project. We would like to thank Stephanie Palmer and Sophia Weinbar for their comments on the manuscript, as well as Sara Solla and Ann Kennedy for their feedback on the project.