Spectrally specific temporal analyses of spike-train responses to complex sounds: A unifying framework

Significant scientific and translational questions remain in auditory neuroscience surrounding the neural correlates of perception. Relating perceptual and neural data collected from humans can be useful; however, human-based neural data are typically limited to evoked far-field responses, which lack anatomical and physiological specificity. Laboratory-controlled preclinical animal models offer the advantage of comparing single-unit and evoked responses from the same animals. This ability provides opportunities to develop invaluable insight into proper interpretations of evoked responses, which benefits both basic-science studies of neural mechanisms and translational applications, e.g., diagnostic development. However, these comparisons have been limited by a disconnect between the types of spectrotemporal analyses used with single-unit spike trains and evoked responses, which results because these response types are fundamentally different (point-process versus continuous-valued signals) even though the responses themselves are related. Here, we describe a unifying framework to study temporal coding of complex sounds that allows spike-train and evoked-response data to be analyzed and compared using the same advanced signal-processing techniques. The framework uses a set of peristimulus-time histograms computed from single-unit spike trains in response to polarity-alternating stimuli to allow advanced spectral analyses of both slow (envelope) and rapid (temporal fine structure) response components. Demonstrated benefits include: (1) novel spectrally specific temporal-coding measures that are less confounded by distortions due to hair-cell transduction, synaptic rectification, and neural stochasticity compared to previous metrics, e.g., the correlogram peak-height, (2) spectrally specific analyses of spike-train modulation coding (magnitude and phase), which can be directly compared to modern perceptually based models of speech intelligibility (e.g., that depend on modulation filter banks), and (3) superior spectral resolution in analyzing the neural representation of nonstationary sounds, such as speech and music. This unifying framework significantly expands the potential of preclinical animal models to advance our understanding of the physiological correlates of perceptual deficits in real-world listening following sensorineural hearing loss.

Normal-hearing listeners demonstrate excellent acuity while communicating in complex 2 environments. In contrast, hearing-impaired listeners often struggle in noisy situations, 3 even with state-of-the-art intervention strategies (e.g., digital hearing aids). In addition 4 to improving our understanding of the auditory system, the clinical outcomes of these 5 strategies can be improved by studying how the neural representation of complex 6 sounds relates to perception in normal and impaired hearing. Numerous 7 electrophysiological studies have explored the neural representation of perceptually 8 relevant sounds in humans using evoked far-field recordings, such as frequency following 9 responses (FFRs) and electroencephalograms (Tremblay et al., 2006;Clinard et al., 10 2010;Kraus et al., 2017). Note that we use electrophysiology and neurophysiology to 11 refer to evoked far-field responses and single-unit responses, respectively (See Table 2 12 for glossary). While these evoked responses are attractive because of their clinical 13 viability, they lack anatomical and physiological specificity. Moreover, the underlying 14 sensorineural hearing loss pathophysiology is typically uncertain in humans. In contrast, 15 laboratory-controlled animal models of various pathologies can provide specific neural 16 correlates of perceptual deficits that humans experience, and thus hold great scientific 17 and translational (e.g., pharmacological) potential. In order to synergize the benefits of 18 both these approaches to advance basic-science and translational applications to 19 real-world listening, two major limitations need to be addressed. 20 First, there exists a significant gap in relating spike-train data recorded invasively 21 from animals and evoked noninvasive far-field recordings feasible in humans (and 22 animals) because the two signals are fundamentally different in form (i.e., binary-valued 23 point-process data versus continuous-valued signals). While the continuous nature of Spectrogram and waveform of a speech segment (s 4 described in Materials and Methods). Formant trajectories (black lines in panel A) and short-term intensity (red line in panel B, computed over 20-ms windows with 80% overlap) vary with time, highlighting two nonstationary aspects of speech stimuli. (C) PSTH constructed using spike trains in response to a tone at the AN-fiber's characteristic frequency [CF, most-sensitive frequency; fiber had CF=730 Hz, and was high spontaneous rate or SR (Liberman, 1978)]. Tone intensity = 40 dB SPL. Even though the stimulus is stationary, the response is nonstationary (i.e., sharp onset followed by adaptation). (D) Period histogram, constructed from the data used in C, demonstrates the phase-locking ability of neurons to individual stimulus cycles. (E) PSTH constructed using spike trains in response to a sinusoidally amplitude-modulated (SAM) CF-tone (50-Hz modulation frequency, 0-dB modulation depth, 35 dB SPL) from an AN fiber (CF = 1.4 kHz, medium SR). (F) Period histogram (for one modulation period) constructed from the data used in E. The response to the SAM tone follows both the modulator (envelope, red, panels E and F) as well as the carrier (temporal fine structure), the rapid fluctuations in the signal (blue, panel F). Bin width = 0.5 ms for histograms in C-F. Number of stimulus repetitions for C and E were 300 and 16, respectively. adaption effects in the response (Kiang et al., 1965;Westerman and Smith, 1988). p(t) 136 has been applied to analyze spike trains recorded in response to periodic signals, both in 137 the temporal and spectral domains (Young and Sachs, 1979;Delgutte, 1980;Palmer 138 et al., 1986). A limitation of the p(t)-spectrum is that it is corrupted by harmonic 139 distortions due to the rectified nature of the PSTH response (Young and Sachs, 1979). 140 For example, the spectrum of a PSTH constructed using spike trains recorded from an 141 AN fiber in response to a tone (F c ) can show energy at F c as well as 2F c even though later section. Similar to the period histogram, the PSTH can also be used to derive 145 phase-locking metrics, such as V S and V S pp . These synchrony-based metrics have been 146 recently overshadowed by correlogram-based metrics, which are described next, since 147 the synchrony-based metrics are limited to periodic signals. In contrast, 148 correlogram-based approaches offer more general metrics to evaluate temporal coding of 149 both periodic and aperiodic stimuli in the ENV/TFS dichotomy. 150 Interspike-interval (ISI) based approaches 151 Interspike interval histogram analyses were developed to quantify the correlation 152 between two spike trains, either from the same neuron or from different 153 neurons (Hagiwara, 1954;Rodieck et al., 1962;Perkel et al., 1967a,b). Interspike 154 intervals between adjacent spikes (also called first-order intervals) within a stimulus 155 trial are used to construct per-trial estimates of the ISI histogram, which are then 156 averaged across trials to form the final first-order ISI histogram (Fig 2C). An 157 alternative to the first-order ISI histogram, called the all-order ISI histogram (or the 158 autocorrelogram), can be estimated in a similar way with the only difference being the 159 inclusion of intervals between all spikes within a trial (not only adjacent spikes) to 160 construct the histogram ( Fig 2E) (Rodieck, 1967;Møller, 1970). The autocorrelogram 161 has been used to study the temporal representation of stationary as well as 162 nonstationary stimuli (Bourk, 1976;Sinex and Geisler, 1981;Cariani and Delgutte, 163 1996a,b). While the autocorrelogram is attractive for its simplicity, it is confounded by 164 refractory effects (Figs 2E and 2F). In particular, since successive spikes within a single 165 trial cannot occur within the refractory period, the autocorrelogram shows an 166 artifactual absence of intervals for delays less than the refractory period ( Fig 2E). As a 167 result, the autocorrelogram spectrum is partly corrupted. 168 Joris and colleagues extended these ISI-based analyses to remove the confounds of 169 the refractory effects by including all-order interspike intervals across stimulus trials to 170 compute a shuffled correlogram (Louage et al., 2004). A shuffled correlogram computed 171 using spike trains in response to multiple repetitions of a single stimulus from a single 172 neuron is called the shuffled autocorrelogram (or the SAC, Fig 2G). Similarly, a shuffled 173 correlogram computed using spike trains from different neurons, or for different stimuli, 174 is called the shuffled cross-correlogram (or the SCC ). The use of across-trial all-order 175 ISIs provides substantially more smoothing than simple all-order ISIs because many 176 more intervals are included in the histogram (compare Fig 2E with Fig 2G,and Fig 2F 177 with Fig 2H). 178 In addition, both polarities of the stimulus can be used to separate out ENV and 179 TFS components from the response. Stimuli with alternating polarities share the same 180 envelope, but their phases (TFS) differ by a half-cycle at all frequencies. By averaging 181 shuffled autocorrelograms for both stimulus polarities and shuffled cross-correlograms 182 for opposite stimulus polarities, the polarity-tolerant (ENV) correlogram (called the 183 sumcor ) is obtained (Louage et al., 2004) (S4 Appendix). Similarly, the 184 polarity-sensitive (TFS) correlogram, the difcor, is estimated as the difference between 185 the average autocorrelogram for both stimulus polarities and the cross-correlogram for 186 opposite stimulus polarities (S4 Appendix). These functions have been preferred over 187 PSTH-based analyses for estimating correlation sequences and response spectra (Joris 188 et al., 2006;Cedolin and Delgutte, 2005;Rallapalli and Heinz, 2016). Shuffled 189 autocorrelograms have also been used to derive temporal metrics, such as the 190 correlogram peak-height and half-width, to quantify the strength and precision of 191 temporal coding in the response, respectively (Louage et al., 2004), including for 192 nonstationary stimuli (Sayles and Winter, 2008;Sayles et al., 2015;Paraouty et al., 193 2018). In addition, cross-correlograms have been used to develop metrics to quantify 194 ENV/TFS similarity between responses to different stimuli recorded from the same Both the first-order (C) and the all-order (E) ISI histograms show dips for intervals less than the refractory period (∼0.6 ms), with the corresponding spectra corrupted by these refractory effects. (G) The shuffled autocorrelogram. (H) DFT of G. The shuffled autocorrelogram is smoother compared to the other correlograms, which also leads to improved SNR in the spectrum at both the carrier and modulator frequencies. All these ISI histograms are corrupted by rectifier distortion at twice the carrier frequency (2F c ). Bin width = 50 µs for histograms in C, E, and G.
neuron [e.g., speech stimuli (Heinz and Swaminathan, 2009;Swaminathan and Heinz, 196 2012; Rallapalli and Heinz, 2016)], or between responses from different neurons (Joris 197 et al., 2006;Heinz et al., 2010;Swaminathan and Heinz, 2011). 198 Although correlogram-based analyses provide a rich set of temporal metrics, they 199 suffer from three major limitations. First, correlograms discard phase information in the 200 response. Response phase can convey important information, especially for complex A unified framework for quantifying temporal coding 213 based on alternating-polarity PSTHs (apPSTHs)

214
In this section, we first show that apPSTHs can be used to unify classic metrics, e.g.,

215
VS and correlograms, in a computationally efficient manner. Then, we show that 216 apPSTHs offer more precise spectral estimates compared to correlograms, and allow for 217 perceptually relevant analyses that are not possible with classic metrics.

219
Let us denote the PSTHs in response to the positive and negative polarities of a 220 stimulus as p(t) and n(t), respectively. Then, the sum PSTH, s(t), which represents the 221 polarity-tolerant component in the response, is estimated as The difference PSTH, d(t), which represents the polarity-sensitive component in the 223 response, is estimated as The difference PSTH has been previously described as the compound PSTH (Goblick and Pfeiffer, 1969). Here we use the terms sum and difference for s(t) and d(t), respectively, for simplicity. Compared to the spectra of the single-polarity PSTHs [i.e., of p(t) or n(t)], the spectrum of the difference PSTH, D(f ), is substantially less affected by rectifier distortion artifacts (Sinex and Geisler, 1983). This improvement occurs because even-order distortions, which strongly contribute to these artifacts, are effectively canceled out by subtracting PSTHs for opposite polarities. The Fourier magnitude spectrum of the difference PSTH has been referred to as the synchronized rate. We show that the synchronized rate relates to V S by where f is frequency in Hz, and N is the total number of spikes (S2 Appendix).

225
It can also be shown that the autocorrelogram and the autocorrelation function of 226 the PSTH are related (S3 Appendix). In particular the SAC for a set of M spike trains 227 X = {x 1 , x 2 , ..., x M } can be estimated as where R X is the autocorrelation operator, and P ST H X is the PSTH constructed using 229 X. Similarly, the SCC for two sets of spike trains X = {x 1 , x 2 , ..., x L } and 230 Y = {y 1 , y 2 , ..., y M } can be estimated as where P ST H X and P ST H Y are PSTHs constructed using X and Y , respectively, and 232 R X Y is the cross-correlation operator. Since SACs and SCCs can be computed using 233 apPSTHs, it follows that sumcor and difcor can also be computed using apPSTHs (S4 234 Appendix The PSTH is particularly attractive because the PSTH from single neurons or a 256 population of neurons, by virtue of being a continuous signal, can be directly compared 257 to evoked potentials in response to the same stimulus (e.g., Fig 3). In this example, the 258 speech sentence s 3 was used as the stimulus to record the frequency following response 259 (FFR) from one animal. The same stimulus was also used to record spike trains from 260 AN fibers (N=246) from 13 animals. The mean d(t) and mean s(t) were computed by 261 pooling PSTHs across all neurons. The difference and sum FFRs were estimated by 262 subtracting and averaging the FFRs to alternating polarities, respectively. This 263 approach of estimating polarity-tolerant and polarity-sensitive components from FFRs 264 is well established (Aiken and Picton, 2008;Shinn-Cunningham et al., 2013;265 Ananthakrishnan et al., 2016). Qualitatively, the periodicity information in the mean 266 d(t) and the difference FFR were similar ( Fig 3A); this is expected because the 267 difference FFR receives significant contributions from the auditory nerve (King et al., . To compare the spectra for the two responses, a 100-ms segment was considered. 269 The first formant (F 1 ) and the first few harmonics of the fundamental frequency (F 0 ) 270 were well captured in both spectra. F 2 was also well captured in the difference FFR, 271 and to a lesser extent, in the mean d(t).

272
The mean s(t) and the sum FFR also show comparable temporal features in these 273 nonstationary responses (Fig 3C) Temporal information in a signal can be studied not only in the time domain (e.g.,

291
using correlograms) but also in the frequency domain (e.g., using the power spectral 292 density, PSD). The frequency-domain representation often provides a compact 293 alternative compared to the time-domain counterpart. In the framework of spectral 294 estimation, the source ("true") spectrum, which is unknown, is regarded as a parameter 295 of a random process that is to be estimated from the available data (i.e., from examples 296 of the random process). Spectral estimation is complicated by two factors: (1)  approach optimally reduces the bias and variance of the PSD estimate (Thomson, 1982;308 Babadi and Brown, 2014). In this approach, for a given data length, a frequency 309 resolution is chosen, based on which a set of orthogonal tapers are computed. These 310 tapers include both even and odd tapers, which can be used to obtain the independent 311 PSD estimates to be averaged. In contrast, only even tapers can be used with 312 correlograms as they are even sequences (Oppenheim, 1999;Rangayyan, 2015).

313
Therefore, variance in the PSD estimate can be reduced by a factor of up to 2 by using 314 apPSTHs instead of correlograms. Lower spectral-estimation variance can be achieved using apPSTHs (with multiple tapers) compared with difcor correlograms. (A) Spectrum for the 100-ms segment in the speech sentence s3 (F 0 ∼ 98 Hz, F 1 ∼ 630 Hz) used for analysis. (B) Example spectra for an AN fiber (CF=900 Hz, high SR) with spikes from 25 randomly chosen repetitions per polarity. The first two discrete-prolate spheroidal sequences were used as tapers corresponding to a time-bandwidth product of 3 to estimate D(f ), the spectrum of d(t). No taper (i.e., rectangular window) was used to estimate the difcor spectrum. The AN fiber responded to the 6th, 7th and 8th harmonic of the fundamental frequency. (C) Error-bar plots for fractional power (P ower F rac ) at the frequency (green triangle) closest to the 6th harmonic. Error bars were computed for 12 randomly and independently drawn sets of 25 repetitions per polarity. The same spikes were used to compute the spectra for d(t) (blue) and difcor (red). (D) Diamonds denote the ratio of variances for the difcor -based estimate to the d(t)-based estimate. This ratio was greater than 1 (i.e., above the dashed gray line) for all units considered, which demonstrates that the variance for the multitaper-d(t) spectrum was lower than the difcor -spectrum variance. AN fibers with CFs between 0.3 and 2 kHz and with at least 75 repetitions per polarity of the stimulus were considered. Bin width = 0.1 ms for PSTHs. Sampling frequency = 10 kHz for FFRs. Stimulus intensity = 65 dB SPL.
For example, the benefit (in terms of spectral-estimation variance) of using the 316 multitaper spectrum of d(t), as opposed to the discrete Fourier transform (DFT) of the 317 difcor, can be quantified by comparing the two spectra at a single frequency (Fig 4). In 318 this example, a 100-ms segment of the s 3 speech stimulus was used as the analysis 319 window. The segment had an F 0 of 98 Hz and F 1 of 630 Hz ( Fig 4A). Fig Rallapalli and Heinz, 2016).

348
A major advantage of PSTH-based approaches over correlogram-based approaches is 349 that they can be used to extend a wider variety of acoustic SI models to include 350 neurophysiological data. In particular, correlograms can be used to extend 351 power-spectrum-based SI models (Kryter, 1962;Houtgast and Steeneken, 1973;Cooke, 352 2006;Taal et al., 2011;Jørgensen and Dau, 2011) but not for the more recent SI models 353 that require phase information of the response (Relaño-Iborra et al., 2016;Scheidiger 354 et al., 2018). For example, the speech envelope-power-spectrum model (sEPSM) has 355 been evaluated using simulated spike trains since sEPSM only requires power in the 356 response envelope, which can be estimated from the sumcor spectrum (Rallapalli and 357 Heinz, 2016). However, sumcor cannot be used to evaluate envelope-phase-based SI 358 models since it discards phase information. Studies have shown that the response phase 359 can be important for speech intelligibility (Delgutte et al., 1998;Paliwal and Alsteris, 360 2003). In contrast to the sumcor, the time-varying PSTH contains both phase and 361 magnitude information, and thus, can be used to evaluate both power-spectrum-and 362 phase-spectrum-based SI models. For example, because the PSTH p(t) [or n(t)] is 363 already rectified, it can be filtered through a modulation filterbank to estimate "internal 364 representations" in the modulation domain (Fig 5). These spike-train-derived "internal 365 representations" are analogous to those used in phase-spectrum-based SI 366 models (Relaño-Iborra et al., 2016;Scheidiger et al., 2018) and can be further processed 367 by existing SI back-ends to estimate SI values. In summary, apPSTHs can be used to 368 estimate complete (amplitude and phase) spectrally specific modulation-domain 369 representations using modulation filterbanks. These analyses allow for the evaluation of 370 a wider variety of acoustic-based SI models in the neural domain, where translationally 371 relevant data can be obtained from preclinical animal models of various forms of SNHL. 372 In this section, we first describe existing and novel ENV and TFS components that can 375 be derived from apPSTHs. Next, we compare the relative merits of the novel 376 components over existing ENV and TFS components using simulated AN fiber data.

377
Finally, we apply these apPSTHs to analyze spike-train data recorded in response to 378 speech and speech-like stimuli.

379
Several ENV and TFS components can be derived from 380 apPSTHs 381 The neural response envelope can be obtained from apPSTHs in two orthogonal ways: 382 (1) the low-frequency signal, s(t), and (2) the Hilbert envelope of the high-frequency 383 carrier-related energy in d(t). s(t) is thought to represent the polarity-tolerant response 384 component, which has been defined as the envelope response (Joris, 2003;Louage et al., 385 2004). For a stimulus with harmonic spectrum, s(t) captures the envelope related to the 386 beating between harmonics. In addition, onset and offset responses (e.g., in response to 387 high-frequency fricatives, Fig 3C) are also well captured in s(t). Although sumcor and 388 s(t) are related, dynamic features like onset and offset responses are captured in s(t), 389 but not in the sumcor since the sumcor discards phase information by essentially 390 averaging ENV coding across the whole stimulus duration. The use of sum envelope is 391 popular in far-field responses (Aiken and Picton, 2008;Shinn-Cunningham et al., 2013;392 Ananthakrishnan et al., 2016) but not directly in auditory neurophysiology studies. A 393 major disadvantage of s(t) is that it is affected by rectifier distortions if a neuron phase 394 locks to low-frequency energy in the stimulus (e.g., Fig 6A). We discuss this issue of 395 rectifier distortion in more detail later in the following section.

396
A second way envelope information in the neural response can be quantified is by computing the envelope of the difference PSTH, d(t). This envelope, e(t), can be estimated as the magnitude of the analytic signal, a(t), of the difference PSTH where a(t) = d(t) + H{d(t)}, and H{·} is the Hilbert transform operator. The factor 397 √ 2 normalizes for the power difference after applying the Hilbert transform. d(t) is 398 substantially less affected by rectifier distortion (Sinex and Geisler, 1983), and thus, so 399 is e(t). The use of e(t) parallels the procedure followed by many computational models 400 that extract envelopes from the output of cochlear filterbanks (Dubbelboer and 401 Houtgast, 2008;Jørgensen and Dau, 2011;Sadjadi and Hansen, 2011). The relative 402 merits of e(t) and s(t) to represent the response envelope is discussed in the following 403 section.

404
The TFS component can also be estimated in two ways: (1) d(t), and (2) the cosine of the Hilbert phase of d(t). The difference PSTH has been traditionally referred to as the TFS response because it is the polarity-sensitive component. difcor and metrics derived from it relate to d(t) as the difcor is related to the autocorrelation function of d(t) (S4 Appendix). However, d(t) does not represent the response to only the carrier (phase) since it also contains envelope information in e(t). We propose a novel representation of the TFS component in the response, φ(t), estimated as the cosine phase of the analytic signal where normalization by Relative merits of sum and Hilbert-envelope PSTHs in 407 representing spike-train envelope responses 408 The relative merits of the two envelope PSTHs, s(t) and e(t), were evaluated based on 409 simulated spike-train data generated using a computational model of AN 410 responses (Bruce et al., 2018). The model includes both cochlear-tuning and hair-cell 411 transduction nonlinearities in the auditory system. Responses were generated for 24 AN 412 fibers whose CFs were logarithmically spaced between 250 Hz and 8 kHz. Model 413 parameters are listed in S1 Table. For each simulated fiber, a SAM tone was used as the 414 stimulus. The SAM tone carrier (F c ) was placed at CF for each fiber, and was 415 modulated by a 20-Hz modulator (F m ) at 100% (i.e., 0-dB) modulation depth. The 416 intensity was ∼65 dB SPL for all CFs, with slight adjustments to maintain a consistent 417 discharge rate of 130 spikes/s (e.g., to account for the middle-ear transfer function that 418 is included in the model). The number of repetitions per stimulus polarity was set to 25, 419 which is typical of auditory-neurophysiology experiments.

420
Modulation spectra were estimated for s(t) and e(t) [denoted by S(f ) and E(f ), This filtering minimized the spectral energy in d(t) that was not stimulus related. The 424 two envelopes were evaluated based on their representations of the modulator and 425 rectifier distortion. Rectifier distortions are expected to occur at even multiples of the 426 carrier frequency and nearby sidebands (i.e., 2nF c , 2nF c − F m , and 2nF c + F m for 427 integers n, Fig 6A). It is desirable for an envelope metric to consistently represent 428 envelope coding across CFs and be less affected by rectifier-distortion artifacts.

429
Modulation coding for the simulated responses was quantified as the power in 10-Hz 430 bands centered at the first three harmonics of F m (i.e., 15 to 25 Hz, 35 to 45 Hz, and 55 431 to 65 Hz) for both s(t) and e(t) (Fig 6D). The need to include multiple harmonics of 432 F m arises because the response during a stimulus cycle departs from sinusoidal shape 433 due to the saturating nonlinearity associated with the inner-hair-cell transduction 434 process (S1 Fig). While F m -related power was nearly constant across CF for s(t), it was 435 nearly constant for e(t) only up to 1.2 kHz, after which it rolled off. This roll-off for e(t) 436 is not surprising since e(t) relies on phase-locking near the carrier and the sidebands, as 437 confirmed by the strong correspondence between tonal phase-locking at the carrier 438 frequency and F m -related power in e(t) (Fig 6D).

439
The analysis of rectifier distortion was limited to only the distortion components 440 near the second harmonic of the carrier (i.e., 2F c , 2F c − F m , and 2F c + F m ) since this 441 harmonic is substantially stronger than higher harmonics (e.g., Fig 6A). Rectifier ). Prior to estimating the peak-height, the sumcor is sometimes adjusted by adding 459 an inverted triangular window to compensate for its triangular shape (Heinz and 460 Swaminathan, 2009). Here, sumcors were compensated by subtracting a triangular 461 window from it so that the baseline sumcor is a flat function with a value of 0 (instead 462 of 1) in the absence of ENV coding. This baseline value of 0 for the sumcor is the same 463 as the baseline value for the difcor in the absence of TFS coding. In S5 Appendix, we 464 show that the sumcor peak-height is a broadband metric and it is related to the total Envelope-coding metrics should be spectrally specific to avoid artifacts due to rectifier distortion and neural stochasticity. Simulated responses for 24 AN fibers (log-spaced between 250 Hz and 8 kHz) were obtained using a computational model (see text) using SAM tones at CF (modulation frequency, F m =20 Hz; 0-dB modulation depth) as stimuli. Stimulus intensity ∼ 65 dB SPL. S(f ) (blue) and E(f ) (red) for three example model fibers with CFs = 1.0, 1.7, and 4 kHz (panels A-C) illustrate the relative merits of s(t) and e(t), and the potential for rectifier distortion to corrupt envelope coding metrics. d(t) was band-limited to a 200-Hz band near F c for each AN fiber prior to estimating e(t) from the Hilbert transform of d(t). (A) For the 1-kHz fiber, S(f ) and E(f ) are nearly identical in the F m band. S(f ) is substantially affected by rectifier distortion at 2×CF, which can be ignored using spectrally specific analyses. (B) The two envelope spectra are largely similar near the F m bands since phase-locking near the carrier (1.7 kHz) is still strong (panel D). Rectifier distortion in S(f ) is greatly reduced since phase-locking at twice the carrier frequency (3.4 kHz = 2×1.7 kHz) is weak. (C) F m -related power in E(f ) and rectifier distortion in S(f ) are greatly reduced as the frequencies for the carrier and twice the carrier are both above the phase-locking roll-off. (D) The strength of modulation coding was evaluated as the sum of the power near the first three harmonics of F m (gray boxes in panels A-C) for S(f ) (blue squares) and E(f ) (red circles). V S pp was also quantified to CF-tones for each AN fiber (black dashed line, right Y axis). (E) Rectifier distortion (RD) analysis was limited to the second harmonic of the carrier (brown boxes in panels A-C). RD was quantified as the sum of power in 10-Hz bands around twice the carrier frequency (2 × CF ) and the adjacent sidebands (2 × CF ± F m ). RD for E(f ) is not shown because E(f ) was virtually free from RD. (F) Raw and adjusted sumcor peak-heights across CFs. sumcors were adjusted by band-pass filtering them in the three F m -related bands. Large differences between the two metrics at low frequencies indicate that the raw sumcor peak-heights are corrupted by rectifier distortion at these frequencies. (G) Relation between raw and adjusted sumcor peak-heights with F m -related power (from panel D) in S(f ). Good correspondence between F m -related power in S(f ) and adjusted sumcor peak-height supports the use of spectrally specific envelope analyses. (H) The difference between raw and adjusted sumcor peak-heights was largely accounted for by RD power. However, this difference was always greater than zero, suggesting broadband metrics can also be biased because of noise related to neural stochasticity.
rectifier distortion at 2×CF. Here, we generalize this issue by comparing the sumcor 473 and spectrally specific ENV metrics for narrowband SAM-tone stimuli to demonstrate 474 the limitations of any broadband ENV metric. sumcors were adjusted by band-limiting 475 them to 10-Hz bands near the first three harmonics of F m . As expected, the difference 476 between the raw and adjusted sumcor peak-heights was large at low CFs (Fig 6F), 477 where rectifier distortion corrupts the broadband sumcor peak-height metric. At high 478 CFs (above 1.5 kHz), the difference between raw and adjusted sumcor peak-heights was 479 small but nonzero. These differences correspond to power in S(f ) at frequencies other 480 than the modulation-related bands and reflect the artifacts of neural stochasticity due 481 to finite number of stimulus trials. As power is always nonnegative, including power at 482 frequencies unrelated to the target frequencies adds bias and variance to any broadband 483 metric. The adjusted sumcor peak-height, unlike the raw sumcor peak-height, showed 484 good agreement with spectrally specific F m -related power in S(f ) (Fig 6G).

485
Overall, these results support the use of spectrally specific analyses to quantify ENV 486 coding in order to minimize artifacts due to rectifier distortion as well as the effects of 487 neural stochasticity. Of the two candidate apPSTHs to quantify response envelope, e(t) 488 had the benefit of minimizing rectifier distortion. However, e(t)'s reliance on 489 carrier-related phase locking limits the use of e(t) as a unifying ENV metric across the 490 whole range of CFs. Instead, spectrally specific s(t) is more attractive because of its 491 robustness in representing the response envelope across CFs ( Fig 6D). In order to evaluate the relative merits of d(t) and φ(t) in representing the neural 495 response TFS, the same set of simulated AN spike-train responses were used as in Fig 6. 496 The stimulus, a CF-centered SAM-tone, has energy at the carrier frequency (F c ) and 497 the sidebands (F c ± F m ). The sidebands are 6 dB lower in amplitude compared to the 498 carrier amplitude for the 100% modulated stimulus. Although the stimulus has power at 499 the carrier and the sidebands, only the carrier representation should be considered 500 towards quantifying the response TFS because the energy at the sidebands arises due to 501 the modulation of the carrier by the modulator (ENV). As the carrier has energy at a 502 single frequency (F c ) for a SAM tone, the desirable TFS response should have 503 maximum energy concentrated at the carrier frequency and not at the sidebands.

504
Therefore, the merits of d(t) and φ(t) were evaluated based on how well they capture 505 the carrier and suppress the sidebands (Fig 7).

506
As mentioned previously, d(t) was band-limited to a 200-Hz bandwidth near the 507 carrier frequency before estimating φ(t). D(f ) at low CFs contained substantial energy 508 at both the carrier and the sidebands (Figs 7A and 7B). This indicates that d(t) 509 represents the complete neural coding of the SAM tone (both the envelope and the 510 carrier) and not just the carrier. Furthermore, D(f ) has additional sidebands 511 (F c ± 2F m ) around the carrier frequency. These sidebands arise as a result of the 512 saturating nonlinearity associated with inner-hair-cell transduction (S1 Fig), and thus, 513 should not be considered towards TFS response. In contrast, Φ(f ), the spectrum of φ(t) 514 had most of its power concentrated at the carrier frequency, with substantially less 515 power in the sidebands (Figs 7A and 7B). These results were consistent across a wide 516 range of CFs and for both sidebands (Figs 7D and 7E). Overall, these results show that 517 φ(t) is a better PSTH compared to d(t) in quantifying the response TFS since φ(t) 518 emphasizes power at the carrier frequency and not at the sidebands.

519
In the following, we apply apPSTH -based analyses on spike-train data recorded from 520 chinchilla AN fibers in response to speech and speech-like stimuli. In these examples, we 521 particularly focus on certain ENV features, such as pitch coding for vowels and response 522 The neuron's CF was close to the first stimulus formant ( Fig 8B). P (f ) shows a strong 528 response to the 6th harmonic (first formant). In addition, there is substantial energy 529 near 1200 Hz, the frequency corresponding to the second formant. However, this peak 530 near 1200 Hz results from rectifier distortion from the first formant and not in response 531 to second formant itself; this is confirmed by D(f ) (Fig 8D), which shows a clear peak 532 at the 6th harmonic and little energy near the 12th harmonic. Similar to P (f ), S(f ) 533 shows substantial energy at 2F 1 due to rectifier distortion at twice the TFS (F 1 ) In panels B-E, the frequency-threshold tuning curve (TC θ, black) of the neuron is plotted on the right y-axis. P (f ) and S(f ) are corrupted by rectifier distortion at 2F 1 frequency. The response primarily reflects TFS-based F 1 coding (E) and little envelope coding (D), which is consistent with the "synchrony capture" phenomenon for stationary vowel coding. Fig 8E] where the spectrum shows substantial energy only at the 6th harmonic of the 538 fundamental frequency. These results are consistent with the previously reported 539 phenomenon of "synchrony-capture" in the neural coding of stationary vowels (Young 540 and Sachs, 1979;Delgutte and Kiang, 1984a). In "synchrony-capture", the response of a 541 neuron with CF near a formant is dominated by the harmonic of the fundamental 542 frequency closest to the formant. As the response primarily follows a single sinusoid, the 543 Hilbert-envelope PSTH, e(t), is essentially flat across the vowel duration and has little 544 energy other than at 0 Hz [not shown for ease of visualization of Φ(f )]. As a result, 545 there is little difference between φ(t) and d(t) for this stationary vowel.

546
Neural characterization of ENV and TFS using apPSTHs for a 547 natural speech segment

548
Most previous studies have used the period histogram to study speech coding in the 549 spectral domain (Young and Sachs, 1979;Delgutte and Kiang, 1984a). The period 550 histogram is limited to stationary periodic stimuli, which were employed in those 551 studies. In contrast, the use of apPSTHs facilitates the spectral analysis of neural 552 responses to natural speech stimuli, which need not be stationary (Fig 9). In this 553 example, the response of a low-frequency AN fiber to a 100-ms vowel segment of the s 3 554 natural speech sentence was considered. The CF (1.1 kHz) of this neuron is close to the 555 second formant (F 2 ) of this segment ( Fig 9B). P (f ) shows peaks corresponding to F 2 556 (∼1.2 kHz) and F 0 (∼130 Hz, Fig 9C). Similar to Fig 8,   Onset envelope is well represented in the sum PSTH but not in 568 the Hilbert-envelope PSTH 569 In addition to analyzing spectral features, apPSTHs can also be used to analyze 570 temporal features in the neural response. An example temporal feature is the onset 571 envelope, which has been shown to be important for the neural coding of 572 consonants (Delgutte, 1980;Heil, 2003), in particular fricatives (Delgutte and Kiang,573 1984b). A diminished onset envelope in the peripheral representation of consonants has 574 been hypothesized to be a contributing factor for perceptual deficits experienced by 575 hearing-impaired listeners (Allen and Li, 2009), and thus is an important feature to 576 quantify. Fig 10 shows example onset responses for a high-frequency AN fiber (CF= 5.8 577 kHz, SR= 70 spikes/s) for a fricative (/s/) portion of the speech stimulus s 3 . The onset 578 is well captured in single-polarity PSTHs [p(t) and n(t), Fig 10A] and in the sum 579 envelope [s(t), Fig 10B]. Since the onset is a polarity-tolerant feature, it is greatly 580 reduced by subtracting the PSTHs to opposite polarities. As a results, the response 581 onset is poorly captured in d(t) (Fig 10C) and its Hilbert envelope, e(t) (Fig 10D).  Overall, these examples show that apPSTHs can be used to study various spectral 583 and temporal features in the neural response for natural and synthesized stimuli in the 584 ENV/TFS dichotomy. These apPSTHs are summarized in Table 1.

TFS
Small Carrier TFS (subject to TFS phase locking) We define apPSTHs as the collection of PSTHs derived using both polarities of the stimulus. The pair of PSTHs, p(t) and n(t), is a sufficient statistic for apPSTHs since all other PSTHs in the group can be derived from the two. Alternatively, the pair, d(t) and s(t), is also a sufficient statistic for apPSTHs. Each PSTH (e.g., the positive polarity PSTH) can be expressed in the time domain [p(t)] or in the frequency domain [P (f )].
Quantifying ENV and TFS using apPSTHs for 586 nonstationary signals

587
In the discussion so far, we have argued for using spectrally specific metrics to analyze 588 neural responses to stationary stimuli. Another example where spectral specificity is 589 needed is in evaluating the neural coding of nonstationary speech features (e.g., formant 590 transitions). Speech is a nonstationary signal and conveys substantial information in its 591 dynamic spectral trajectories (e.g., Fig 1A). A number of studies have investigated the 592 robustness of the neural representation of dynamic spectral trajectories using frequency 593 glides and frequency-modulated tones as the stimulus (Krishnan and Parkinson, 2000; 594 Skoe and Kraus, 2010;Clinard and Cotter, 2015;Billings et al., 2019). These studies 595 have usually employed a spectrogram analysis. While a spectrogram is effective for 596 analyzing responses to nonstationary signals with unknown parameters, it does not 597 explicitly incorporate information about the stimulus, which is often designed by the 598 experimenter. Since the spectrogram relies on a narrow moving temporal window, it 599 offers poor spectral resolution due to the time-frequency uncertainty principle. Instead 600 of using conventional spectrogram analyses, frequency demodulation and filtering can 601 be used together to estimate power along a spectrotemporal trajectory more accurately 602 as we describe below. These spectrally specific analyses will facilitate more sensitive 603 metrics to investigate the coding differences between nonstationary features in natural 604 speech and extensively studied stationary features in synthetic stationary speech.

605
Frequency-demodution-based spectrotemporal filtering 606 First, we describe the spectrotemporal filtering technique using an example stimulus 607 with dynamic spectral components (Fig 11). The 2-second long stimulus consists of 608 three spectrotemporal trajectories: (1) a stationary tone at 1.4 kHz, (2) a stationary 609 tone at 2 kHz, and (3) a dynamic linear chirp that moves from 400 to 800 Hz over the 610 stimulus duration. We are interested in estimating the power of the nonstationary 611 component, the linear chirp. In order to estimate the power of this chirp, conventional 612 spectrograms will employ one of the following two approaches. First, one can use a long 613 window (e.g., 2 seconds) and compute power over the 400-Hz bandwidth from 400 to 614 800 Hz. In the second approach, one can use moving windows that are shorter in 615 duration (e.g., 50 ms) and compute power with a resolution of 30 Hz (20-Hz imposed by 616 inverse of the window duration and 10-Hz imposed by change in chirp frequency over 50 617 ms). As an alternative to these conventional approaches, one can demodulate the 618 spectral trajectory of the linear chirp so that the chirp is demodulated to near 0 Hz (Fig 619  11C and 11D, see Materials and Methods). Then, a low-pass filter with 0.5-Hz The energy related to the target spectrotemporal trajectory is spread over a wide frequency range (400 to 800 Hz, red line). (C) The spectrogram of the frequency-demodulated signal, where the target trajectory was used for demodulation (i.e., shifted down to 0 Hz). (D) The magnitude-DFT of the frequency-demodulated signal. The desired trajectory is now centered at 0 Hz, with its (spectral) energy spread limited only by the signal duration (i.e., equal to the inverse of the signal duration), and hence, is much narrower.
July 13, 2020 23/51 bandwidth (as determined by the reciprocal of the 2-s stimulus duration) can be 621 employed to estimate the time-varying power along the chirp trajectory. This 622 time-varying power is estimated at the stimulus sampling rate, similar to the temporal 623 sampling of the output of a band-pass filter applied on a stationary signals. While the 624 same temporal sampling can be achieved using the spectrogram by sliding the window 625 by one sample and estimating the chirp-related power for each window, it will be 626 computationally much more expensive compared to the frequency-demodulation-based 627 approach. Furthermore, the spectral resolution of 0.5 Hz is the same as that for a 628 stationary signal, which demonstrates a 60-fold improvement compared to the 50-ms 629 window-based spectrogram approach.

630
The harmonicgram for synthesized nonstationary speech harmonic numbers, F 1 (t)/F 0 (t) and F 2 (t)/F 0 (t), which are known a priori in this case. 649 Two harmonicgrams were constructed using responses from two AN fiber pools: (1) AN 650 fibers that had a low CF (CF < 1 kHz), and (2) AN fibers that had a medium CF (1 651 kHz < CF < 2.5 kHz). Previous neurophysiological studies have shown that AN fibers 652 with CF near and slightly above a formant strongly synchronize to that formant, 653 especially at moderate to high intensities (Young and Sachs, 1979;Delgutte and Kiang, 654 1984a). Therefore, the low-CF pool was expected to capture F 1 , which changed from 655 630 Hz to 570 Hz. Similarly, the medium-CF pool was expected to capture F 2 , which 656 changed from 1200 Hz to 1500 Hz. The harmonicgram for each pool was constructed by 657 using the average Hilbert-phase PSTH, φ(t), of all AN fibers in the pool. The 658 harmonicgram is shown from 38 ms to 188 ms to optimize the dynamic range to visually 659 highlight the formant transitions by ignoring the onset response. The dominant 660 component in the neural response for F 1 was expected at the harmonic number closest 661 to F 1 /F 0 . For this stimulus, F 1 /F 0 started at a value of 6.3 (630/100) and reached 4.75 662 (570/120) at 188 ms crossing 5.5 at 88.5 ms (Fig 12A). This transition of F 1 /F 0 was 663 faithfully represented in the harmonicgram where the dominant response switched from 664 the 6th to the 5th harmonic near 90 ms. Similarly, F 2 /F 0 started at 12, consistent with 665 the dominant response at the 12th harmonic before 100 ms (Fig 12B). Towards the end 666 of the stimulus, F 2 /F 0 reached 12.5, which is consistent with the near-equal power in   , 1999). A 20-Hz bandwidth was employed to low-pass filter the demodulated signal for each harmonic. The harmonicgram for each AN-fiber pool was constructed by averaging the Hilbert-phase PSTHs of all AN fibers within the pool. PSTH bin width = 50 µs. Data are from one chinchilla. The black, purple and, red lines represent the fundamental frequency (F 0 /F 0 ), the first formant (F 1 /F 0 ) and the second formant (F 2 /F 0 ) contours, respectively. The time-varying formant frequencies were normalized by the time-varying F 0 to convert the spectrotemporal representation into a harmonicgram.

The harmonicgram for natural speech 673
The harmonicgram analysis is not limited to synthesized vowels, but it can also be 674 applied to natural speech (Fig 13). These harmonicgrams were constructed for the 675 natural speech stimulus, s 3 , using average φ(t) for the same low-CF and medium-CF 676 AN fiber pools that were used in Fig 12. Here, we consider a 500-ms segment of the 677 stimulus, which contains multiple phonemes. Qualitatively, similar to Fig 12,  sum of power in the three harmonics closest to the F 0 -normalized formant frequency at 691 that time [e.g., F 1 (t)/F 0 (t) for the first formant]. As expected, power for the harmonics 692 near the first formant was substantially greater than that for the second formant for the 693 Fig 13. The harmonicgram can be used to quantify the coding of time-varying stimulus features at superior spectrotemporal resolution compared to the spectrogram. Harmonicgrams were constructed using φ(t) for the same two AN-fiber pools described in Fig 12. PSTH bin width = 50 µs. A 9-Hz bandwidth was employed to low-pass filter the demodulated signal for each harmonic. The data were collected from one chinchilla in response to the speech stimulus, s 3 . Stimulus intensity = 65 dB SPL. A 500-ms segment corresponding to the voiced phrase "amle" was considered. (A, B) Spectrograms constructed from the average φ(t) for the low-CF pool (A) and from the medium-CF pool (B). (C, D) Average harmonicgrams for the same set of fibers as in A and B, respectively. Warm (cool) colors represent regions of high (low) power. The first-formant contour (F 1 in A and B, F 1 /F 0 in C and D) is highlighted in purple. The second-formant contour (F 2 in A and B, F 2 /F 0 in C and D) is highlighted in red. Trajectories of the fundamental frequency (black in A and B, right Y axis) and the formants were obtained using Praat (Boersma, 2001). (E, F) Harmonicgram power near the first formant (purple) and the second formant (red) for the low-CF pool (E) and the medium-CF pool (F). Harmonicgram power for each formant at any given time (t) was computed by summing the power in the three closest F 0 harmonics adjacent to the normalized formant contour [e.g., F 1 (t)/F 0 (t)] at that time. The noise floor (NF) for power was estimated as the sum of power for the 29th, 30th, and 31st harmonics of F 0 because the frequencies corresponding to these harmonics were well above the CFs of both fibers pools. These time-varying harmonicgram power metrics are spectrally specific to F 0 harmonics and are computed with high temporal sampling rate (same as the original signal). This spectrotemporal resolution is much better than the spectrotemporal resolution that can be obtained using spectrograms.
low-CF pool (Fig 13E). For the medium-CF pool, F 2 representation was robust over the 694 whole stimulus duration, although F 1 representation was largely comparable (Fig 13F). 695 July 13, 2020 26/51 These examples demonstrate novel analyses using the apPSTH -based harmonicgram to 696 quantify time-varying stimulus features in single-unit neural responses at high 697 spectrotemporal resolution that are not possible with conventional windowing-based 698 approaches.

699
The harmonicgram can also be used to analyze FFRs in 700 response to natural speech 701 As mentioned earlier, a major benefit of using apPSTHs to analyze spike trains is that 702 the same analyses can also be applied to evoked far-field potentials. In Fig 14, the 703 harmonicgram analysis was applied to the difference FFR recorded in response to the 704 same speech sentence (s 3 ) that was used in Fig 13. In fact, these FFR data and 705 spike-train data used in Fig 13 were  these results reinforce the idea that using apPSTHs to analyze spike trains offers the 719 same spectrally specific analyses that can be applied to evoked far-field potentials, e.g., 720 the FFR, thus allowing a unifying framework to study temporal coding for both 721 stationary and nonstationary signals in the auditory system. are different (i.e., continuous-valued waveforms versus point-process spike trains). The 731 present report provides a unifying framework for analyzing spike trains using apPSTHs, 732 which offers numerous benefits over previous neurophysiological analyses. Specifically, 733 the use of apPSTHs incorporates many of the previous ad-hoc approaches, such as VS 734 and correlograms (Eqs 3 to 5). In fact, correlograms and metrics derived from them can 735 be estimated using apPSTHs in a computationally efficient way. The apPSTHs 736 essentially convert the naturally rectified neurophysiological point-process data into a 737 continuous-valued signal, which allows advanced signal processing tools designed for 738 continuous-valued signals to be applied to spike-train data. For example, apPSTHs can 739 be used to derive spectrally specific TFS components [e.g., φ(t) , Fig 7], multitaper 740 spectra (Fig 4), modulation-domain representations (Fig 5), and harmonicgrams ( 12 and 13). apPSTHs can also be directly compared to evoked far-field responses for 742 both stationary and nonstationary stimuli (e.g., Figs 13 and 14).

743
Temporal coding metrics should be spectrally specific

744
The various analyses explored in this report advocate for spectral specificity of temporal 745 coding metrics in a variety of ways. The need for spectrally specific analyses arises for 746 two reasons: (1) neural data is finite and inherently stochastic, and (2) spike-train data 747 are rectified. Neural stochasticity exacerbates the spectral-estimate variance at all 748 frequencies; therefore, time-domain (equivalently broadband) metrics will be noisier 749 compared to narrowband metrics. Similarly, the rectified nature of spike-train data can 750 introduce harmonic distortions in the response spectrum, which can corrupt broadband 751 metrics (e.g., TFS distortion at two times the carrier frequency corrupting estimates of 752 ENV coding, Figs 6A and 6B).

753
These issues requiring spectral specificity are not unique to the apPSTH analyses 754 but also apply to classic metrics, e.g., correlograms. For example, the broadband 755 correlation index (CI) metric is appropriate to analyze responses of neurons with high 756 CFs, but the CI metric is corrupted by rectifier distortions for neurons with low 757 CFs (Joris et al., 2006;Heinz and Swaminathan, 2009). Studies have previously tried to 758 avoid these distortions in the sumcor by restricting the response bandwidth to below 759 the CF because, for a given filter, the envelope bandwidth cannot be greater than the 760 filter bandwidth (Heinz and Swaminathan, 2009;Kale and Heinz, 2010).

761
Here, we have extended and generalized the analysis of these issues using 762 narrowband stimuli. In particular, when a neuron responds to low-frequency stimulus 763 energy that is below half the phase-locking cutoff, responses that contain any 764 polarity-tolerant component [e.g., p(t), n(t), s(t), SAC, and sumcor ] will be corrupted 765 by rectifier distortion of the polarity-sensitive component (Fig 6E). Any broadband 766 metric of temporal coding should exclude these distortions at twice the carrier 767 frequency. Beyond avoiding rectifier distortion, limiting the bandwidth of a metric to 768 only the desired bands will lead to more precise analyses by minimizing the effects of 769 neural stochasticity (Fig 6H). For example, envelope coding metrics for SAM-tone 770 stimuli should consider the spectrum power only at F m and its harmonics (Vasilkov and 771 Verhulst, 2019), rather than the simple approach of always low-pass filtering at 772 CF (Heinz and Swaminathan, 2009).

773
Similar to envelope-based metrics, metrics that quantify TFS coding should also be 774 spectrally specific to the carrier frequency. Previous metrics of TFS coding, such as d(t) 775 and difcor, are not specific to the carrier frequency but rather include modulation 776 sidebands as well as additional sidebands due to transduction nonlinearities (Fig 7). In 777 contrast, φ(t) introduced here emphasizes the carrier and suppresses the sidebands (Fig 778  7). Thus, the spectrally specific φ(t) is a better TFS response, which relates to the 779 zero-crossing signal used in the signal processing literature (Voelcker, 1966;Logan, 1977;780 Wiley, 1981).

781
Spectral-estimation benefits of using apPSTHs 782 Neurophysiological studies have usually favored the DFT to estimate the response 783 spectrum. For example, the DFT has been applied to the period histogram (Young and 784 Sachs, 1979;Delgutte and Kiang, 1984a), the single-polarity PSTH (Miller and Sachs, 785 1983;Carney and Geisler, 1986), the difference PSTH (Sinex and Geisler, 1983), and 786 correlograms (Louage et al., 2004). Since spike-train data are stochastic and usually 787 sparse and finite, there is great scope for spectral estimates, including the DFT 788 spectrum, to suffer from bias and variance issues. The multitaper approach optimally 789 uses the available data to minimize the bias and variance of the spectral 790 estimate (Thomson, 1982;Percival and Walden, 1993;Babadi and Brown, 2014). The 791 multitaper approach can be used with both apPSTHs and correlograms, but using 792 apPSTHs offers additional variance improvement up to a factor of 2 (Fig 4). This 793 improvement is because twice as many tapers (both odd and even) can be used with an 794 apPSTH compared to a correlogram, which is an even sequence and limits analyses to 795 only using even tapers.

796
apPSTHs allow animal models of sensorineural hearing loss to 797 be linked to psychophysical speech-intelligibility models 798 Speech-intelligibility models not only improve our understanding of perceptually 799 relevant speech features, they can also be used to optimize hearing-aid and 800 cochlear-implant strategies. However, existing SI models work well for normal-hearing 801 listeners but have not been widely extended for hearing-impaired listeners. This gap is 802 largely because of the fact that most SI models are based on signal-processing 803 algorithms in the acoustic domain, where individual differences in the physiological 804 effects of various forms of sensorineural hearing loss on speech coding are difficult to 805 evaluate. This gap can be addressed by extending acoustic SI models to the neural 806 spike-train domain. In particular, spike-train data obtained from preclinical animal 807 models of sensorineural hearing loss can be used to explore the neural correlates of 808 perceptual deficits faced by hearing-impaired listeners (Trevino et al., 2019). These 809 insights will be crucial for developing accurate SI models for hearing-impaired listeners. 810 apPSTHs offer a straightforward means to study various speech features in the representations so that envelope-based SI models can be evaluated in the neural domain 820 (Fig 5). Similarly, the Hilbert-phase PSTH, φ(t), can be used to evaluate the neural 821 representation of TFS features. These TFS results will be particularly insightful for 822 cochlear-implant stimulation strategies that rely on the zero-crossing component of the 823 stimulus, which closely relates to φ(t) (Grayden et al., 2004;Chen and Zhang, 2011).

824
Benefits of spectrotemporal filtering 825 Analysis of neural responses to nonstationary signals has been traditionally carried out 826 using windowing-based approaches, such as the spectrogram. Shorter windows help with 827 tracking rapid temporal structures, but they offer poorer spectral resolution. On the 828 other hand, larger windows allow better spectral resolution at the cost of smearing rapid 829 dynamic features. As an alternative to windowing-based approaches, spectrotemporal 830 filtering can improve the spectral resolution of analyses by taking advantage of stimulus 831 parameters that are known a priori (Fig 11). This approach is particularly efficient to 832 analyze spectrally sparse signals (i.e., signals with instantaneous line spectra, such as 833 voiced speech). In particular, the spectral resolution is substantially improved compared 834 to the spectrogram. In addition, while the same temporal sampling can be obtained 835 using the spectrogram, it will be much more computationally expensive compared to the 836 spectrotemporal filtering approach, as discussed in the following example.

837
Consider the signal in Fig 11. The neural response for this signal will have noise over 838 the whole response bandwidth due to neural stochasticity, in addition to the 839 representation of the signal components. Let us assume that we are interested in 840 comparing the coding of the 1400-Hz stationary tone and the linear chirp (400 to 800 841 Hz sweep over 2 seconds). The coding of these signals can be quantified by estimating 842 the power of these components in the response. For the stationary tone, power can be 843 estimated over a 0.5-Hz band around 1400 Hz. For the chirp, one of the following 844 approaches can be employed. Power can be estimated in a long temporal window, which 845 improves the spectral resolution of the analysis. However, estimating power for the chirp 846 will require a 400 Hz spectral window, which is determined by the chirp bandwidth.

847
Since at any given time the signal has power only at three frequencies, use of such a 848 large spectral window will allow the response noise to introduce significant bias to the 849 estimated chirp-related power in the response since power is a nonnegative random 850 variable. On the other extreme, one could use a shorter window, say 2 ms, such that the 851 chirp frequency is nearly constant over this window. However, a 500-Hz bandwidth 852 limitation is posed due to time-frequency uncertainty. In contrast to the spectrogram, 853 the spectrotemporal filtering approach demodulates the chirp trajectory onto a single 854 frequency, and thus, has a spectral resolution of 0.5 Hz (inverse of signal duration).

855
July 13, 2020 30/51 This spectral resolution is the same as the one used for the 1400-Hz stationary tone, 856 permitting an equivalent comparison. Moreover, the filtered signal has the same 857 temporal sampling rate as the original response. Achieving the same temporal sampling 858 using spectrograms will be computationally extremely expensive because one has to slide 859 the window by one sample and compute the PSD for each window. Thus, by combining 860 apPSTHs with advanced signal processing approaches, the response to the 1400-Hz tone 861 can be compared with the response to the chirp, at the narrowest spectral resolution 862 possible (inverse of the signal duration) and at the original temporal sampling rate.

863
The benefits of spectrotemporal filtering extend to other spectrally sparse signals, 864 like harmonic complexes. A priori knowledge of the fundamental frequency can be used 865 to construct the harmonicgram, which takes advantage of power concentration at 866 harmonics of F 0 . Such an approach contrasts with the spectrogram, which computes 867 power at all frequencies uniformly. The harmonicgram can be used to analyze both 868 kinematic synthesized vowels (Fig 12) as well as natural speech (Fig 13). The 869 harmonicgram is particularly useful in quantifying dominant harmonics in the response 870 at high temporal sampling, and is thus applicable to nonstationary signals. The 871 harmonicgram can also be applied to evoked far-field potentials (e.g., the FFR in Fig   872   14). While alternatives exist to analyze spike-train data in response to time-varying 873 stimuli (Brown et al., 2002), the present spectrotemporal technique is simpler and can 874 be directly applied to both spike-train data and far-field responses. Overall, these 875 results support the idea that using apPSTHs to analyze spike trains provides a unifying 876 framework to study temporal coding in the auditory system across modalities. the PSTHs that require two polarities to be estimated, e.g., s(t), d(t), and φ(t), may 886 not have an "internal representation" in the brain. This limitations also applies to 887 correlogram metrics based on sumcor and difcor, which require two polarities of the 888 stimulus. Thus, the use of the single-polarity PSTH [p(t)] to derive the central "internal 889 representations" is more appropriate from a biological feasibility perspective (e.g., Fig 890  5). However, these various ENV/TFS components allow a thorough characterization of 891 the processing of spectrotemporally complex signals by the nonlinear auditory system 892 and can guide the development of more accurate speech-intelligibility models and help 893 improve signal processing strategies for hearing-impaired listeners.

894
Alternating-polarity stimuli 895 Use of two polarities may not be sufficient to separate out all the components 896 underlying the neural response when more than two components contribute to the 897 neural response at a given frequency. In particular, it may be intractable to separate 898 out rectifier distortion when the bandwidths of ENV and TFS in the response overlap. 899 For example, consider the response of a broadly tuned AN fiber in response to a vowel, 900 which has a fundamental frequency of F 0 . The energy at 2F 0 in S(f ) may reflect one or 901 more of the following sources: (1) rectifier distortion to carrier energy at F 0 , (2) beating 902 between (carrier) harmonics that are separated by 2F 0 , and (3) effects of transduction 903 nonlinearities on the beating between (carrier) harmonics that are separated by F 0 . In 904 these special cases, additional stimulus phase variations can be used to separate out 905 these components (Billings and Zhang, 1994;Lucchetti et al., 2018).

907
A key drawback of applying the harmonicgram to natural speech is the requirement of 908 knowing the F 0 trajectory. F 0 estimation is a difficult problem, especially in degraded 909 speech. Thus, the harmonicgram could be inaccurate unless the F 0 trajectory is known, 910 or at least the original stimulus is known so that F 0 can be estimated. were recorded using subdermal electrodes in a vertical montage (mastoid to vertex with 929 common ground near the nose) under the same ketamine/xylazine anesthesia induction 930 protocol described above using standard procedures in our laboratory (Zhong et al., 931 2014 Table 2. List of terms and definitions.

Term Definition
Electrophysiology Studies that record and analyze far-field (gross) potentials, e.g., electroencephalography Neurophysiology Studies that record and analyze spike-train data from neurons, e.g., AN fiber spike trains Stationarity A signal is stationary when the signal parameters do not change over time. For example, a stochastic signal like white Gaussian noise is stationary if the amplitude probability density function is constant across time. Similarly, a deterministic pure tone can be considered an example of a stationary sinusoidal process with a particular amplitude, frequency, and initial phase.
Second-order stationarity A stochastic signal is second-order stationary if its mean and autocorrelation function do not change over time. Second-order stationarity is also referred to as wide-sense stationarity.

Linearity
A system is linear if it obeys the rules of superposition. For example, consider a system for which inputs x 1 and x 2 evoke responses y 1 and y 2 , respectively. Then, the system is linear if the response to input ax 1 + bx 2 is ay 1 + by 2 . An auditory corollary of linearity is that a linear system (e.g., the ear canal) processes sound in the same way at soft and loud sound levels, which means that for every dB increase in the input, the output is increased by the same dB. Power along a spectro-temporal trajectory 953 Consider a known frequency trajectory, f traj (t), along which we need to estimate power in a signal, x(t). The phase trajectory, Φtraj(t), can be computed as the integration of July 13, 2020 33/51 For discrete-time signals, the phase trajectory can be estimated as The phase trajectory can be demodulated from x(t) by multiplying a complex 954 exponential with phase = −Φ traj (t) (Olhede and Walden, 2005) The power along f traj (t) in x(t) can be estimated as the power in x demod (t) within 956 the spectral resolution bandwidth (W) near 0 Hz in the spectral estimate, P x demod (f ), of 957 x demod (t).

958
The scaling factor 2 is required because the integral in Eq 11 only represents the 959 original positive-frequency band of the real signal, x(t); the equal amount of power 960 within the original negative-frequency band, which is shifted further away from 0 Hz by 961 Φ traj (t), should also be included (see Fig 11).

963
Consider a harmonic complex, x(t), with a time-varying (instantaneous) fundamental 964 frequency, F 0 (t). For a well-behaved and smooth F 0 (t), energy in x(t) will be 965 concentrated at multiples of the instantaneous fundamental frequency, i.e., kF 0 (t).

966
Thus, x(t) can be represented by the energy distributed across the harmonics of the 967 fundamental. The time-varying power along the k-th harmonic of F 0 (t) can be 968 estimated by first demodulating x(t) with the kF 0 (t) trajectory using Eq 10, and then 969 using an appropriate low-pass filter to limit energy near 0 Hz (say within ±W/2).  were constructed using φ(t).

1008
Supporting information S1 Appendix. Vector strength metric definitions Vector Strength. The vector strength (VS ) metric is used to quantify how well spikes in a spike train are synchronized to a frequency, f (Goldberg and Brown, 1969;Johnson, 1980). Let us denote a spike train with N spikes as ζ such that ζ = {t 1 , t 2 , ..., t N } and the {t i }s are individual spike times. To compute the vector strength, these spike times are first transformed onto the unit circle such that t i maps to z i as The mean of the set of complex vectors corresponding to all N spikes is Then, VS at frequency f is defined as the magnitude of ρ(f ).
Phase-projected Vector Strength. The phase-projected vector strength (V S pp ) is identical to the VS for a single spike train (i.e., for a single stimulus repetition), but these metrics differ when multiple (R) stimulus repetitions are used. V S pp is advantageous relative to V S when there are relatively fewer spikes per repetition (Yin et al., 2010). To estimate V S pp at frequency f , the magnitude (i.e., V S) and phase [φ r (f )] of the mean complex vector are first calculated for individual repetitions using Eqs 13 and 14 (instead of pooling spike times across all R repetitions). The per-repetition VS estimates, called V S r (f ), are weighted by the cosine of the phase difference between φ r (f ) of the repetition and the mean phase based on all spikes from all repetitions, φ ref (f ), to estimate the phase-projected vector strength, V S r pp (f ), for the repetition.
and φ ref (f ) is computed using all spikes across all R repetitions as .
V S pp (f ) for R repetitions is computed as the mean V S r pp (f ) across all repetitions, Using Eq 15 in Eq 13, we get If we assume response phase locking to positive and negative polarity of a sinusoid (f 0 ) differ by a phase of π [i.e., a time difference of T 0 /2(= 1/2f 0 ) such that p(t) n(t)e 2πf T0/2 ], we can write For f = f 0 , the integral in Eq 17 will be zero.
S3 Appendix. Relation between shuffled correlograms and apPSTHs Consider X: a set of T X spike trains {ζ 1 , ζ 2 , ..., ζ T X } in response to a stimulus of duration D. For each spike train ζ i , we can construct a PSTH, x i , with PSTH bin width ∆ so that the length of the single-trial PSTH x i is M = D/∆. The single-trial PSTH is a binary-valued vector because each element in the vector is either 0 or 1. Let us denote the PSTH for X by P ST H X such that P ST H X = T X i=1 x i . Consider Y: another set of T Y spike trains, with y i and P ST H Y defined similarly to x i and P ST H X , respectively. Let us assume that the stimulus duration and bin width for y i are the same as that for x i . Let the average discharge rates (in spikes/s) for X and Y be r X and r Y , respectively. The shuffled cross-correlogram (SCC ) for two spike trains ζ i and ζ j computed using tallying (Louage et al., 2004) is identical to the cross-correlation function (denoted by R X Y ) between their respective PSTHs, (x i and x j ). Thus, the raw (not normalized) shuffled cross-correlogram (SCC raw ) at τ delay can be computed as SCC raw X,Y (τ ) = R X Y (x 1 , {y 1 , y 2 , ..., y T Y }) + ... + R X Y (x T X , {y 1 , y 2 , ..., y T Y }) = R X Y (x 1 , [y 1 + y 2 + ... + y T Y ]) + ...+ (20) where SCC norm is the normalized SCC (Louage et al., 2004;Heinz and Swaminathan, 2009). Similarly, the raw shuffled autocorrelogram (SAC raw ) at τ delay can be computed as, SAC raw X (τ ) = R X Y (x 1 , {x 2 , x 3 , ..., x T X }) + R X Y (x 2 , {x 1 , x 3 , ..., x T X }) + ...
where R X denotes the autocorrelation function. Similar to autocorrelation functions, the SAC norm has its maximum at zero delay.
In the numerator of Eq 22, the term T X i=1 R X (x i ) is negligible compared to R X (P ST H X ) for τ = 0. For τ = 0, T X i=1 R X (x i ) is equal to the total number of spikes July 13, 2020 43/51 (N ) in X. Thus, Eq 22 can be further approximated by, where N = r X DT X , and δ is the Dirac delta function. The simplifying approximation in Eq 24 is valid for typically used T X values in neurophysiological experiments, and equates the normalization factors between SACs and SCCs when working with difcor and sumcor (e.g., S4 Appendix). Eqs 21 to 24 indicate that correlograms can be computed much more efficiently using apPSTHs instead of by tallying spike times [O(N ) instead of O(N 2 ), see main text].
S4 Appendix. Relation between difcor/sumcor and difference/sum PSTHs Consider X + : spike trains in response to the positive polarity of a stimulus, and X − : spike trains in response to the negative polarity of the stimulus. Then, the difcor at τ delay can be computed as dif cor X (τ ) = 1 2 For analytic simplicity, we use Eq 24 for SAC norm instead of Eq 22. Let us assume that the number of repetitions and average rates for both polarities are the same. Thus, where K = T 2 X r 2 X D∆ is a constant. Now, P ST H X+ = p(t), P ST H X− = n(t), and the difference PSTH d(t) = [p(t) − n(t)] /2. Then, the difcor for X at delay τ is Now, July 13, 2020 45/51 Thus, Similarly, it can be shown that where s(t) is the sum PSTH, i.e., s(t) = [p(t) + n(t)]/2. Eqs 25 and 26 indicate that sumcor and difcor are related to the autocorrelation function of the sum and difference PSTHs, respectively, and thus can be computed much more efficiently [O(N ) rather than O(N 2 )]. S5 Appendix. Relation between shuffled-correlogram peak-height and apPSTHs Consider a difference PSTH, d(t), based on a set of spike trains X in response to a stimulus of duration D. Let us denote the Fourier transform of d(t) by D(f ). Then, from Eq 25, the difcor peak-height, i.e., difcor value at zero delay (τ ), can be computed as , (by Parseval's theorem) Following similar steps from Eq 26, it can also be shown that the sumcor peak-height can be computed as where S(f ) is the Fourier transform of the sum PSTH, s(t).
Comparing Eq 19 with Eqs. 27 and 28, we see that vector strength is a frequency-specific metric, whereas correlogram peak-heights are broadband measures, which are thus susceptible to rectifier distortion (see Fig 6). S1 Table. Parameters for the AN model Fractional Gaussian noise type 0 (fixed) Implementation type of the power-law functions in the Synapse 0 (approximate) Spike time resolution 10 µs List of parameters used in the AN model to generate simulated spike-train data.
July 13, 2020 48/51 S1 Fig. Nonlinear inner-hair-cell transduction function introduces additional sidebands in the spectrum for a SAM tone. (C) Waveform of the output after processing the SAM tone through a sigmoid function. The sigmoid function was used as a simple proxy for the inner-hair-cell transduction function. This output (vIHC) was further low-pass filtered at 2 kHz to mimic the membrane properties of inner hair cells. (D) D(f ) and S(f ) for the signal in C. In addition to having power at F c and F c ± F m , D(f ) for vIHC has substantial energy at F c ± 2F m (plus reduced energy at higher multiple F m -offsets from F c ). Similarly, S(f ) for vIHC has substantial energy at F m as well as at the first few harmonics of F m . S(f ) is also corrupted by rectifier distortion at 2F c (and multiple F m -offsets from 2F c ) as expected.