Exposing distinct subcortical components of the auditory brainstem response evoked by continuous naturalistic speech

Melissa J Polonenko, Ross K Maddox
doi: https://doi.org/10.1101/2020.08.20.258301
Department of Biomedical Engineering, Department of Neuroscience, Del Monte Institute for Neuroscience, Center for Visual Science, University of Rochester
For correspondence: ross.maddox@rochester.edu

ABSTRACT

The auditory brainstem is important for processing speech, yet we have much to learn regarding the contributions of different subcortical structures. These deep neural generators respond quickly, making them difficult to study during dynamic, ongoing speech. Recently developed techniques have paved the way to use natural speech stimuli, but the auditory brainstem responses (ABRs) they provide are temporally broad and thus have ambiguous neural sources. Here we describe a new method that uses re-synthesized “peaky” speech stimuli and deconvolution analysis of EEG data to measure canonical ABRs to continuous naturalistic speech of male and female narrators. We show that in adults with normal hearing, peaky speech quickly yields robust ABRs that can be used to investigate speech processing at distinct subcortical structures from auditory nerve to rostral brainstem. We further demonstrate the versatility of peaky speech by simultaneously measuring bilateral and ear-specific responses across different frequency bands. Thus, the peaky speech method holds promise as a powerful tool for investigating speech processing and for clinical applications.

INTRODUCTION

Understanding speech is an important, complex process that spans the auditory system from cochlea to cortex. A temporally precise network transforms the strikingly dynamic fluctuations in amplitude and spectral content of natural, ongoing speech into meaningful information, and modifies that information based on attention or other priors (Mesgarani et al., 2009). Subcortical structures play a critical role in this process – they do not merely relay information from the periphery to the cortex, but also perform important functions for speech understanding, such as localizing sound (e.g., Grothe and Pecka, 2014) and encoding vowels across different levels and in background noise (e.g., Carney et al., 2015). Furthermore, subcortical structures receive descending information from the cortex through corticofugal pathways (Bajo et al., 2010; Bajo and King, 2012; Winer, 2005), suggesting they may also play an important role in modulating speech and auditory streaming. Given the complexity of speech processing, it is important to parse and understand contributions from different neural generators. However, these subcortical structures are deep and respond to stimuli with very short latencies, making them difficult to study with ecologically salient stimuli such as continuous, naturalistic speech. We created a novel paradigm aimed at elucidating the contributions of distinct subcortical structures to ongoing, naturalistic speech.

Activity in deep brainstem structures can be “imaged” by the latency of waves in a surface electrical potential (electroencephalography, EEG) called the auditory brainstem response (ABR). The ABR’s component waves have been attributed to activity in different subcortical structures with characteristic latencies: the auditory nerve contributes to waves I and II (~1.5–3 ms), the cochlear nucleus to wave III (~4 ms), the superior olivary complex and lateral lemniscus to wave IV (~5 ms), and the lateral lemniscus and inferior colliculus to wave V (~6 ms) (Møller and Jannetta, 1983; review by Moore, 1987; Starr and Hamilton, 1976). Waves I, III, and V are most often easily distinguished in the human response. Subcortical structures may also contribute to the earlier P0 (12–14 ms) and Na (15–25 ms) waves (Hashimoto, 1982; Kileny et al., 1987; Picton et al., 1974) of the middle latency response (MLR), which are then followed by the thalamo-cortically generated waves Pa, Nb, and Pb/P1 (Geisler et al., 1958; Goldstein and Rodman, 1967). ABR and MLR waves have a low signal-to-noise ratio (SNR) and require numerous stimulus repetitions to record a good response. Furthermore, these waves occur quickly, often before the stimulus has ended. Therefore, out of necessity, most human brainstem studies have focused on brief stimuli such as clicks, tone pips, or speech syllables, rather than more natural speech.

Recent analytical techniques have overcome limitations on stimuli, allowing continuous naturally uttered speech to be used. One such technique extracts the fundamental waveform from the speech stimulus and finds the envelope of the cross-correlation between that waveform and the recorded EEG data (Forte et al., 2017). The response has an average peak time of about 9 ms, with contributions primarily from the inferior colliculus (Saiz-Alía and Reichenbach, 2020). A second technique considers the rectified broadband speech waveform as the input to a linear system and the EEG data as the output, and uses deconvolution to compute the ABR waveform as the impulse response of the system (Maddox and Lee, 2018). The speech-derived ABR shows a wave V peak whose latency is highly correlated with the click response wave V across subjects, demonstrating that the component is generated in the rostral brainstem. A third technique averages responses to each chirp (click-like transients that quickly increase in frequency) in re-synthesized “cheech” stimuli (CHirp spEECH; Miller et al., 2017) that interleave alternating octave frequency bands of speech and of chirps aligned with some glottal pulses (Backer et al., 2019). Brainstem responses to these stimuli also show a wave V, but do not show earlier waves unless presented monaurally over headphones (Backer et al., 2019; Miller et al., 2017). While these methods reflect subcortical activity, the first two provide temporally broad responses with a lack of specificity regarding underlying neural sources. None of the three methods shows the earlier canonical components such as waves I and III that would allow rostral brainstem activity to be distinguished from, for example, the auditory nerve. Such activity is important to assess, especially given the current interest in the potential contributions of auditory nerve loss to disordered processing of speech in noise (Bramhall et al., 2019; Liberman et al., 2016; Prendergast et al., 2017).
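
The deconvolution at the core of the second technique can be sketched in a few lines. The snippet below is a minimal illustration rather than the published implementation (the function name, the single-segment input, and the simple spectral floor are our assumptions; the actual analyses average over many epochs):

```python
import numpy as np

def deconvolve_abr(stim, eeg, fs, t_min=-0.2, t_max=0.6):
    """Estimate the ABR as the impulse response of a linear system
    with `stim` (e.g., rectified speech, or a glottal pulse train as
    used for peaky speech) as input and `eeg` as output."""
    n = len(eeg)
    S = np.fft.rfft(stim, n)
    E = np.fft.rfft(eeg, n)
    # Impulse response = cross-spectrum / stimulus power spectrum;
    # the small floor avoids division by zero in empty bins
    h = np.fft.irfft(np.conj(S) * E / (np.abs(S) ** 2 + 1e-12), n)
    lags = np.arange(int(t_min * fs), int(t_max * fs))
    return lags / fs, h[lags % n]  # negative lags wrap to the array end
```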

We asked if we could further assess underlying speech processing in multiple distinct early stages of the auditory system by 1) evoking waves of the canonical ABR in addition to wave V, and 2) measuring responses to different frequency ranges of speech (corresponding to different places of origin on the cochlea). The ABR is evoked most strongly by very short stimuli such as clicks, so we created “peaky” speech. The design goal of peaky speech is to re-synthesize natural speech so that its defining spectrotemporal content is unaltered but its pressure waveform consists of maximally sharp peaks, so that it drives the ABR as effectively as possible. The results show that peaky speech evokes canonical brainstem responses and frequency-specific responses, paving the way for novel studies of subcortical contributions to speech processing.
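
The design goal can be made concrete with a toy example (ours, not the paper’s: it uses a fixed fundamental and flat harmonic amplitudes, whereas the actual stimuli follow the narrator’s time-varying pitch and preserve the speech spectrum). Aligning the phases of all harmonics at each glottal pulse leaves the amplitude spectrum untouched but concentrates the pressure waveform into one sharp peak per glottal period:

```python
import numpy as np

fs = 44100
f0 = 115.0                          # male narrator's average pitch
t = np.arange(int(0.05 * fs)) / fs  # 50 ms of voicing
k = np.arange(1, 41)[:, None]       # 40 harmonics (up to ~4.6 kHz)

# Natural-like voicing: harmonics with arbitrary phases
rng = np.random.default_rng(0)
phi = rng.uniform(0, 2 * np.pi, (40, 1))
natural = np.cos(2 * np.pi * f0 * k * t + phi).sum(axis=0)

# "Peaky" version: all harmonic phases set to zero, so every harmonic
# peaks together at each glottal pulse, giving an impulse-like waveform
peaky = np.cos(2 * np.pi * f0 * k * t).sum(axis=0)
```

The two signals have identical amplitude spectra; only the phase, and hence the crest factor, differs.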

RESULTS

Broadband peaky speech yields more robust responses than unaltered speech

Broadband peaky speech elicits canonical brainstem responses

In previous work, brainstem responses to natural, on-going speech exhibited a temporally broad wave V but no earlier waves (Maddox and Lee, 2018). We re-synthesized speech to be “peaky” with the primary aim to evoke additional, earlier waves of the ABR that identify different neural generators. Indeed, Figure 1 shows that waves I, III, and V of the canonical ABR are clearly visible in the group average and in the individual responses to broadband peaky speech. This means that broadband peaky speech, unlike the unaltered speech, can be used to assess naturalistic speech processing at discrete parts of the subcortical auditory system, from the auditory nerve to rostral brainstem. These responses represent weighted averaged data from ~43 minutes of continuous speech (40 epochs of 64 s each), and were filtered at a typical high-pass cutoff of 150 Hz to highlight the earlier ABR waves.
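
The “weighted average” here can be sketched as inverse-variance weighting of the per-epoch responses, one common scheme for down-weighting noisy epochs (a minimal sketch under that assumption; the paper’s exact weights may differ):

```python
import numpy as np

def weighted_average(epochs):
    """Average an (n_epochs, n_times) array with each epoch weighted
    by the inverse of its variance, so noisier epochs count less."""
    w = 1.0 / np.var(epochs, axis=1)
    return (w / w.sum()) @ epochs
```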

Figure 1.

Single subject and group average (bottom right) weighted-average auditory brainstem responses (ABR) to ~43 minutes of broadband peaky speech. Area for the group average shows ± 1 SEM. Responses were high-pass filtered at 150 Hz using a first order Butterworth filter. Waves I, III, and V of the canonical ABR are evident in most of the single subject responses (N = 22, 16, 22 respectively), and are marked by the average peak latencies on the average response.

Morphology of the broadband peaky speech ABR was inspected and waves marked by a trained audiologist (MJP) on 2 occasions that were 3 months apart. The intraclass correlation coefficients for absolute agreement (ICC3) were ≥ 0.91 (lowest ICC3 95% confidence interval was 0.78–0.96 for wave I, p < 0.01), indicating excellent reliability for chosen peak latencies. Waves I and V were identifiable in responses from all subjects (N = 22), and wave III was clearly identifiable in 16 of the 22 subjects. These waves are marked on the individual responses in Figure 1. Mean ± SEM peak latencies for ABR waves I, III, and V were 2.95 ± 0.10 ms, 5.11 ± 0.09 ms, and 6.96 ± 0.07 ms respectively. These mean peak latencies are shown superimposed on the group average response in Figure 1 (bottom right). Inter-wave latencies were 2.13 ± 0.05 ms (N = 16) for I–III, 1.78 ± 0.06 ms (N = 16) for III–V, and 4.01 ± 0.07 ms (N = 22) for I–V. These inter-wave latencies fall within the range expected for brainstem responses, but the absolute peak latencies were later than those reported for a click ABR at a level of 60 dB sensation level (SL) and rates between 50 and 100 Hz (Burkard and Hecox, 1983; Chiappa et al., 1979; Don et al., 1977).

More components of the ABR and MLR are present with broadband peaky than unaltered speech

Having established that broadband peaky speech evokes robust canonical ABRs, we next compared both ABR and MLR responses to those evoked by unaltered speech. To simultaneously evaluate ABR and MLR components, a high-pass filter with a 30 Hz cutoff was used on the responses to ~43 minutes of each type of speech. Figure 2A shows that overall there were morphological similarities between responses to both types of speech; however, there were more early and late component waves in response to broadband peaky speech. More specifically, whereas both types of speech evoked waves V, Na and Pa, broadband peaky speech also evoked waves I, often III (14–16 of 22 subjects, depending on whether a 30 or 150 Hz high-pass cutoff was used), and P0. With the lower high-pass cutoff, wave III rode on the slope of wave V and was less identifiable in the grand average shown in Figure 2A than with the higher cutoff in Figure 1. Wave V was more robust and sharper in response to broadband peaky speech but peaked slightly later than the broader wave V to unaltered speech. For reasons unknown to us, the half-wave rectified speech method missed the MLR wave P0, and consequently had a broader and earlier Na than the broadband peaky speech method, though this missing P0 was consistent with the results of Maddox and Lee (2018). These waveforms indicate that broadband peaky speech is better than unaltered speech at evoking canonical responses that distinguish activity from distinct subcortical and cortical neural generators.

Figure 2.

Comparison of auditory brainstem (ABR) and middle latency responses (MLR) to ~43 minutes each of unaltered speech and broadband peaky speech. (A) The average waveform to broadband peaky speech (blue) shows additional, and sharper, waves of the canonical ABR and MLR than the broader average waveform to unaltered speech (black). Responses were high-pass filtered at 30 Hz with a first order Butterworth filter. Areas show ± 1 SEM. (B) Comparison of peak latencies for ABR wave V (circles) and MLR waves Na (downward triangles) and Pa (upward triangles) that were common between responses to broadband peaky and unaltered speech. Blue symbols depict individual subjects and black symbols depict the mean.

Peak latencies for the waves common to both types of speech are shown in Figure 2B. Again, there was good agreement in peak wave choices for each type of speech, with ICC3 ≥ 0.94 (the lowest two ICC3 95% confidence intervals were 0.87–0.98 and 0.92–0.99 for waves V and Na to unaltered speech respectively). As suggested by the waveforms in Figure 2A, mean ± SEM peak latency differences for waves V, Na, and Pa were longer for broadband peaky than unaltered speech by 0.29 ± 0.05 ms (paired t-test, t(21) = 5.4, p < 0.01, d = 1.19), 4.40 ± 0.43 ms (t(21) = 9.9, p < 0.01, d = 2.16), and 0.40 ± 0.39 ms (t(20) = 1.0, p = 0.33, d = 0.22) respectively.

We also verified that the EEG data collected in response to broadband peaky speech could be regressed with the half-wave rectified speech to generate a response. Details can be found in the supplemental material and Supplemental Figure 1A.

Broadband peaky speech responses differ across talkers

We next sought to determine whether response morphologies depended on the talker identity. Responses derived from unaltered speech show similar, but not identical, morphology between male and female narrators, indicating some dependence on the stimulus (Maddox and Lee, 2018). To determine to what extent the morphology and robustness of peaky speech responses depend on a specific narrator’s voice, we compared waveforms and peak wave latencies for 32 minutes (30 epochs of 64 s each) each of male- and female-narrated broadband peaky speech in 11 subjects. The average fundamental frequency was 115 Hz for the male narrator and 198 Hz for the female narrator.

The group average waveforms to female- and male-narrated broadband peaky speech showed similar canonical morphologies but were smaller and later for female-narrated ABR responses (Figure 3A), much as they would be for click stimuli presented at higher rates (e.g., Burkard et al., 1990; Burkard and Hecox, 1983; Chiappa et al., 1979; Don et al., 1977; Jiang et al., 2009). All component waves of the ABR and MLR were visible in the group average, although fewer subjects exhibited a clear wave III in the female-narrated response (9 versus all 11 subjects). The median (interquartile range) male-female correlation coefficients were 0.67 (0.60–0.77) for ABR lags of 0–15 ms with a 150 Hz high-pass filter, and 0.44 (0.35–0.61) for ABR/MLR lags of 0–40 ms with a 30 Hz high-pass filter (Figure 3B).

Figure 3.

Comparison of responses to 32 minutes each of male- (dark blue) and female-narrated (light blue) re-synthesized broadband peaky speech. (A) Average waveforms across subjects (areas show ± 1 SEM) are shown for auditory brainstem response (ABR) time lags with high-pass filtering at 150 Hz (top), and for both ABR and middle latency response (MLR) time lags with a lower high-pass cutoff of 30 Hz (bottom). (B) Histograms of the correlation coefficients between responses evoked by male- and female-narrated broadband peaky speech during ABR (top) and ABR/MLR (bottom) time lags. Solid lines denote the median and dotted lines the inter-quartile range. (C) Comparison of ABR (top) and MLR (bottom) wave peak latencies for individual subjects (gray) and the group mean (black). Responses to both narrators showed similar ABR and MLR morphology but were smaller for female-narrated speech, which has a higher glottal pulse rate. Peak latencies for female-narrated speech were delayed at ABR time lags but earlier at early MLR time lags.

To determine if this stimulus dependence was significantly different than variability introduced by averaging only half the epochs (i.e., splitting by male- and female-narrated epochs), we reanalyzed the data split into even and odd epochs. Each of the even/odd splits contained the same number of male- and female-narrated epochs, and these were evenly distributed over the entire recording session. The median (interquartile range) odd-even correlation coefficients were 0.86 (0.80–0.95) for ABR lags and 0.47 (0.36–0.80) for ABR/MLR lags. These odd-even coefficients were significantly higher than the male-female coefficients for the ABR (W(10) = 0.0, p < 0.001; Wilcoxon signed-rank test) but not when the response included the MLR (W(10) = 18.0, p = 0.206), indicating that the choice of narrator for peaky speech impacts the morphology of the early response.
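
This split-half control can be sketched briefly (variable names are hypothetical; `abr_idx` would select the 0–15 ms lags):

```python
import numpy as np
from scipy.stats import pearsonr, wilcoxon

def split_half_r(epoch_responses, idx):
    """Correlation between the averages of odd- and even-numbered
    epochs over the samples in `idx`; `epoch_responses` is an
    (n_epochs, n_times) array in presentation order."""
    odd = epoch_responses[0::2].mean(axis=0)
    even = epoch_responses[1::2].mean(axis=0)
    return pearsonr(odd[idx], even[idx])[0]

# Across the 11 subjects, the male-female coefficients can then be
# compared with the odd-even ones, e.g.:
# W, p = wilcoxon(male_female_r, odd_even_r)
```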

As expected from the waveforms, peak latencies of component waves differed between male- and female-narrated broadband peaky speech (Figure 3C). As before, ICC3 ≥ 0.83 indicated good agreement in peak wave choices (the lowest two 95% confidence intervals were 0.64–0.93 for Na and 0.92–0.99 for I). Mean ± SEM peak latency differences (female – male) for waves I, III, and V of the ABR were 0.19 ± 0.08 ms (t(10) = 2.15, p = 0.057, d = 0.68), 0.51 ± 0.09 ms (t(8) = 5.56, p < 0.001, d = 1.97), and 0.51 ± 0.11 ms (t(10) = 4.29, p = 0.002, d = 1.36) respectively. The female-narrated P0 peaked earlier (−1.27 ± 0.39 ms, t(10) = −3.11, p = 0.011, d = −0.98), but latency differences were not significant for later MLR peaks (Na: −0.86 ± 0.57 ms, t(8) = −1.40, p = 0.197, d = −0.50; Pa: −0.04 ± 0.44 ms, t(9) = −0.09, p = 0.933, d = −0.03).

Multiband peaky speech yields frequency-specific brainstem responses to speech

Frequency-specific responses show frequency-specific lags

Broadband peaky speech gives new insights into subcortical processing of naturalistic speech. Brainstem responses are used not only to evaluate processing at different stages of the auditory pathway, but also to assess hearing function across different frequencies. Traditionally, frequency-specific ABRs are measured using clicks with high-pass masking noise or frequency-specific tone pips. We tested the flexibility of our new peaky speech technique for investigating how speech processing differs across frequency regions, such as the 0–1, 1–2, 2–4, and 4–8 kHz frequency bands. To do this, we created new pulse trains with slightly different fundamental waveforms for each filtered frequency region of speech, and then combined those filtered frequency bands together as multiband speech (for details, see the Multiband peaky speech subsection of Methods). This method takes advantage of the fact that, over time, stimuli with slightly different fundamental frequencies will be independent, yielding independent auditory brainstem responses. Therefore, the same EEG was regressed with each band’s pulse train to derive the ABR and MLR to each frequency band.
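
A toy version of this construction is sketched below (ours: constant pulse rates stand in for the narrator’s time-varying pitch, and the frequency offsets are made up for illustration):

```python
import numpy as np

fs, dur = 44100, 1.0

def pulse_train(f0, fs, dur):
    """Unit impulses every 1/f0 seconds (constant-rate toy; the real
    trains follow the narrator's glottal pulse times)."""
    x = np.zeros(int(fs * dur))
    x[(np.arange(0.0, dur, 1.0 / f0) * fs).astype(int)] = 1.0
    return x

# Slightly offset fundamentals make the trains mutually independent
# over time, so a single EEG recording can be deconvolved separately
# against each band's train (using, e.g., deconvolve_abr from above):
trains = [pulse_train(f0, fs, dur) for f0 in (114.4, 114.8, 115.2, 115.6)]
```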

Mean ± SEM responses from 22 subjects to the 4 frequency bands (0–1, 1–2, 2–4, and 4–8 kHz) of ~43 minutes of male-narrated multiband peaky speech are shown as colored waveforms with solid lines in Figure 4A. A high-pass filter with a cutoff of 30 Hz was used. Each frequency band response comprises a frequency-band-specific component as well as a band-independent common component, both of which are due to spectral characteristics of the stimuli and neural activity. The pulse trains are independent over time in the vocal frequency range – thereby allowing us to pull out responses to each different pulse train and frequency band from the same EEG – but they became coherent at frequencies lower than 72 Hz for the male-narrated speech and 126 Hz for the female-narrated speech (see Figure 13 in Methods). This coherence was due to all pulse trains beginning and ending together at the onset and offset of voiced segments, and was the source of the low-frequency common component of each band’s response. There are two options for removing the common component. First, we could simply high-pass the response at 150 Hz to filter out the regions of spectral coherence in the stimuli, as shown by the waveforms with dashed lines in Figure 4B. However, this method reduces the amplitude of the responses, which in turn affects response SNR and detectability. The second option is to calculate the common activity across the frequency band responses and subtract this waveform from each of the frequency band responses. This common component was calculated by regressing the EEG to the multiband speech with 6 independent “fake” pulse trains – pulse trains with slightly different fundamental frequencies that were not used to create the multiband peaky speech stimuli presented during the experiment – and then averaging across these 6 responses. This common component waveform is shown by the dot-dashed gray line superimposed on each frequency band’s response in Figure 4A. The subtracted, frequency-specific waveforms for each frequency band are shown by the solid lines in Figure 4B. Of course, the subtracted waveforms could also be high-pass filtered at 150 Hz to highlight earlier waves of the brainstem responses, as shown by the dashed lines in Figure 4B. Overall, the frequency-specific responses showed characteristic ABR and MLR waves with longer latencies for lower frequency bands, as would be expected for responses arising from different cochlear regions. Also, waves I and III of the ABR were visible in the group average waveforms of the 2–4 kHz (≥41% of subjects) and 4–8 kHz (≥86% of subjects) bands, whereas the MLR waves were more prominent in the 0–1 kHz (≥95% of subjects) and 1–2 kHz (≥54% of subjects) bands.
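
Once the band responses and the “fake”-train responses are in hand, the subtraction itself is a one-liner; the sketch below uses random arrays purely as stand-ins so that it runs:

```python
import numpy as np

rng = np.random.default_rng(0)
band_resp = rng.standard_normal((4, 1000))  # stand-ins: 4 band responses
fake_resp = rng.standard_normal((6, 1000))  # stand-ins: 6 fake-train responses

common = fake_resp.mean(axis=0)    # band-independent common component
specific = band_resp - common      # frequency-specific responses

# Reassembling the full multiband response for comparison with the
# broadband response (cf. Figure 5):
summed = specific.sum(axis=0) + common
```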

Figure 4.

Comparison of responses to ~43 minutes of male-narrated multiband peaky speech. (A) Average waveforms across subjects (areas show ± 1 SEM) are shown for each band (colored solid lines) and for the common component (dot-dash gray line, same waveform replicated as a reference for each band), which was calculated using 6 “fake” pulse trains. (B) The common component was subtracted from each band’s response to give the frequency-specific waveforms (areas show ± 1 SEM), which are shown with high-pass filtering at 30 Hz (solid lines) and 150 Hz (dashed lines). (C) Mean ± SEM peak latencies for each wave decreased with increasing band frequency. Numbers of subjects with an identifiable wave are given for each wave and band.

These frequency-dependent latency changes for the frequency-specific responses are highlighted further in Figure 4C, which shows mean ± SEM peak latencies and the number of subjects who had a clearly identifiable wave. ICC3 ≥ 0.89 indicated good agreement in peak wave choices (lowest two 95% confidence intervals were 0.82–0.93 for Pa and 0.88–0.95 for Na). The nonlinear change in peak latency with frequency band was modeled using mixed effects regression by including orthogonal linear and quadratic terms for frequency band and their interactions with wave, as well as random effects of intercept and each frequency band term for each subject. A model was completed for each filter cutoff of 30 and 150 Hz. There were insufficient numbers of subjects with identifiable waves I and III for the 0–1 kHz and 1–2 kHz bands, so these waves were not included in the full model. Details of each model are described in Supplemental Table 1. As expected, there were significantly different latencies for each MLR wave P0, Na and Pa compared to the ABR wave V (all effects of wave on the intercept p < 0.001, for each high-pass filter cutoff of 30 and 150 Hz). The significant decrease in latency with frequency band (linear term, slope: p < 0.001 for 30 and 150 Hz) was steeper (i.e., more negative) for MLR waves compared to the ABR wave V (all p < 0.001 for interactions between wave and the linear frequency band term for 30 and 150 Hz). The rate of latency decrease also changed significantly (quadratic frequency band term: p = 0.001 for 30 Hz and p < 0.001 for 150 Hz), but in a similar way for each component wave (all p > 0.091 for interactions between wave and the quadratic frequency band term, for 30 and 150 Hz models).
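
In model-fitting terms, this corresponds to a mixed model along the following lines. The sketch below uses synthetic placeholder data and simple orthogonal polynomial contrasts; the paper’s exact term coding may differ:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder data: 22 subjects x 4 waves x 4 bands (latencies random)
rng = np.random.default_rng(0)
waves, lin, quad = ["V", "P0", "Na", "Pa"], [-3, -1, 1, 3], [1, -1, -1, 1]
df = pd.DataFrame({
    "subject": np.repeat(np.arange(22), 16),
    "wave": np.tile(np.repeat(waves, 4), 22),
    "band_lin": np.tile(lin, 88),    # orthogonal linear contrast
    "band_quad": np.tile(quad, 88),  # orthogonal quadratic contrast
    "latency": rng.normal(10, 2, 22 * 16),
})

# Fixed effects: wave and its interactions with both band terms;
# random effects: intercept plus both band terms for each subject
md = smf.mixedlm("latency ~ C(wave) * (band_lin + band_quad)", df,
                 groups="subject", re_formula="~band_lin + band_quad")
print(md.fit().summary())
```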

Next, the frequency-specific responses (i.e., multiband responses with the common component subtracted) were summed and the common component added back to derive the entire response to multiband peaky speech. As shown in Figure 5, this summed multiband response was strikingly similar in morphology to the response to broadband peaky speech. Both responses were high-pass filtered at 150 Hz and 30 Hz to highlight the earlier ABR waves and later MLR waves, respectively. The median (interquartile range) correlation coefficients from the 22 subjects were 0.90 (0.86–0.90) for 0–15 ms ABR lags, and 0.55 (0.46–0.74) for 0–40 ms MLR lags. The similarity verifies that the frequency-dependent responses are truly independent of each other, and that these responses are complementary to the common component. If there were overlap in the cochlear regions, for example, the summed response would not resemble the broadband response to such a degree. The similarity also verified that the additional changes we made to create re-synthesized multiband peaky speech did not significantly affect responses compared to broadband peaky speech.

Figure 5.

Comparison of responses to ~43 minutes of male-narrated peaky speech in the same subjects. Average waveforms across subjects (areas show ± 1 SEM) are shown for broadband peaky speech (blue) and for the summed frequency-specific responses to multiband peaky speech with the common component added (red), high-pass filtered at 150 Hz (left) and 30 Hz (right). Regressors in the deconvolution were pulse trains.

Frequency-specific responses also differ by narrator

We also investigated the effects of narrator on multiband peaky speech by deriving responses to 32 minutes (30 epochs of 64 s each) each of male- and female-narrated multiband peaky speech in the same 11 subjects. As with broadband peaky speech, responses to both narrators showed similar morphology, but the responses were smaller and the MLR waves more variable for the female than the male narrator (Figure 6A). Figure 6B shows the male-female correlation coefficients for responses between 0–40 ms with a high-pass cutoff of 30 Hz and between 0–15 ms with a high-pass cutoff of 150 Hz. The median (interquartile range) male-female correlation coefficients were stronger for higher frequency bands, ranging from 0.20 (−0.04–0.29) for the 1–2 kHz band to 0.38 (0.27–0.48) for the 4–8 kHz band for MLR lags (Figure 6B, left panel), and from 0.57 (0.29–0.68) for the 0–1 kHz band to 0.81 (0.70–0.84) for the 4–8 kHz band for ABR lags (Figure 6B, right panel). These male-female correlation coefficients were significantly weaker than those of the same EEG split into even and odd trials for all but the 2–4 kHz frequency band when responses were high-pass filtered at 30 Hz and correlated across 0–40 ms lags (2–4 kHz: W(10) = 21.0, p = 0.320; other bands: W(10) ≤ 8.0, p ≤ 0.024), but were similar to the even/odd trials for responses from all frequency bands high-pass filtered at 150 Hz (W(10) ≥ 13.0, p ≥ 0.083). These results indicate that the specific narrator can affect the robustness of frequency-specific responses, particularly for the MLR waves.

Figure 6.

Comparison of responses to 32 minutes each of male- and female-narrated re-synthesized multiband peaky speech. (A) Average frequency-specific waveforms across subjects (areas show ± 1 SEM; common component removed) are shown for each band in response to male- (dark red lines) and female-narrated (light red lines) speech. Responses were high-pass filtered at 30 Hz (left) and 150 Hz (right) to highlight the MLR and ABR respectively. (B) Correlation coefficients between responses evoked by male- and female-narrated multiband peaky speech during ABR/MLR (left) and ABR (right) time lags for each frequency band. Black lines denote the median. (C) Mean ± SEM peak latencies for male- (dark) and female- (light) narrated speech for each wave decreased with increasing frequency band. Numbers of subjects with an identifiable wave are given for each wave, band and narrator. Lines are given a slight horizontal offset to make the error bars easier to see.

As expected from the grand average waveforms and male-female correlations, fewer subjects had identifiable waves across frequency bands for the female- than the male-narrated speech. These numbers are shown in Figure 6C, along with the mean ± SEM peak latencies for each wave, frequency band and narrator. ICC3 ≥ 0.93 indicated good agreement in peak latency choices (the two lowest 95% confidence intervals were 0.83–0.97 for wave III and 0.98–0.99 for Pa, both with a high-pass filter cutoff of 30 Hz). Again, few subjects had identifiable waves I and III for the lower frequency bands. Therefore, the mixed effects model was completed for waves V, P0, Na, and Pa of responses in the 4 frequency bands that were high-pass filtered at 30 Hz. The model included fixed effects of narrator, wave, linear and quadratic terms for frequency band, the interaction between narrator and wave, and the interactions between wave and the frequency band terms, as well as a random intercept and random frequency band terms per subject. Details of the model are described in Supplemental Table 2. For those subjects with identifiable waves, peak latencies shown in Figure 6C differed by wave (p < 0.001 for effects of each wave on the intercept), and latency decreased with increasing frequency band (p < 0.001 for the linear term, slope). This change with frequency was greater (i.e., steeper slope) for each MLR wave compared to wave V (p < 0.013 for all interactions between wave and the linear term for frequency band). There was no change in slope with band (p = 0.190 for the quadratic term, p > 0.318 for the interactions between wave and the quadratic term). There was also no main effect of narrator on peak latencies (narrator p = 0.481), except that Pa latencies were earlier for the female than the male narrator (Pa–narrator interaction p = 0.003; other wave-narrator interactions p > 0.195). Therefore, as with broadband peaky speech, frequency-specific peaky responses were more robust with the male narrator, but unlike the broadband responses, the frequency-specific responses did not peak earlier for the narrator with the lower fundamental frequency.

Frequency-specific responses can be measured simultaneously in each ear (dichotically)

The focus so far has been on scientific uses of peaky speech, but there are also potential clinical applications. Frequency-specific ABRs to tone pips are traditionally used to assess hearing function in each ear across octave bands with center frequencies of 500–8000 Hz. Applying the same principles used to generate multiband peaky speech, we investigated whether ear-specific responses could be evoked across 5 standard, clinically relevant (audiological) frequency bands using dichotic multiband speech. For dichotic (stereo) audiological multiband peaky speech we created 10 independent pulse trains, one for each ear in each of the 5 frequency bands (see Multiband peaky speech and Band filters in Methods).

We recorded responses to 64 minutes (60 epochs of 64 s each) each of male- and female-narrated dichotic multiband peaky speech in 11 subjects. The frequency-specific (i.e., common component-subtracted) group average waveforms for each ear and frequency band are shown in Figure 7A. The ten waveforms were small, especially for female-narrated speech, but a wave V was identifiable for both narrators. MLR waves were not clearly identifiable for responses to female-narrated speech. Therefore, correlations between responses were performed for ABR lags between 0–15 ms. As shown in Figure 7B, the median (interquartile range) left-right ear correlation coefficients (averaged across narrators) ranged from 0.17 (−0.12–0.49) for the 0.5 kHz band to 0.63 (0.33–0.86) for the 8 kHz band. Male-female correlation coefficients (averaged across ears) ranged from 0.08 (−0.25–0.22) for the 0.5 kHz band to 0.70 (0.26–0.80) for the 4 kHz band. Although the female-narrated responses were smaller than the male-narrated responses, these male-female coefficients did not significantly differ from the left-right ear coefficients (W(10) ≥ 20.0, p ≥ 0.278), or from correlations of the same EEG split into even-odd trials and averaged across ears (W(10) ≥ 20.0, p ≥ 0.278), likely reflecting the variability in such small responses.

Figure 7.

Comparison of responses to ~64 minutes each of male- and female-narrated dichotic multiband peaky speech with standard audiological frequency bands. (A) Average frequency-specific waveforms across subjects (areas show ± 1 SEM; common component removed) are shown for each band for the left ear (dotted lines) and right ear (solid lines). Responses were high-pass filtered at 30 Hz. (B) Left-right ear correlation coefficients (top, averaged across narrators) and male-female correlation coefficients (bottom, averaged across ears) during ABR time lags (0–15 ms) for each frequency band. Black lines denote the median. (C) Mean ± SEM wave V latencies for male- (dark red) and female-narrated (light red) speech for the left (dotted line, cross symbol) and right ear (solid line, circle symbol) decreased with increasing frequency band. Lines are given a slight horizontal offset to make the error bars easier to see.

Figure 7C shows the mean ± SEM peak latencies of wave V for each ear and frequency band for the male- and female-narrated dichotic multiband peaky speech. The ICC3 for wave V was 0.98 (95% confidence interval 0.98–0.99), indicating reliable peak latency choices. The nonlinear change in wave V latency with frequency was modeled using mixed effects regression with fixed effects of narrator, ear, linear and quadratic terms for frequency band, and the interactions between narrator and the frequency band terms. Random effects included an intercept and both frequency band terms for each subject. Details of the model are described in Supplemental Table 3. Wave V latency was significantly longer for female- than male-narrated multiband peaky speech in the 0.5 kHz band (narrator effect on the intercept, p = 0.001), decreased at a steeper rate with frequency band (interaction between narrator and the linear frequency band term, p < 0.001), and had a significantly different rate of change with frequency (interaction between narrator and the quadratic frequency band term, p < 0.001). Overall, latency did not differ between ears (p = 0.116). Taken together, these results confirm that, while small in amplitude, frequency-specific responses can be elicited in both ears across 5 different frequency bands and show the characteristic latency changes across frequency bands.

Responses are obtained quickly for male-narrated broadband peaky speech but not multiband speech

Having demonstrated that peaky broadband and multiband speech provide canonical waveforms with characteristic changes in latency with frequency, we next evaluated the acquisition time required for waveforms to reach a decent SNR. We chose 0 dB SNR based on visual assessment of when waveforms became easy to inspect, and on what we have done previously (Maddox and Lee, 2018; Polonenko and Maddox, 2019). SNR was calculated by comparing the variance in the MLR time interval 0–30 ms (for responses high-pass filtered at 30 Hz) or the ABR time interval 0–15 ms (for responses high-pass filtered at 150 Hz) to the variance in the pre-stimulus noise interval −480 to −20 ms (see Response SNR calculation in Methods for details).
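
This SNR estimate can be sketched as follows (our sketch: it assumes the response array places the 0 ms lag at sample int(t0 * fs), and it subtracts the noise power from the signal-window variance before forming the ratio; the paper’s exact formula may differ in detail):

```python
import numpy as np

def response_snr_db(resp, fs, sig_win, noise_win=(-0.48, -0.02), t0=0.5):
    """SNR in dB from the variance in a signal window, e.g. (0, 0.015)
    s for ABR or (0, 0.030) s for ABR/MLR, versus a pre-stimulus noise
    window; `resp` is 1-D with 0 ms lag at sample int(t0 * fs)."""
    def win_var(win):
        i0, i1 = int((t0 + win[0]) * fs), int((t0 + win[1]) * fs)
        return np.var(resp[i0:i1])
    sig, noise = win_var(sig_win), win_var(noise_win)
    # The signal window holds response + noise, so remove noise power
    return 10 * np.log10(max(sig - noise, np.finfo(float).tiny) / noise)
```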

Figure 8 shows the cumulative proportion of subjects who had responses with ≥ 0 dB SNR to unaltered and broadband peaky speech as a function of recording time. Acquisition times for 22 subjects were similar for responses to both unaltered and broadband peaky male-narrated speech, with 0 dB SNR achieved by 8 minutes in 50% of subjects and by 18 and 20 minutes respectively in 100% of subjects. This time was reduced to 2 and 5 minutes for 50% and 100% of subjects respectively for broadband peaky responses high-pass filtered at 150 Hz to highlight the ABR (0–15 ms interval). These times for male-narrated broadband peaky speech were confirmed in our second cohort of 11 subjects, who also all achieved 0 dB SNR within 26 minutes for the 0–30 ms MLR interval (10 / 11 subjects in 18 minutes; 50% by 10 minutes) and 4 minutes for the 0–15 ms ABR interval (50% by 2 minutes). However, acquisition times were at least 3.6 times – but up to over 10 times – longer for female-narrated broadband peaky speech, with 50% of subjects achieving 0 dB SNR by 36 minutes for the MLR interval and 8 minutes for the ABR interval. In contrast to male-narrated speech, not all subjects achieved this threshold for female-narrated speech by the end of the 32-minute recording (45% and 63% for the MLR and ABR intervals respectively). Taken together, these acquisition times confirm that responses with useful SNRs can be measured quickly for male-narrated broadband peaky speech, but longer recording sessions are necessary for narrators with higher fundamental frequencies.

Figure 8.

Cumulative proportion of subjects who have responses with ≥ 0 dB SNR as a function of recording time. Times required for unaltered (black) and broadband peaky speech (dark blue) from a male narrator are shown for 22 subjects in the left plot, and for male- (dark blue) and female-narrated (light blue) broadband peaky speech for 11 subjects in the right plot. Solid lines denote SNRs calculated using the variance of the signal high-pass filtered at 30 Hz over the ABR/MLR interval 0–30 ms, and dashed lines denote SNRs calculated on signals high-pass filtered at 150 Hz over the ABR interval 0–15 ms. Noise variance was calculated in the pre-stimulus interval −480 to −20 ms.

The longer recording times necessary for a female narrator became more pronounced for the multiband peaky speech. Figure 9A shows the cumulative proportion of subjects for responses high-pass filtered at 150 Hz, with the SNR estimated over the ABR interval. Most subjects (72%) had frequency-specific responses (common component subtracted) with ≥ 0 dB SNR for all 4 frequency bands by the end of the 32-minute recording for the male-narrated speech, but this was achieved in only 45% of subjects for the female-narrated speech. Multiband peaky speech required significantly longer recording times than broadband peaky speech, with 50% of subjects achieving 0 dB SNR by 22 minutes compared to 2 minutes for the male-narrated responses across the ABR 0–15 ms interval, and 23 minutes compared to 5 minutes for the MLR 0–30 ms interval. For the MLR interval, 72% and 18% of subjects had reached 0 dB SNR by the end of the 32-minute recording for male- and female-narrated speech respectively. Even more time was required for dichotic multiband speech, which comprised a larger number of frequency bands (Figure 9B). All 10 audiological band responses achieved ≥ 0 dB SNR in 36% of ears (8 / 22 ears from 11 subjects) by 64 minutes for male-narrated speech and in 22% of ears (5 / 22 ears) for female-narrated speech. The smaller and broader responses in the low frequency bands limited this testing time – for male-narrated speech, at least 90% of subjects had 2–4 and 4–8 kHz responses (diotic 4-band speech) with ≥ 0 dB SNR in 15 minutes, 70% of subjects had 6 frequency-specific responses (2, 4, 8 kHz bands in both ears for dichotic speech) by the end of the recording, and 50% of subjects had the 6 higher frequency-specific responses within 40 minutes. These substantial recording times suggest that deriving multiple frequency-specific responses will require more than 30 minutes per condition for < 5 bands, and more than an hour for a single condition of peaky multiband speech with 10 bands.

Figure 9.

Cumulative proportion of subjects who have frequency-specific responses (common component subtracted) with ≥ 0 dB SNR as a function of recording time. Acquisition was faster for male- (left) than female-narrated (right) multiband peaky speech with (A) 4 frequency bands presented diotically, and with (B) 5 frequency bands presented dichotically (a total of 10 responses, 5 bands in each ear). SNR was calculated by comparing the variance of signals high-pass filtered at 150 Hz across the ABR interval of 0–15 ms to the variance of noise in the pre-stimulus interval −480 to −20 ms.

DISCUSSION

The major goal of this work was to develop a method to investigate early stages of naturalistic speech processing. We re-synthesized continuous speech taken from audio books so that the phases of all harmonics aligned at each glottal pulse during voiced segments, thereby making speech as impulse-like (peaky) as possible to drive the auditory brainstem. We then used the glottal pulse trains as the regressor in deconvolution to derive the responses. Indeed, comparing waveforms to broadband peaky and unaltered speech validated the superior ability of peaky speech to evoke additional waves of the canonical ABR and MLR, reflecting neural activity from multiple subcortical structures. Robust ABR and MLR responses were recorded in less than 5 and 20 minutes respectively for all subjects, with half of the subjects exhibiting a strong ABR within 2 minutes and an MLR within 8 minutes. Longer recording times were required for the smaller responses generated by a narrator with a higher fundamental frequency. We also demonstrated the flexibility of this stimulus paradigm by simultaneously recording up to 10 frequency-specific responses to multiband peaky speech that was presented either diotically or dichotically, although these responses required much longer recording times. Taken together, our results show that peaky speech effectively yields responses from distinct subcortical structures and from different frequency bands, paving the way for new investigations of speech processing and new tools for clinical application.

For the purpose of investigating responses from different subcortical structures, we accomplished our goal of creating a stimulus paradigm that overcame some of the limitations of current methods using natural speech. Methods that do not use re-synthesized impulse-like speech generate responses characterized by a broad peak between 6 and 9 ms (Forte et al., 2017; Maddox and Lee, 2018), with contributions predominantly from the inferior colliculus (Saiz-Alía and Reichenbach, 2020). In contrast, for the majority of our subjects, peaky speech evoked responses with canonical morphology comprising waves I, III, V, P0, Na, and Pa (Figure 1), reflecting neural activity from distinct stages of the auditory system from the auditory nerve to the thalamus and primary auditory cortex (e.g., Picton et al., 1974). The presence of these additional waves allows for new investigations into the contributions of each of these neural generators to speech processing while using a continuously dynamic and ecologically salient stimulus.

The same ABR waves evoked here were also evoked by a method using embedded chirps intermixed within alternating octave bands of speech, particularly if presented monaurally over headphones instead of in the free field (Backer et al., 2019; Miller et al., 2017). Chirps are transients that compensate for the cochlear traveling wave delay by introducing different phase shifts across frequency, leading to a more synchronized response across the cochlea and a larger brainstem response than clicks evoke (Dau et al., 2000; Elberling and Don, 2008; Shore and Nuttall, 1985). The responses to embedded chirps elicited waves with larger mean amplitude than those to our broadband peaky speech (~0.4 versus ~0.2 μV, respectively), although a similar proportion of subjects had identifiable waves, and several other factors may contribute to amplitude differences. For example, higher click rates (e.g., Burkard et al., 1990; Burkard and Hecox, 1983; Chiappa et al., 1979; Don et al., 1977; Jiang et al., 2009) and higher fundamental frequencies (Maddox and Lee, 2018; Saiz-Alía et al., 2019; Saiz-Alía and Reichenbach, 2020) reduce the brainstem response amplitude, and dynamic changes in rate may create interactions across neural populations that lead to smaller amplitudes. Our stimuli kept the dynamic changes in pitch across all frequencies (instead of alternating octave bands of chirps and speech) and created impulses at every glottal pulse, with an average pitch of ~115 Hz and ~198 Hz for the male and female narrators respectively. These presentation rates were much higher and more variable than the flat 41 Hz rate at which the embedded chirps were presented (pitch flattened to 82 Hz and chirps presented at every other glottal pulse). We could evaluate whether chirps would improve response amplitude to our dynamic peaky speech by convolving the re-synthesized voiced segments with a chirp (an all-pass filter) prior to mixing the re-synthesized parts with the unvoiced segments. While maintaining the amplitude spectrum of speech, the harmonics would then have the different phases associated with chirps at each glottal pulse instead of all phases set to 0. Regardless, our peaky speech generated robust canonical responses with good SNR while maintaining a natural-sounding, if very slightly “buzzy”, quality to the speech. Overall, continuous speech re-synthesized to contain impulse-like characteristics is an effective way to elicit responses that distinguish contributions from different subcortical structures.

The latencies of the component waves of the responses to peaky speech are consistent with activity arising from known subcortical structures. The inter-wave latencies between waves I–III, III–V and I–V fall within the expected range for brainstem responses elicited by transients at 50–60 dB sensation level (SL) and 50–100 Hz rates (Burkard and Hecox, 1983; Chiappa et al., 1979; Don et al., 1977), suggesting the transmission times between auditory nerve, cochlear nucleus and rostral brainstem remain similar for speech stimuli. However, these speech-evoked waves peak at later absolute latencies than responses to transient stimuli at 60 dB SL and 90–100 Hz rates, but at latencies more similar to those for transients presented at 50 dB SL, or at 50 dB nHL in the presence of some masking noise (Backer et al., 2019; Burkard and Hecox, 1983; Chiappa et al., 1979; Don et al., 1977; Maddox and Lee, 2018; Miller et al., 2017). There are several reasons why the speech-evoked latencies may be later. First, our level of 60 dB sound pressure level (SPL) may be more similar to click levels of 50 dB SL. Second, although the spectra of both speech and transients are broad, clicks, chirps and even our previous speech stimuli (which were high-pass filtered at 1 kHz; Maddox and Lee, 2018) have relatively greater high-frequency energy than the unaltered and peaky broadband speech used in the present work. Neurons with higher characteristic frequencies respond earlier due to their basal cochlear location, and contribute relatively more to brainstem responses (e.g., Abdala and Folsom, 1995), leading to shorter latencies for stimuli that have greater high-frequency energy. Also consistent with having greater low-frequency energy, our unaltered and peaky speech responses were later than the response to the same speech segments high-pass filtered at 1 kHz (Maddox and Lee, 2018). In fact, the ABR to broadband peaky speech bore a close resemblance to the summation of each frequency-specific response and the common component to peaky multiband speech (Figure 5), with peak wave latencies representing the relative contribution of each frequency band. Third, higher stimulation rates prolong latencies due to neural adaptation, and the 115–198 Hz average fundamental frequencies of our speech were much higher than the 41 Hz embedded chirp rate and 50–100 Hz click rates (e.g., Burkard et al., 1990; Burkard and Hecox, 1983; Chiappa et al., 1979; Don et al., 1977; Jiang et al., 2009). The effect of stimulation rate was also demonstrated by the later ABR wave I, III, and V peak latencies for the female narrator with the higher average fundamental frequency of 198 Hz (Figure 3A&C). Fourth, continuous speech is dynamic, with much greater variability in pitch than the variability in presentation rate of typical periodic stimulation with clicks, chirps or short syllables. Across all our 64-second speech segments, the average standard deviation in pitch was 28.5 and 50.6 Hz for the male and female narrators respectively, with pitch varying from minima of 61 and 91 Hz to maxima of 288 and 456 Hz, respectively. This variability in rate over a continuous stimulus may create interactions across different neural populations that delay responses compared to relatively regular transient stimuli. Therefore, the differing characteristics of typical periodic transients (such as clicks and chirps) and continuous speech may give rise to differences in brainstem responses, even though they share canonical waveforms arising from similar contributing subcortical structures.

Latency of the peaky speech-evoked response also differed from the non-standard, broad responses to unaltered speech. However, latencies from these waveforms are difficult to compare due to the differing morphology and the different analyses that were used to derive the responses. Evidence for the effect of analysis comes from the fact that the same EEG collected in response to peaky speech could be regressed with pulse trains to give canonical ABRs (Figures 1, 2), or regressed with the half-wave rectified peaky speech to give the different, broad waveform (Supplemental Figure 1). Furthermore, non-peaky continuous speech stimuli with similar ranges of fundamental frequencies (between 100–300 Hz) evoke non-standard, broad brainstem responses that also differ in morphology and latency depending on whether the EEG is analyzed by deconvolution with the half-wave rectified speech (Figure 2, Maddox and Lee, 2018) or complex cross-correlation with the fundamental frequency waveform (Forte et al., 2017). Therefore, again, even though the inferior colliculus and lateral lemniscus may contribute to generating these different responses (Møller and Jannetta, 1983; Saiz-Alia and Reichenbach, 2020; Starr and Hamilton, 1976), the morphology and latency may differ (sometimes substantially) depending on the analysis technique used.

In addition to evoking canonical brainstem responses, peaky speech can be exploited for other traditional uses of the ABR, such as investigating subcortical responses across different frequencies. Frequency-specific responses were measurable with two different types of multiband peaky speech: 4 frequency bands presented diotically (Figures 4, 6), and 5 frequency bands presented dichotically (Figure 7). Peak wave latencies of these responses decreased with increasing band frequency in a similar way to responses evoked by tone pips (Gorga et al., 1988; Rasetshwane et al., 2013), thereby representing activity evoked from different areas across the cochlea. Interestingly, the frequency-specific responses were similar in amplitude across frequency bands (Figures 4, 6, 7) even though the relative energy of each band decreased with increasing frequency, resulting in a ~30 dB difference between the lowest and highest frequency bands (Figure 12). A comparatively large response to the higher frequency bands is consistent with the relatively greater contribution of neurons with higher characteristic frequencies to ABRs (Abdala and Folsom, 1995), as well as with the higher levels needed to elicit low-frequency responses to tone pips close to threshold (Gorga et al., 2006, 1993; Hyde, 2008; Stapells and Oates, 1997). Also, canonical waveforms were derived in the higher frequency bands of diotically presented speech, with waves I and III identifiable in most subjects. Measuring waves I, III, and V of high-frequency responses may have applications for studying cochlear synaptopathy (Liberman et al., 2016) using naturalistic speech in humans. Another exciting application is the evaluation of supra-threshold hearing across frequency in toddlers and individuals who do not provide reliable behavioral responses, as they may be more amenable to sitting for longer periods of time while listening to a narrated story than to a series of tone pips. An extension of this assessment would be to evaluate neural speech processing in the context of hearing loss, as well as rehabilitation strategies such as hearing aids and cochlear implants. Therefore, the ability of peaky speech to yield both canonical waveforms and frequency-specific responses makes this paradigm a flexible method that assesses speech processing in new ways.

Having established that peaky speech is a flexible stimulus for investigating different aspects of speech processing, there are several practical considerations for using the peaky speech paradigm. First, filtering should be performed carefully. As recommended in Maddox and Lee (2018), causal filters – whose impulse responses are zero at negative lags – should be used to ensure that cortical activity at later peak latencies does not spuriously influence the earlier peaks of subcortical origin. Applying less aggressive, low-order filters (i.e., broadband with shallow roll-offs) also helps limit the latency delays that causal filtering introduces. The choice of high-pass cutoff will also affect the response amplitude and morphology. After evaluating several high-pass filter orders and cutoffs, we determined that the early waves of the broadband peaky ABR were best visualized with a 150 Hz cutoff, whereas a lower cutoff frequency of 30 Hz was necessary to view both the ABR and MLR of the broadband responses. For multiband responses, the 150 Hz high-pass filter substantially reduced the response but also decreased the low-frequency noise in the pre-stimulus interval. For the 4-band multiband peaky speech the 150 Hz and 30 Hz filters provided similar acquisition times to 0 dB SNR, but better SNRs were obtained more quickly with 150 Hz filtering for the 10-band multiband peaky speech.
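
As a concrete illustration of this recommendation (a sketch; the sampling rate and the stand-in waveform are assumptions):

```python
import numpy as np
from scipy.signal import butter, lfilter

fs = 10000.0                     # assumed EEG sampling rate
resp = np.random.randn(int(fs))  # stand-in deconvolved waveform

# First-order causal Butterworth high-pass at 150 Hz. lfilter applies
# the filter causally; filtfilt would be zero-phase but acausal,
# letting later cortical activity leak into earlier subcortical lags.
b, a = butter(1, 150 / (fs / 2), btype="highpass")
abr_view = lfilter(b, a, resp)
```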

Second, the choice of narrator impacts the responses to both broadband and multiband peaky speech. Although the overall morphology was similar, the male-narrated responses were larger, contained more clearly identifiable component waves in a greater proportion of subjects, and achieved a 0 dB SNR at least 3.6 to over 10 times faster than those evoked by the female narrator. These differences likely stemmed from the ~83 Hz difference in average pitch (115 versus 198 Hz), as higher stimulation rates evoke smaller responses due to adaptation and refractoriness (e.g., Burkard et al., 1990; Burkard and Hecox, 1983; Chiappa et al., 1979; Don et al., 1977; Jiang et al., 2009). Indeed, a 50 Hz increase in fundamental frequency yields a 24% reduction in the modeled auditory brainstem response derived as the complex cross-correlation with the fundamental frequency (Saiz-Alía and Reichenbach, 2020). The narrator differences exhibited in the present study may be larger than those in other studies with continuous speech (Forte et al., 2017; Maddox and Lee, 2018; Saiz-Alía et al., 2019) as a result of the different regressors. These response differences do not preclude using narrators with higher fundamental frequencies in future studies, but the time required for usable responses from each narrator must be considered when planning experiments, and caution must be taken when interpreting comparisons between conditions with differing narrators. The strongest results will come from comparing responses to the same narrator (or even the same speech recordings) under different experimental conditions.

Third, the necessary recording time depends on the chosen SNR threshold, experimental demands, and stimulus. We chose a threshold SNR of 0 dB based on when waveforms became clearly identifiable, but of course a different threshold would change our recording time estimates (though, notably, not the ratios between them). With this SNR threshold, acquisition times were quick enough for broadband peaky responses to allow multiple conditions in a reasonable recording session. With male-narrated broadband peaky speech, all subjects achieved 0 dB SNR ABRs in < 5 minutes and MLRs in < 20 minutes, thereby affording between 3 and 12 conditions in an hour-long recording session. These recording times are comparable to, if not faster than, the 8 minutes for the broad response to unaltered speech, 6–12 minutes for the chirp-embedded speech (Backer et al., 2019), ~10 minutes for the broad complex cross-correlation response to the fundamental waveform (Forte et al., 2017), and 33 minutes for the broad response to high-passed continuous speech (Maddox and Lee, 2018). However, using a narrator with a higher fundamental frequency could increase testing time by 3- to over 10-fold. In this experiment, at most 2 conditions per hour could be tested with the female-narrated broadband peaky speech. Furthermore, longer testing times are likely needed, even for male-narrated speech, in order to reliably compare differences in the smaller amplitude component waves I and III of the ABR. Our 30- to 40-minute recording sessions provided robust responses with very good SNRs to evaluate the earlier ABR waves, but recordings this long may be unnecessary. The cumulative distribution functions in Figure 8 suggest that 12 to 20 minutes should constitute ample time to generate comparable responses with highly positive SNRs. Unlike broadband peaky speech, the testing times required for all frequency-specific responses to reach 0 dB SNR were significantly longer, making only 1 condition feasible within a recording session. At least 30 minutes was necessary for the diotically presented multiband peaky speech with 4 frequency bands, but based on extrapolated testing times, about 56–88 minutes is required for 90% of subjects to achieve this threshold for all 4 bands. For dichotically presented multiband peaky speech with 5 frequency bands (for a total of 10 frequency-specific waveforms), only 36% of the responses achieved 0 dB SNR within an hour. Extrapolated testing times suggest that over 2 hours is required for at least 75% of subjects, limiting the feasibility or utility of multiband peaky speech with several frequency bands.

Fourth, as mentioned above, increasing the number of frequency bands in multiband peaky speech decreases SNR and increases testing time. Although it is possible to simultaneously record up to 10 frequency-specific responses, the significant time required to obtain decent SNRs reduces the feasibility of testing multiple conditions or of keeping recording sessions under 1–2 hours. However, pursuing shorter testing times with multiband peaky speech is possible. Depending on the experimental question, different multiband options could be considered. For male-narrated speech, the 2–4 and 4–8 kHz responses had good SNRs and exhibited waves I, III, and V within 15 minutes for 90% of subjects. Therefore, if researchers were more interested in comparing responses in these higher frequency bands, they could stop recording once these bands reach threshold but before the lower frequency bands reach criterion (i.e., within 15 minutes). Alternatively, the lower frequencies could be combined into a single broader band in order to reduce the total number of bands, or the intensity could be increased to evoke responses with larger amplitudes. Therefore, different band and parameter considerations could reduce testing time and improve the feasibility, and thus utility, of multiband peaky speech.

Fifth, and finally, a major advantage of deconvolution analysis is that the analysis window for the response can be extended arbitrarily in either direction to include a broader range of latencies (Maddox and Lee, 2018). Extending the pre-stimulus window leftward provides a better estimate of the SNR, and extending the window rightward allows parts of the response that come after the ABR and MLR, which are driven by the cortex, to be analyzed as well. These later responses can be evaluated in response to broadband peaky speech, but as shown in Figures 6 and 7, only ABR and early MLR waves are present in the frequency-specific responses. The same broadband peaky speech data from Figure 3 are displayed with an extended time window in Figure 10, which shows component waves of the ABR, MLR and late latency responses (LLR). Thus, this method allows us to simultaneously investigate speech processing ranging from the earliest level of the auditory nerve all the way through the cortex without requiring extra recording time. Usually the LLR is larger than the ABR/MLR, but our subjects were encouraged to relax and rest, yielding a passive LLR. Awake and attentive subjects may yield larger LLRs; however, other studies that presented continuous speech to attentive subjects also reported smaller and different LLRs (Backer et al., 2019; Maddox and Lee, 2018), possibly reflecting cortical adaptation to a continuous stimulus. Here we used a simple 2-channel montage that is optimized for recording ABRs, but a full multi-channel montage could also be used to more fully explore the interactions between subcortical and cortical processing of naturalistic speech. The potential for new knowledge about how the brain processes naturalistic and engaging stimuli cannot be overstated.

Figure 10.

The range of lags can be extended to allow early, middle and late latency responses to be analyzed from the same recording to broadband peaky speech. Average waveforms across subjects (areas show ± 1 SEM) are shown for responses measured to 32 minutes of broadband peaky speech narrated by a male (dark blue) and female (light blue). Responses were high-pass filtered at 30 Hz using a first order Butterworth filter, but different filter parameters can be used to focus on each stage of processing. Canonical waves of the ABR, MLR and LLR are labeled for the male-narrated speech. Due to adaptation, amplitudes of the late potentials are smaller than typically seen with other stimuli that are shorter in duration with longer inter-stimulus intervals than our continuous speech. Waves I and III become more clearly visible by applying a 150 Hz high-pass cutoff.

The flexibility of peaky speech to evoke broadband and frequency-specific responses from distinct subcortical structures facilitates new lines of inquiry, both in neuroscientific and clinical domains. Speech often occurs within a mixture of sounds, such as other speech sources, background noise, or music. Furthermore, visual cues from a talker’s face are often available to aid speech understanding, particularly in environments with low SNR (e.g., Bernstein and Grant, 2009; Grant et al., 2007). Peaky speech allows investigation into the complex subcortical processing that underpins successful listening in these scenarios using naturalistic, engaging tasks. Indeed, previous methods have been quite successful in elucidating cortical processing of speech under these conditions (O’Sullivan et al., 2019; Teoh and Lalor, 2019). Finally, as mentioned above, the ability to customize peaky speech for measuring frequency-specific responses offers potential applications to clinical research, facilitating assessment of supra-threshold hearing function and of changes following intervention strategies and technologies.

In summary, the peaky speech paradigm is a viable method for recording canonical waveforms and frequency-specific responses to an engaging, continuous speech stimulus. The customizability and flexibility of peaky speech facilitates new ways of investigating subcortical contributions to speech processing and holds great potential for future implementation into clinical assessment of hearing function.

METHODS

Participants

Data were collected over 3 experiments that were conducted under a protocol approved by the University of Rochester Research Subjects Review Board. All subjects gave informed consent before the experiment began and were compensated for their time. In each of experiments 1 and 2, there were equipment problems during testing for one subject, rendering data unusable in the analyses. Therefore, there were a total of 22, 11, and 11 subjects included in experiments 1, 2, and 3 respectively. Four subjects completed both experiments 1 and 2, and 2 subjects completed both experiments 2 and 3. The 38 unique subjects (25 females, 66%) were aged 18–32 years with a mean ± SD age of 23.0 ± 3.6 years. Audiometric screening confirmed subjects had normal hearing in both ears, defined as thresholds ≤ 20 dB HL from 250 to 8000 Hz. All subjects identified English as their primary language.

Stimulus presentation and EEG measurement

In each experiment, subjects listened to 128 minutes of continuous speech stimuli while reclined in a darkened sound booth. They were not required to attend to the speech and were encouraged to relax and to sleep. Speech was presented at an average level of 65 dB SPL over ER-2 insert earphones (Etymotic Research, Elk Grove, IL) plugged into an RME Babyface Pro digital soundcard (RME, Haimhausen, Germany) via an HB7 headphone amplifier (Tucker Davis Technologies, Alachua, FL). Stimulus presentation was controlled by a custom python script using publicly available software (available at https://github.com/LABSN/expyfun; Larson et al., 2014). We interleaved conditions in order to prevent slow impedance drifts or transient periods of higher EEG noise from unevenly affecting one condition over the others. Physical measures to reduce stimulus artifact included: 1) hanging earphones from the ceiling so that they were as far away from the EEG cap as possible; and 2) sending an inverted signal to a dummy earphone (blocked tube) attached in the same physical orientation to the stimulus presentation earphones in order to cancel the electromagnetic fields radiating from the transducers. The soundcard also produced a digital signal at the start of each epoch, which was converted to trigger pulses through a custom trigger box (modified from a design by the National Acoustic Laboratories, Sydney, NSW, Australia) and sent to the EEG system so that audio and EEG data could be synchronized with sub-millisecond precision.

EEG was recorded using BrainVision’s PyCorder software. Ag/AgCl electrodes were placed at the high forehead (FCz, active non-inverting), left and right earlobes (A1, A2, inverting references), and the frontal pole (Fpz, ground). These were plugged into an EP-Preamp system specifically for recording ABRs, connected to an ActiCHamp recording system, both manufactured by BrainVision. Data were sampled at 10,000 Hz and high-pass filtered at 0.1 Hz. Offline, raw data were high-pass filtered at 1 Hz using a first-order causal Butterworth filter to remove slow drift in the signal, and then notch filtered with 5 Hz wide second-order infinite impulse response (IIR) notch filters to remove 60 Hz and its first 3 odd harmonics (180, 300, 420 Hz). To optimize parameters for viewing the ABR and MLR components of peaky speech responses, we evaluated several orders and high-pass cutoffs of the filters. Early waves of the broadband peaky ABRs were best visualized with a 150 Hz cutoff, whereas a lower cutoff frequency of 30 Hz was necessary to view the ABR and MLR of the broadband responses. Conservative filtering with a first-order filter was sufficient with these cutoff frequencies.

Speech stimuli and conditions

Speech stimuli were taken from two audiobooks. The first was The Alchemyst (Scott, 2007), read by a male narrator and used in all 3 experiments. The second was A Wrinkle in Time (L’Engle, 2012), read by a female narrator and used in experiments 2 and 3. These stimuli were used in Maddox and Lee (2018), but in that study a gentle high-pass filter was applied, which was not done here. Briefly, the audiobooks were resampled to 44,100 Hz and then silent pauses were truncated to 0.5 s. Speech was segmented into 64 s epochs with 1 s raised cosine fade-in and fade-out. Because conditions were interleaved, the last 4 s of a segment were repeated in the next segment so that subjects could pick up where they left off if they were listening.

In experiment 1 subjects listened to 3 conditions of male speech (42.7 min each): unaltered speech, re-synthesized broadband peaky speech, and re-synthesized multiband peaky speech (see below for a description of re-synthesized speech). In experiment 2 subjects listened to 4 conditions of re-synthesized peaky speech (32 minutes each): male and female narrators of both broadband and multiband peaky speech. For these first 2 experiments, speech was presented diotically (same speech to both ears). In experiment 3 subjects listened to both male and female dichotic (different speech in each ear) multiband peaky speech designed for audiological applications (64 min each). The same 64 s of speech was presented simultaneously to each ear, but the stimuli were dichotic due to how the re-synthesized multiband speech was created (see below).

Stimulus design

The brainstem responds best to impulse-like stimuli, so we re-synthesized the speech segments from the audiobooks (termed “unaltered”) to create 3 types of “peaky” speech, with the objectives of 1) evoking additional waves of the ABR reflecting other neural generators, and 2) measuring responses to different frequency regions of the speech. The process is described in detail below, but is best read in tandem with the code that will be publicly available (https://github.com/maddoxlab). Figure 11 compares the unaltered speech and re-synthesized broadband and multiband peaky speech. Comparing the pressure waveforms shows that the peaky speech is as click-like as possible, but comparing the spectrograms (how sound varies in amplitude at every frequency and time point) shows that the overall spectrotemporal content that defines speech is basically unchanged by the re-synthesis. See supplementary files for audio examples of each stimulus type for both narrators.

Figure 11.

Unaltered speech waveform (top left) and spectrogram (top right) compared to re-synthesized broadband peaky speech (middle left and right) and multiband peaky speech (bottom left and right). Comparing waveforms shows that the peaky speech is as “click-like” as possible, while comparing the spectrograms shows that the overall spectrotemporal content that defines speech is basically unchanged by the re-synthesis. A naïve listener is unlikely to notice that any modification has been performed, and subjective listening confirms the similarity. Yellow/lighter colors represent larger amplitudes than purple/darker colors in the spectrogram. See supplementary files for audio examples of each stimulus type for both narrators.

Broadband peaky speech

Voiced speech comprises rapid openings and closings of the vocal folds, which are then filtered by the mouth and vocal tract to create different vowel and consonant sounds. The first processing step in creating peaky speech was to use speech processing software (PRAAT; Boersma and Weenink, 2018) to extract the times of these glottal pulses. Sections of speech where glottal pulses were within 17 ms of each other were considered voiced (vowels and voiced consonants like /z/). 17 ms is the longest inter-pulse interval one would expect in natural speech because it is the inverse of 60 Hz, the lowest pitch at which someone with a deep voice would likely speak. A longer gap in pulse times was considered a break between voiced sections. These segments were encoded in a “mixer” function of time, with values of 1 indicating unvoiced and 0 indicating voiced segments (this function would later be responsible for time-dependent blending of re-synthesized and natural speech, hence its name). Transitions of the binary mixer function were smoothed using a raised cosine envelope spanning the time between the first and second pulses, as well as the last two pulses, of each voiced segment. During voiced segments, the glottal pulses set the fundamental frequency of speech (i.e., pitch), which was allowed to vary from a minimum to maximum of 60–350 Hz for the male narrator and 90–500 Hz for the female narrator. For the male and female narrators, these pulses gave a mean ± SD fundamental frequency (i.e., pulse rate) in voiced segments of 115.1 ± 6.7 Hz and 198.1 ± 20 Hz respectively, and a mean ± SD pulse rate over the entire 64 s, inclusive of unvoiced periods and silences, of 69.1 ± 5.7 Hz and 110.8 ± 11.4 Hz respectively. These pulse times were smoothed using 10 iterations of replacing pulse time $p_i$ with the mean of pulse times $p_{i-1}$ to $p_{i+1}$ whenever the absolute difference between the log2-transformed intervals on either side of $p_i$ (i.e., from $p_{i-1}$ to $p_i$ and from $p_i$ to $p_{i+1}$) was less than $\log_2(1.6)$.
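To make the smoothing criterion concrete, here is a minimal Python sketch under our reading of that rule (replace a pulse time with the three-point mean when its two adjacent inter-pulse intervals differ by less than a factor of 1.6); the function name and vectorized update are ours, not taken from the paper's code:

```python
import numpy as np

def smooth_pulse_times(pulses, n_iter=10, ratio_limit=1.6):
    """Iteratively smooth glottal pulse times (seconds), assuming the
    criterion |log2(d1) - log2(d2)| < log2(ratio_limit) on the intervals
    d1, d2 on either side of each interior pulse."""
    p = np.asarray(pulses, dtype=float).copy()
    for _ in range(n_iter):
        d1 = p[1:-1] - p[:-2]   # interval to the previous pulse
        d2 = p[2:] - p[1:-1]    # interval to the next pulse
        ok = np.abs(np.log2(d1 / d2)) < np.log2(ratio_limit)
        # Replace qualifying interior pulses with the three-point mean
        p[1:-1][ok] = (p[:-2][ok] + p[1:-1][ok] + p[2:][ok]) / 3
    return p
```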

The fundamental frequency of voiced speech is dynamic, but the signal always consists of a set of integer-related frequencies (harmonics) with different amplitudes and phases. To create the waveform component at the fundamental frequency, $f_0(t)$, we first created a phase function, $\varphi(t)$, which increased smoothly by $2\pi$ between glottal pulses within the voiced sections as a result of cubic interpolation. We then computed the spectrogram of the unaltered speech waveform – which is a way of analyzing sound that shows its amplitude at every time and frequency (Figure 11, top-right) – which we called $A[t, f_0(t)]$. We then created the fundamental component of the peaky speech waveform as:

$$s_0(t) = A[t, f_0(t)]\,\cos\big(\varphi(t)\big)$$

This waveform has an amplitude that changes according to the spectrogram but always peaks at the time of the glottal pulses.

Next the harmonics of the speech were synthesized. The $k$th harmonic of speech is at a frequency of $(k+1)f_0$, so we synthesized each harmonic waveform as:

$$s_k(t) = A[t, (k+1)f_0(t)]\,\cos\big((k+1)\varphi(t)\big)$$

Each of these harmonic waveforms has multiple peaks per period of the fundamental, but every harmonic also has a peak at exactly the time of the glottal pulse. Because of these coincident peaks, when the harmonics are summed to create the re-synthesized voiced speech, there is always a large peak at the time of the glottal pulse. In other words, the phases of all the harmonics align at each glottal pulse, making the pressure waveform of the speech appear “peaky” (left-middle panel of Figure 11).
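A compact Python sketch of this synthesis step is below. The helper `spectrogram_amp`, its interface, and the number of harmonics are our own illustrative assumptions, standing in for the paper's interpolation of $A[t, f]$:

```python
import numpy as np

def synthesize_peaky_voiced(phi, f0, spectrogram_amp, fs, n_harmonics=40):
    """Sum the fundamental and harmonics so all components peak together.

    phi : phase function (radians), advancing 2*pi between glottal pulses
    f0  : instantaneous fundamental frequency (Hz), same length as phi
    spectrogram_amp : hypothetical callable A(t, f) returning the unaltered
        speech's spectrogram amplitude at sample indices t and frequencies f
    """
    t = np.arange(len(phi))
    out = np.zeros(len(phi))
    for k in range(n_harmonics + 1):  # k = 0 is the fundamental
        f_k = (k + 1) * f0
        amp = spectrogram_amp(t, f_k)
        amp[f_k > fs / 2] = 0.0  # drop harmonics above Nyquist
        # Cosine of (k+1)*phi peaks at every glottal pulse, so all
        # harmonics align there and the sum is maximally "peaky"
        out += amp * np.cos((k + 1) * phi)
    return out
```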

The resultant re-synthesized speech contained only the voiced segments of speech and was missing unvoiced sounds like /s/ and /k/. Thus the last step was to mix the re-synthesized voiced segments with the original unvoiced parts. This was done by cross-fading back and forth between the unaltered speech and re-synthesized speech during the unvoiced and voiced segments respectively, using the binary mixer function created when determining where the voiced segments occurred. We also filtered the peaky speech to an upper limit of 8 kHz, and used the unaltered speech above 8 kHz, to improve the quality of voiced consonants such as /z/. Filter properties for the broadband peaky speech are further described below in the “Band filters” subsection.

Multiband peaky speech

The same principles used to generate broadband peaky speech were applied to create stimuli designed to investigate the brainstem’s response to different frequency bands that comprise speech. This makes use of the fact that over time, speech signals with slightly different $f_0$ are independent, or have (nearly) zero cross-correlation, at the lags relevant for the ABR. To make each frequency band of interest independent, we shifted the fundamental frequency and created a fundamental waveform and its harmonics as:

$$s_k(t) = A[t, (k+1)f_0(t)]\,\cos\big((k+1)\varphi_\Delta(t)\big)$$

where

$$\varphi_\Delta(t) = \varphi(t) + 2\pi f_\Delta t$$

and where $f_\Delta$ is the small shift in fundamental frequency.

In these studies, we shifted the fundamental for each successive frequency band upward by the square root of each successive prime number minus one, resulting in a few tenths of a hertz difference between bands. The first, lowest frequency band contained the un-shifted $f_0$. Responses to this lowest, un-shifted frequency band showed some differences from the common component for latencies > 30 ms that were not present in the other, higher frequency bands (Figure 4, 0–1 kHz band), suggesting some low-frequency privilege/bias in this response. Therefore, we suggest that future studies create independent frequency bands by synthesizing a new fundamental for each band. The static shifts described above could be used, but we suggest an alternative method that introduces random dynamic frequency shifts of up to ±1 Hz over the duration of the stimulus. From this random frequency shift we can compute a dynamic random phase shift, to which we also add a random starting phase, $\theta_\Delta$, drawn from a uniform distribution between 0 and $2\pi$. The phase function from the above set of formulae would be replaced with this random dynamic phase function:

$$\varphi_\Delta(t) = \varphi(t) + 2\pi \int_0^t f_\Delta(\tau)\,d\tau + \theta_\Delta$$
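As an illustration, a minimal Python sketch of generating such a randomly shifted phase function follows; generating $f_\Delta(t)$ by interpolating coarse uniform samples is one plausible choice of slowly varying shift, not necessarily the one used in the paper's code:

```python
import numpy as np

rng = np.random.default_rng()

def random_dynamic_phase(phi, fs, max_shift=1.0):
    """Return phi plus a random dynamic phase shift for one band.

    f_delta(t) varies slowly within +/- max_shift Hz (coarse samples
    roughly every second, linearly interpolated); its running integral
    gives the phase shift, and theta_delta is a random starting phase.
    """
    n = len(phi)
    coarse = rng.uniform(-max_shift, max_shift, size=max(int(n / fs), 2))
    f_delta = np.interp(np.arange(n),
                        np.linspace(0, n - 1, len(coarse)), coarse)
    theta_delta = rng.uniform(0, 2 * np.pi)
    # cumsum / fs approximates the integral of f_delta over time
    return phi + 2 * np.pi * np.cumsum(f_delta) / fs + theta_delta
```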

This random f0 shift method is described further in the supplementary material, and validation data from one subject are provided in Supplemental Figure 2. Responses from all four bands show a more consistent resemblance to the common component, indicating that this method is effective at reducing stimulus-related bias. However, low-frequency dependent differences remained, suggesting there is also unique neural low-frequency activity in the speech-evoked responses.

This re-synthesized speech was then band-pass filtered to the frequency band of interest (e.g., 0–1 kHz or 2–4 kHz). This process was repeated for each independent frequency band; the bands were then mixed together, and these re-synthesized voiced parts were mixed with the original unaltered unvoiced speech. This peaky speech comprised octave bands with center frequencies of 707, 1414, 2828, 5656 Hz for experiments 1 and 2, and of 500, 1000, 2000, 4000, 8000 Hz for experiment 3. Note that for the lowest band, the actual center frequency was slightly lower because the filters were set to pass all frequencies below the upper cutoff. Filter properties for these two types of multiband speech are shown in the middle and right panels of Figure 14 and further described below in the “Band filters” subsection. For the dichotic multiband peaky speech, we created 10 fundamental waveforms – 2 in each of the five filter bands for the two different ears – making the output audio file stereo (or dichotic). We also filtered this dichotic multiband peaky speech to an upper limit of 11.36 kHz to allow for the highest band to have a center frequency of 8 kHz and octave width. The relative mean-squared magnitude in decibels for components of the multiband peaky speech (4 filter bands) and dichotic (audiological) multiband peaky speech (5 filter bands) are shown in Figure 12.

Figure 12.

Relative mean-squared magnitude in decibels of multiband peaky speech with 4 filter bands (left) and 5 filter bands (right) for male- (blue) and female- (orange) narrated speech. The full audio comprises unvoiced and re-synthesized voiced sections, and is what was presented to the subjects during the experiments. The other bands reflect the relative magnitude of the voiced sections (voiced only) and of each filtered frequency band.

For peaky speech, the re-synthesized speech waveform was presented during the experiment, but the pulse trains were used as the input stimulus for calculating the response (i.e., the regressor; see Response derivation section below). These pulse trains all began and ended together in conjunction with the onset and offset of voiced sections of the speech. To verify which frequency ranges of the multiband pulse trains were independent across frequency bands, and would thus yield truly band-specific responses, we conducted a spectral coherence analysis on the pulse trains. All 60 unique 64 s sections of each male- and female-narrated multiband peaky speech used in the three experiments were sliced into 1 s segments for a total of 3,840 slices. Phase coherence across frequency was then computed across these slices for each combination of pulse trains according to the formula:

$$C_{xy} = \frac{\left|\,\mathrm{E}\!\left[\mathcal{F}(x_i)\,\mathcal{F}(y_i)^{*}\right]\right|}{\sqrt{\mathrm{E}\!\left[\left|\mathcal{F}(x_i)\right|^{2}\right]\,\mathrm{E}\!\left[\left|\mathcal{F}(y_i)\right|^{2}\right]}}$$

where $C_{xy}$ denotes coherence between bands $x$ and $y$, $\mathrm{E}[\,]$ the average across slices, $\mathcal{F}$ the fast Fourier transform, $*$ complex conjugation, $x_i$ the pulse train for slice $i$ in band $x$, and $y_i$ the pulse train for slice $i$ in band $y$.
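A direct Python translation of this coherence computation, as a sketch (the array shapes and use of the real FFT are our choices):

```python
import numpy as np

def band_coherence(x_slices, y_slices):
    """Spectral coherence between two bands' pulse trains.

    x_slices, y_slices : arrays of shape (n_slices, n_samples), the 1 s
    slices of the unit-impulse pulse trains for the two bands.
    Implements C_xy = |E[X Y*]| / sqrt(E[|X|^2] E[|Y|^2]) across slices.
    """
    X = np.fft.rfft(x_slices, axis=-1)
    Y = np.fft.rfft(y_slices, axis=-1)
    num = np.abs(np.mean(X * np.conj(Y), axis=0))
    den = np.sqrt(np.mean(np.abs(X) ** 2, axis=0) *
                  np.mean(np.abs(Y) ** 2, axis=0))
    return num / den  # coherence at each FFT frequency bin
```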

Spectral coherence for each narrator is shown in Figure 13. For the 4-band multiband peaky speech used in experiments 1 and 2 there were 6 pulse train comparisons. For the audiological multiband peaky speech used in experiment 3, there were 5 bands for each of 2 ears, resulting in 10 pulse trains and 45 comparisons. All 45 comparisons are shown in Figure 13. Pulse trains were coherent (> 0.1) up to a maximum of 71 and 126 Hz for male- and female-narrated speech respectively, which roughly corresponds to the mean ± SD pulse rates (calculated as total pulses / 64 s) of 69.1 ± 5.7 Hz and 110.8 ± 11.4 Hz respectively. This means that above ~130 Hz the stimuli were no longer coherent and evoked frequency-specific responses. Importantly, at frequencies below this cutoff the stimuli were correlated, so responses would not be frequency-specific and would contain a low-frequency component that is present in (or common to) all band responses.

Figure 13.

Spectral coherence of pulse trains for multiband peaky speech narrated by a male (left) and female (right). Spectral coherence was computed across 1 s slices from 60 unique 64 s multiband peaky speech segments (3,840 total slices) for each combination of bands. Each light gray line represents the coherence for one band comparison. There were 45 comparisons across the 10-band (audiological) speech used in experiment 3 (5 frequency bands x 2 ears). Pulse trains (i.e., the input stimuli, or regressors, for the deconvolution) were coherent across bands below 72 Hz for the male multiband speech and 126 Hz for the female multiband speech.

To identify the effect of the low-frequency stimulus coherence in the responses, we computed the common component across pulse trains by creating an averaged response to 6 additional “fake” pulse trains that were created during stimulus design but were not used during creation of the multiband peaky speech wav files. The common component was assessed for both “fake” pulse trains taken from shifts lower than the original fundamental frequency and those taken from shifts higher than the highest “true” re-synthesized fundamental frequency. To assess frequency-specific responses to multiband speech, we subtracted this common component from the band responses. Alternatively, one could simply high-pass filter the stimuli at 150 Hz using a first-order causal Butterworth filter (being mindful of edge artifacts). However, this high-pass filtering reduces response amplitude and may affect response detection (see Results for more details).

We also verified the independence of the stimulus bands by treating the regressor pulse train as the input to a system whose output was the rectified stimulus audio and performed deconvolution (see Deconvolution and Response derivation section below). Further details are provided in the supplementary material. The responses are given in Supplemental Figure 5, and showed that the non-zero responses only occurred when the correct pulse train was paired with the correct audio.

Band filters

Because the fundamental frequencies for each frequency band were designed to be independent over time, the band filters for the speech were designed to cross over in frequency at half power. To make each filter, the amplitude was set by taking the square root of the specified power at each frequency. Octave band filters were constructed in the frequency domain by applying trapezoids, with a flat center bandwidth and roll-off widths of 0.5 octaves. For the first (lowest frequency) band, all frequencies below the high-pass cutoff were set to 1, and likewise all frequencies above the low-pass cutoff for the last (highest frequency) band were set to 1 (Figure 14, top row). The impulse response of each filter was assessed by shifting the inverse FFT (IFFT) of the band so that time zero was in the center, and then applying a Nuttall window, thereby truncating the impulse response to a length of 5 ms (Figure 14, middle row). The actual frequency response of the filter bands was assessed by taking the FFT of the impulse response and plotting the magnitude (Figure 14, bottom row).
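The following Python sketch shows one way to build such a filter under these specifications; the exact trapezoid placement and the FFT length are our assumptions, not taken from the paper's code:

```python
import numpy as np
from scipy.signal.windows import nuttall

def octave_band_filter(fc, fs, n_fft, dur=0.005):
    """Build one octave band filter's impulse response in the way the
    text describes (a sketch, assuming the trapezoid is flat over 0.5
    octave around fc and rolls off linearly over 0.5 octave per side,
    so neighboring bands cross at half power)."""
    freqs = np.fft.rfftfreq(n_fft, 1 / fs)
    octaves = np.log2(np.maximum(freqs, 1e-6) / fc)
    # Trapezoid in power: 1 inside +/-0.25 oct, 0 outside +/-0.75 oct
    power = np.clip((0.75 - np.abs(octaves)) / 0.5, 0, 1)
    amplitude = np.sqrt(power)  # amplitude = sqrt(specified power)
    # Impulse response: IFFT, shift time zero to the center, then
    # truncate to dur seconds with a Nuttall window
    h = np.fft.fftshift(np.fft.irfft(amplitude))
    n_keep = int(dur * fs)
    start = len(h) // 2 - n_keep // 2
    return h[start:start + n_keep] * nuttall(n_keep)
```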

Figure 14.

Octave band filters used to create re-synthesized broadband peaky speech (left, blue), diotic multiband peaky speech with 4 bands (middle, red), and dichotic multiband peaky speech using 5 bands with audiological center frequencies (right, red). The last band (2nd, 5th, 6th respectively, black line) was used to filter the high frequencies of unaltered speech during mixing to improve the quality of voiced consonants. The designed frequency responses using trapezoids (top) were converted into the time domain using the IFFT, shifted, and Nuttall-windowed to create impulse responses (middle), which were then used to assess the actual frequency response by converting back into the frequency domain using the FFT (bottom).

As mentioned above, broadband peaky speech was filtered to an upper limit of 8 kHz for diotic peaky speech and 11.36 kHz for dichotic peaky speech. This band filter was constructed from the second-to-last octave band filter of the multiband filters (i.e., the 4–8 kHz band from the top-middle of Figure 14, dark red line) by setting the amplitude of all frequencies less than the high-pass cutoff frequency to 1 (Figure 14, top-left panel, blue line). As mentioned above, unaltered (unvoiced) speech above 8 kHz (diotic) or 11.36 kHz (dichotic) was mixed with the broadband and multiband peaky speech, which was accomplished by applying the last (highest) octave band filter (8+ or 11.36+ kHz band, black line) to the unaltered speech and mixing this band with the re-synthesized speech created using the other bands.

Alternating polarity

To limit stimulus artifact, we also alternated polarity between segments of speech. To identify regions where polarity could be flipped, the envelope of the speech was extracted using a first-order causal Butterworth low-pass filter with a cutoff frequency of 6 Hz applied to the absolute value of the waveform. Flip indices were then identified where the envelope dropped below 1 percent of the median envelope value, and a function that changed back and forth between 1 and −1 at each flip index was created. This flipping function was smoothed using another first-order causal Butterworth low-pass filter with a cutoff frequency of 10,000 Hz, and was then multiplied with the re-synthesized speech before saving to a wav file.
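A minimal Python sketch of this polarity-alternation step, under the description above (the function name and default audio sampling rate are our own):

```python
import numpy as np
from scipy import signal

def alternate_polarity(audio, fs=44100):
    """Flip waveform polarity wherever the speech envelope dips to near
    silence, per the procedure described in the text (a sketch)."""
    # Envelope: |audio| low-passed at 6 Hz (first-order causal Butterworth)
    b, a = signal.butter(1, 6 / (fs / 2), btype="lowpass")
    env = signal.lfilter(b, a, np.abs(audio))
    # Flip indices: where the envelope drops below 1% of its median
    quiet = env < 0.01 * np.median(env)
    flips = np.flatnonzero(np.diff(quiet.astype(int)) == 1) + 1
    # Function toggling between +1 and -1 at each flip index
    toggles = np.zeros(len(audio), dtype=int)
    toggles[flips] = 1
    sign = np.where(np.cumsum(toggles) % 2 == 0, 1.0, -1.0)
    # Smooth the abrupt transitions with a 10 kHz causal low-pass
    b2, a2 = signal.butter(1, 10000 / (fs / 2), btype="lowpass")
    return audio * signal.lfilter(b2, a2, sign)
```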

Response derivation

Deconvolution

The peaky-speech ABR was derived using deconvolution, as in previous work (Maddox and Lee, 2018), though the computation was performed in the frequency domain for efficiency. The speech was considered the input to a linear system whose output was the recorded EEG signal, with the ABR computed as the system’s impulse response. As in Maddox and Lee (2018), for the unaltered speech we used the half-wave rectified audio as the input waveform. Half-wave rectification was accomplished by separately calculating the response to all positive and all negative values of the input waveform for each epoch and then combining the responses together during averaging. For our new re-synthesized peaky speech, the input waveform was the sequence of impulses that occurred at the glottal pulse times and corresponded to the peaks in the waveform. Figure 15 shows a section of stimulus and the corresponding input signal of glottal pulses used in the deconvolution.

Figure 15.

Left: A segment of broadband peaky speech stimulus (top) and the corresponding glottal pulse train (bottom) used in calculating the broadband peaky speech response. Right: An example broadband peaky speech response from a single subject. The response shows ABR waves I, III, and V at ~3, 5, 7 ms respectively. It also shows later peaks corresponding to thalamic and cortical activity at ~17 and 27 ms respectively.

The half-wave rectified waveforms and glottal pulse sequences were down-sampled to the EEG sampling frequency prior to deconvolution. To avoid temporal splatter due to standard downsampling, the pulse sequences were resampled by placing unit impulses at the sample indices closest to each pulse time. Regularization was not necessary because the amplitude spectra of these regressors were sufficiently broadband. For efficiency, the time-domain response waveform, $w$, for a given 64 s epoch was calculated using frequency-domain division for the deconvolution, with the numerator the cross-spectral density (corresponding to the cross-correlation in the time domain) of the stimulus regressor and EEG response, and the denominator the power spectral density of the stimulus regressor (corresponding to its autocorrelation in the time domain). For a single epoch, that would be:

$$w = \mathcal{F}^{-1}\!\left[\frac{\mathcal{F}(x)^{*}\,\mathcal{F}(y)}{\mathcal{F}(x)^{*}\,\mathcal{F}(x)}\right]$$

where $\mathcal{F}$ denotes the fast Fourier transform, $\mathcal{F}^{-1}$ the inverse fast Fourier transform, $*$ complex conjugation, $x$ the input stimulus regressor (half-wave rectified waveform or glottal pulse sequence), and $y$ the EEG data for each epoch. We used methods incorporated into the mne-python package (Gramfort et al., 2013). In practice, we made adaptations to this formula to improve SNR with Bayesian-like averaging (see below). For multiband peaky speech the same EEG was deconvolved with the pulse train of each band separately, and then with an additional 6 “fake” pulse trains to derive the common component across bands due to the pulse train coherence at low frequencies (shown in Figure 13). The averaged response across these 6 fake pulse trains, or common component, was then subtracted from the multiband responses to identify the frequency-specific band responses.
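In Python, the single-epoch version of this computation can be sketched as follows (NumPy only; the regressor construction mirrors the unit-impulse placement described above, and the function names are ours, not mne-python's API):

```python
import numpy as np

def pulses_to_regressor(pulse_times, fs_eeg, n_samples):
    """Unit impulses at the EEG sample indices closest to each glottal
    pulse time (avoiding the temporal splatter of standard resampling)."""
    x = np.zeros(n_samples)
    idx = np.round(np.asarray(pulse_times) * fs_eeg).astype(int)
    x[idx[idx < n_samples]] = 1.0
    return x

def deconvolve_epoch(x, y):
    """Single-epoch w = IFFT[ X* Y / (X* X) ]; x is the regressor and y
    the EEG for the same 64 s epoch at the same sampling rate.
    Regularization is omitted, per the text, because the regressor's
    amplitude spectrum is sufficiently broadband."""
    X, Y = np.fft.fft(x), np.fft.fft(y)
    return np.fft.ifft(np.conj(X) * Y / np.abs(X) ** 2).real
```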

Response averaging

The quality of the ABR waveforms as a function of each type of stimulus was of interest, so we calculated the averaged response after each 64 s epoch. We followed a Bayesian-like process (Elberling and Wahlgreen, 1985) to account for variations in noise level across the recording time (such as slow drifts or movement artifacts) and to avoid rejecting data based on thresholds. Each epoch was weighted by its inverse variance, $1/\sigma_i^2$, relative to the sum of the inverse variances of all epochs. Thus, epoch weights, $b_i$, were calculated as follows:

$$b_i = \frac{1/\sigma_i^2}{\sum_{j=1}^{n} 1/\sigma_j^2}$$

where $i$ is the epoch number and $n$ is the number of epochs collected. For efficiency, weighted averaging was completed during deconvolution. Because the autocorrelation of the input stimulus (the denominator of the frequency-domain division) was similar across epochs, it was averaged with equal weighting. Therefore, the numerator of the frequency-domain division was summed across weighted epochs and the denominator averaged across epochs, according to the following formula:

$$w = \mathcal{F}^{-1}\!\left[\frac{\sum_{i=1}^{n} b_i\,\mathcal{F}(x_i)^{*}\,\mathcal{F}(y_i)}{\frac{1}{n}\sum_{i=1}^{n}\mathcal{F}(x_i)^{*}\,\mathcal{F}(x_i)}\right]$$

where $w$ is the average response waveform, and $i$ and $n$ are again the epoch number and the number of epochs collected.
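Putting the weighting and the deconvolution together, here is a sketch of the weighted average across epochs (the array shapes and the final circular shift are our choices, consistent with the circularity handling described below):

```python
import numpy as np

def weighted_average_response(regressors, eeg_epochs):
    """Bayesian-like weighted average across epochs.

    regressors, eeg_epochs : arrays of shape (n_epochs, n_samples).
    Numerator terms are weighted by normalized inverse epoch variance;
    the denominator is the equal-weight average of regressor power
    spectra, as in the formula above.
    """
    X = np.fft.fft(regressors, axis=-1)
    Y = np.fft.fft(eeg_epochs, axis=-1)
    inv_var = 1.0 / np.var(eeg_epochs, axis=-1)
    b = inv_var / inv_var.sum()                      # epoch weights b_i
    num = np.sum(b[:, None] * np.conj(X) * Y, axis=0)
    den = np.mean(np.abs(X) ** 2, axis=0)
    w = np.fft.ifft(num / den).real
    # The circular result wraps negative lags to the end; roll so lags
    # run contiguously from -32 to +32 s before any filtering
    return np.roll(w, len(w) // 2)
```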

Due to the circular nature of the discrete frequency domain deconvolution, the resulting response has an effective time interval of [0, 32] s at the beginning and [−32, 0) s at the end, so that concatenating the two – with the end first – yields the response from [−32, 32] s. Consequently, to avoid edge artifacts, all filtering was performed after the response was shifted to the middle of the 64 s time window. To remove high-frequency noise and some low-frequency noise, the average waveform was band-pass filtered between 30–2000 Hz using a first-order causal Butterworth filter. An example of this weighted average response to broadband peaky speech is shown in the right panel of Figure 15. This bandwidth of 30 to 2000 Hz is sufficient to identify additional waves in the brainstem and middle latency responses (ABR and MLR respectively). To further identify earlier waves of the auditory brainstem responses (i.e., waves I and III), responses were high-pass filtered at 150 Hz using a first-order causal Butterworth filter. This filter was determined to provide the best morphology without compromising the response by comparing responses filtered with common high-pass cutoffs of 1, 30, 50, 100 and 150 Hz each combined with first, second and fourth order causal Butterworth filters.

Response normalization

An advantage of this method over our previous one (Maddox and Lee, 2018) is that because the regressor comprises unit impulses, the deconvolved response is given in meaningful units which are the same as the EEG recording, namely microvolts. With a continuous regressor, like the half-wave rectified speech waveform, this is not the case. Therefore, to compare responses to half-wave rectified speech versus glottal pulses, we calculated a normalization factor, $g$, based on data from all subjects:

$$g = \frac{1}{n}\sum_{i=1}^{n}\frac{\sigma_{p,i}}{\sigma_{u,i}}$$

where $n$ is the number of subjects, $\sigma_{u,i}$ is the SD of subject $i$’s response to unaltered speech between 0–20 ms, and $\sigma_{p,i}$ is the same for the broadband peaky speech. Each subject’s responses to unaltered speech were multiplied by this normalization factor to bring these responses within a comparable amplitude range as those to broadband peaky speech. Consequently, amplitudes were not compared between responses to unaltered and peaky speech. This was not our prime interest; rather, we were interested in latency and the presence of canonical component waves. In this study the normalization factor was 0.26, which cannot be applied to other studies because this number also depends on the scale used when storing the digital audio. In our study, this unitless scale was based on a root-mean-square amplitude of 0.01. The same normalization factor was used when the half-wave rectified speech was used as the regressor with EEG collected in response to unaltered speech, broadband peaky speech and multiband peaky speech (Figure 2, Supplemental Figure 1).

Response SNR calculation

We were also interested in the recording time required to obtain robust responses to re-synthesized peaky speech. Therefore, we calculated the time it took for the ABR and MLR to reach a 0 dB SNR. The SNR of each waveform in dB, $\mathrm{SNR}_w$, was estimated as:

$$\mathrm{SNR}_w = 10\log_{10}\!\left(\frac{\sigma_{S+N}^{2} - \sigma_{N}^{2}}{\sigma_{N}^{2}}\right)$$

where $\sigma_{S+N}^{2}$ represents the variance (i.e., mean-subtracted energy) of the waveform between 0 and 15 ms (ABR) or 30 ms (MLR), which contains both the component signals and noise ($S+N$), and $\sigma_{N}^{2}$ represents the variance of the noise, $N$, estimated by averaging the variances of 15 ms (ABR) or 30 ms (MLR) segments of the pre-stimulus baseline between −480 and −20 ms. Then the SNR for 1 min of recording, $\mathrm{SNR}_{60}$, was computed from $\mathrm{SNR}_w$ as:

$$\mathrm{SNR}_{60} = \mathrm{SNR}_w - 10\log_{10}\!\left(\frac{t_w}{60}\right)$$

where $t_w$ is the duration of the recording in seconds, as specified in the “Speech stimuli and conditions” subsection. For example, in experiment 3, the average waveform resulted from 64 min of recording, or a $t_w$ of 3,840 s. The time to reach 0 dB SNR for each subject, $t_{0\,\mathrm{dB\,SNR}}$, was estimated from this $\mathrm{SNR}_{60}$ by:

$$t_{0\,\mathrm{dB\,SNR}} = 60 \times 10^{-\mathrm{SNR}_{60}/10}$$
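These three formulas chain together directly; a minimal Python sketch follows (the function name, default windows, and the assumption that time zero sits at the center of the response array are ours):

```python
import numpy as np

def time_to_0db_snr(w, fs, t_rec, sig_end=0.015, noise_span=(-0.48, -0.02)):
    """Estimate recording time (s) needed to reach 0 dB SNR.

    w : averaged response with time zero at index len(w) // 2
    fs : sampling rate (Hz); t_rec : recording duration t_w (s)
    sig_end : 0.015 for the ABR window or 0.030 for the MLR window
    """
    zero = len(w) // 2

    def seg_var(t0, t1):
        return np.var(w[zero + int(round(t0 * fs)):zero + int(round(t1 * fs))])

    var_sn = seg_var(0.0, sig_end)  # signal-plus-noise window variance
    # Noise: average variance over consecutive pre-stimulus segments of
    # the same duration as the signal window
    starts = np.arange(noise_span[0], noise_span[1] - sig_end + 1e-9, sig_end)
    var_n = np.mean([seg_var(t0, t0 + sig_end) for t0 in starts])
    snr_w = 10 * np.log10((var_sn - var_n) / var_n)  # NaN if signal < noise
    snr_60 = snr_w - 10 * np.log10(t_rec / 60)       # SNR for 1 min
    return 60 * 10 ** (-snr_60 / 10)                 # seconds to 0 dB
```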

Cumulative distribution functions were used to show the proportion of subjects who reached an SNR ≥ 0 dB and to determine the acquisition times that can be expected for each stimulus at the group level.

Statistical Analyses

Data were checked for normality using the Shapiro-Wilk test. Waveform morphology of responses to different narrators was compared using Pearson correlations of the responses between 0 and 15 ms for the ABR waveforms, or between 0 and 40 ms for both ABR and MLR waveforms. The Wilcoxon signed-rank test was used to determine whether narrator differences (waveform correlations) were significantly different from the correlations of the same EEG split into even and odd epochs with equal numbers of epochs from each narrator. The intraclass correlation coefficient type 3 (absolute agreement) was used to verify good agreement in peak latencies chosen by an experienced audiologist and neuroscientist (MJP) at two different time points, 3 months apart. One-sample t-tests against μ = 0 were conducted on the peak latency differences of ABR/MLR waves between unaltered and broadband peaky speech. For multiband peaky speech, the component wave peak latency changes across frequency band were assessed with linear mixed effects regression using the lme4 and lmerTest packages in R (Bates et al., 2015; Kuznetsova et al., 2017; R Core Team, 2020). A likelihood ratio test was used to determine that the model significantly improved upon adding an orthogonal 2nd-order polynomial of frequency band (transformed so that each coefficient is independent), with linear and quadratic components to estimate the decrease in latency (slope) and the change in the rate of latency decrease with increasing frequency band, respectively. Random effects of subject and each frequency band term were included to account for individual variability that is not generalizable to the fixed effects. A power analysis was completed using the simR package (Green and MacLeod, 2016), which uses a likelihood ratio test on 1000 Monte Carlo permutations of the response variables based on the fitted model.

FUNDING

This work was supported by the National Institute on Deafness and Other Communication Disorders [R00DC014288] awarded to RKM.

DATA AVAILABILITY

Data and python code will be made available on the lab GitHub account (https://github.com/maddoxlab).

ACKNOWLEDGMENTS

The authors wish to thank Sara Fiscella for assistance with recruitment.

REFERENCES

1. Abdala C, Folsom RC. 1995. Frequency contribution to the click-evoked auditory brain-stem response in human adults and infants. J Acoust Soc Am 97:2394–2404. doi:10.1121/1.411961
2. Backer KC, Kessler AS, Lawyer LA, Corina DP, Miller LM. 2019. A novel EEG paradigm to simultaneously and rapidly assess the functioning of auditory and visual pathways. J Neurophysiol 122:1312–1329. doi:10.1152/jn.00868.2018
3. Bajo VM, King AJ. 2012. Cortical modulation of auditory processing in the midbrain. Front Neural Circuits 6:114. doi:10.3389/fncir.2012.00114
4. Bajo VM, Nodal FR, Moore DR, King AJ. 2010. The descending corticocollicular pathway mediates learning-induced auditory plasticity. Nat Neurosci 13:253–260. doi:10.1038/nn.2466
5. Bates D, Mächler M, Bolker B, Walker S. 2015. Fitting linear mixed-effects models using lme4. J Stat Softw 67:1–48.
6. Bernstein JGW, Grant KW. 2009. Auditory and auditory-visual intelligibility of speech in fluctuating maskers for normal-hearing and hearing-impaired listeners. J Acoust Soc Am 125:3358–3372. doi:10.1121/1.3110132
7. Boersma P, Weenink D. 2018. Praat: doing phonetics by computer.
8. Bramhall N, Beach EF, Epp B, Le Prell CG, Lopez-Poveda EA, Plack CJ, Schaette R, Verhulst S, Canlon B. 2019. The search for noise-induced cochlear synaptopathy in humans: Mission impossible? Hear Res 377:88–103. doi:10.1016/j.heares.2019.02.016
9. Burkard R, Hecox K. 1983. The effect of broadband noise on the human brainstem auditory evoked response. I. Rate and intensity effects. J Acoust Soc Am 74:1204–1213. doi:10.1121/1.390024
10. Burkard R, Shi Y, Hecox KE. 1990. A comparison of maximum length and Legendre sequences for the derivation of brain-stem auditory-evoked responses at rapid rates of stimulation. J Acoust Soc Am 87:1656–1664. doi:10.1121/1.399413
11. Carney LH, Li T, McDonough JM. 2015. Speech coding in the brain: representation of vowel formants by midbrain neurons tuned to sound fluctuations. eNeuro 2. doi:10.1523/ENEURO.0004-15.2015
12. Chiappa KH, Gladstone KJ, Young RR. 1979. Brain stem auditory evoked responses: studies of waveform variations in 50 normal human subjects. Arch Neurol 36:81–87. doi:10.1001/archneur.1979.00500380051005
13. Dau T, Wegner O, Mellert V, Kollmeier B. 2000. Auditory brainstem responses with optimized chirp signals compensating basilar-membrane dispersion. J Acoust Soc Am 107:1530–1540. doi:10.1121/1.428438
14. Don M, Allen AR, Starr A. 1977. Effect of click rate on the latency of auditory brain stem responses in humans. Ann Otol Rhinol Laryngol 86:186–195. doi:10.1177/000348947708600209
15. Elberling C, Don M. 2008. Auditory brainstem responses to a chirp stimulus designed from derived-band latencies in normal-hearing subjects. J Acoust Soc Am 124:3022–3037. doi:10.1121/1.2990709
16. Elberling C, Wahlgreen O. 1985. Estimation of auditory brainstem response, ABR, by means of Bayesian inference. Scand Audiol 14:89–96. doi:10.3109/01050398509045928
17. Forte AE, Etard O, Reichenbach T. 2017. The human auditory brainstem response to running speech reveals a subcortical mechanism for selective attention. eLife 6:e27203. doi:10.7554/eLife.27203
18. Geisler CD, Frishkopf LS, Rosenblith WA. 1958. Extracranial responses to acoustic clicks in man. Science 128:1210–1211. doi:10.1126/science.128.3333.1210
19. Goldstein R, Rodman LB. 1967. Early components of averaged evoked responses to rapidly repeated auditory stimuli. J Speech Hear Res 10:697–705. doi:10.1044/jshr.1004.697
20. Gorga MP, Johnson TA, Kaminski JR, Beauchaine KL, Garner CA, Neely ST. 2006. Using a combination of click- and tone burst-evoked auditory brain stem response measurements to estimate pure-tone thresholds. Ear Hear 27:60–74. doi:10.1097/01.aud.0000194511.14740.9c
21. Gorga MP, Kaminski JR, Beauchaine KA, Jesteadt W. 1988. Auditory brainstem responses to tone bursts in normally hearing subjects. J Speech Hear Res 31:87–97. doi:10.1044/jshr.3101.87
22. Gorga MP, Kaminski JR, Beauchaine KL, Bergman BM. 1993. A comparison of auditory brain stem response thresholds and latencies elicited by air- and bone-conducted stimuli. Ear Hear 14:85–94.
23. Gramfort A, Luessi M, Larson E, Engemann DA, Strohmeier D, Brodbeck C, Goj R, Jas M, Brooks T, Parkkonen L, Hämäläinen M. 2013. MEG and EEG data analysis with MNE-Python. Front Neurosci 7. doi:10.3389/fnins.2013.00267
24. Grant KW, Tufts JB, Greenberg S. 2007. Integration efficiency for speech perception within and across sensory modalities by normal-hearing and hearing-impaired individuals. J Acoust Soc Am 121:1164–1176. doi:10.1121/1.2405859
25. Green P, MacLeod CJ. 2016. SIMR: an R package for power analysis of generalized linear mixed models by simulation. Methods Ecol Evol 7:493–498. doi:10.1111/2041-210X.12504
26. Grothe B, Pecka M. 2014. The natural history of sound localization in mammals – a story of neuronal inhibition. Front Neural Circuits 8:116. doi:10.3389/fncir.2014.00116
27. Hashimoto I. 1982. Auditory evoked potentials from the human midbrain: slow brain stem responses. Electroencephalogr Clin Neurophysiol 53:652–657. doi:10.1016/0013-4694(82)90141-9
28. Hyde M. 2008. Ontario Infant Hearing Program audiologic assessment protocol, version 3.1.
29. Jiang ZD, Wu YY, Wilkinson AR. 2009. Age-related changes in BAER at different click rates from neonates to adults. Acta Paediatr 98:1284–1287. doi:10.1111/j.1651-2227.2009.01312.x
30. Kileny P, Paccioretti D, Wilson AF. 1987. Effects of cortical lesions on middle-latency auditory evoked responses (MLR). Electroencephalogr Clin Neurophysiol 66:108–120. doi:10.1016/0013-4694(87)90180-5
31. Kuznetsova A, Brockhoff PB, Christensen RHB. 2017. lmerTest package: tests in linear mixed effects models. J Stat Softw 82:1–26. doi:10.18637/jss.v082.i13
32. Larson E, McCloy D, Maddox R, Pospisil D. 2014. expyfun: Python experimental paradigm functions, version 2.0.0. Zenodo. doi:10.5281/zenodo.11640
33. L’Engle M. 2012. A Wrinkle in Time: 50th Anniversary Commemorative Edition. Macmillan.
34. Liberman MC, Epstein MJ, Cleveland SS, Wang H, Maison SF. 2016. Toward a differential diagnosis of hidden hearing loss in humans. PLOS ONE 11:e0162726. doi:10.1371/journal.pone.0162726
35. Maddox RK, Lee AKC. 2018. Auditory brainstem responses to continuous natural speech in human listeners. eNeuro 5. doi:10.1523/ENEURO.0441-17.2018
36. Mesgarani N, David SV, Fritz JB, Shamma SA. 2009. Influence of context and behavior on stimulus reconstruction from neural activity in primary auditory cortex. J Neurophysiol 102:3329–3339. doi:10.1152/jn.91128.2008
37. Miller L, IV BM, Bishop C. 2017. Frequency-multiplexed speech-sound stimuli for hierarchical neural characterization of speech processing. US20170196519A1.
38. Møller AR, Jannetta PJ. 1983. Interpretation of brainstem auditory evoked potentials: results from intracranial recordings in humans. Scand Audiol 12:125–133.
39. Moore JK. 1987. The human auditory brain stem as a generator of auditory evoked potentials. Hear Res 29:33–43. doi:10.1016/0378-5955(87)90203-6
40. O’Sullivan AE, Lim CY, Lalor EC. 2019. Look at me when I’m talking to you: selective attention at a multisensory cocktail party can be decoded using stimulus reconstruction and alpha power modulations. Eur J Neurosci 50:3282–3295. doi:10.1111/ejn.14425
41. Picton TW, Hillyard SA, Krausz HI, Galambos R. 1974. Human auditory evoked potentials. I. Evaluation of components. Electroencephalogr Clin Neurophysiol 36:179–190. doi:10.1016/0013-4694(74)90155-2
42. Polonenko MJ, Maddox RK. 2019. The parallel auditory brainstem response. Trends Hear 23:2331216519871395. doi:10.1177/2331216519871395
43. Prendergast G, Guest H, Munro KJ, Kluk K, Léger A, Hall DA, Heinz MG, Plack CJ. 2017. Effects of noise exposure on young adults with normal audiograms I: electrophysiology. Hear Res 344:68–81. doi:10.1016/j.heares.2016.10.028
44. R Core Team. 2020. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
45. Rasetshwane DM, Argenyi M, Neely ST, Kopun JG, Gorga MP. 2013. Latency of tone-burst-evoked auditory brain stem responses and otoacoustic emissions: level, frequency, and rise-time effects. J Acoust Soc Am 133:2803–2817. doi:10.1121/1.4798666
46. Saiz-Alía M, Forte AE, Reichenbach T. 2019. Individual differences in the attentional modulation of the human auditory brainstem response to speech inform on speech-in-noise deficits. Sci Rep 9:1–10. doi:10.1038/s41598-019-50773-1
47. Saiz-Alía M, Reichenbach T. 2020. Computational modeling of the auditory brainstem response to continuous speech. J Neural Eng. doi:10.1088/1741-2552/ab970d
48. Scott M. 2007. The Alchemyst: The Secrets of the Immortal Nicholas Flamel, Book 1. New York: Listening Library.
49. Shore SE, Nuttall AL. 1985. High-synchrony cochlear compound action potentials evoked by rising frequency-swept tone bursts. J Acoust Soc Am 78:1286–1295. doi:10.1121/1.392898
50. Stapells DR, Oates P. 1997. Estimation of the pure-tone audiogram by the auditory brainstem response: a review. Audiol Neurotol 2:257–280. doi:10.1159/000259252
51. Starr A, Hamilton AE. 1976. Correlation between confirmed sites of neurological lesions and abnormalities of far-field auditory brainstem responses. Electroencephalogr Clin Neurophysiol 41:595–608. doi:10.1016/0013-4694(76)90005-5
52. Teoh ES, Lalor EC. 2019. EEG decoding of the target speaker in a cocktail party scenario: considerations regarding dynamic switching of talker location. J Neural Eng 16:036017. doi:10.1088/1741-2552/ab0cf1
53. Winer JA. 2005. Decoding the auditory corticofugal systems. Hear Res 207:1–9. doi:10.1016/j.heares.2005.06.007