ABSTRACT
The most prominent acoustic features in speech are intensity modulations, represented by the amplitude envelope of speech. Synchronization of neural activity with these modulations is vital for speech comprehension. As the acoustic modulation of speech is related to the production of syllables, investigations of neural speech tracking rarely distinguish between lower-level acoustic (envelope modulation) and higher-level linguistic (syllable rate) information. Here we manipulated speech intelligibility using noise-vocoded speech and investigated the spectral dynamics of neural speech processing, across two studies at cortical and subcortical levels of the auditory hierarchy, using magnetoencephalography. Overall, cortical regions mostly track the syllable rate, whereas subcortical regions track the acoustic envelope. Furthermore, with less intelligible speech, tracking of the modulation rate becomes more dominant. Our study highlights the importance of distinguishing between envelope modulation and syllable rate and provides novel possibilities to better understand differences between auditory processing and speech/language processing disorders.
INTRODUCTION
Intensity modulations of the acoustic envelope reflect the most prominent feature of the acoustic speech stream. Synchronization of neural activity with these modulations is vital for speech comprehension (1–4). As the acoustic modulation of speech and the production of syllables are correlated (5), investigations of neural speech tracking rarely distinguish between the lower-level acoustic and higher-level linguistic information. However, while the temporal scale of the acoustic modulation (∼4-5 Hz) is remarkably similar across languages, speakers, and speaking conditions (for reviews see 5, 6), the rate at which syllables are produced can vary significantly across (and within) languages (7), dialects and speaking conditions (8). Therefore, it remains unclear whether and how the brain differentially tracks low-level acoustic and linguistic information during natural continuous speech. Distinguishing these aspects more clearly may also be important in gaining a better understanding of the neural processes separating auditory processing disorders (e.g. hearing loss) from language processing disorders (e.g. developmental dyslexia).
The ability to process meaningful information from an acoustic sound stream becomes especially important in difficult listening situations. While some studies indicate a positive relationship between speech intelligibility and the synchronization of brain activity with the speech envelope (neural speech tracking) in the low-frequency range (1–4) others have reported inverse effects (9, 10). A recent study even suggested an inverted u-shaped relationship, where synchronization increases when speech is mildly degraded and decreases as speech becomes unintelligible (11). This wide range of (partly contradicting) results is suggestive of a complex relationship between the intelligibility of speech and the related neural dynamics of speech tracking.
One source of these seeming inconsistencies may be the interpretation of band-limited differences, which conflates periodic (center frequency, power, bandwidth) and aperiodic (offset, exponent) properties of the underlying signal (12). In fact, both the acoustic envelope of speech and electrophysiological measurements of neural activity possess an overall 1/f-like spectrum (13, 14). This 1/f-like pattern is also at times present in the low-frequency coherence/correlation spectrum between both signals (e.g. see 1, 9, 11). Recently, several approaches were proposed to separate periodic from aperiodic components of electrophysiological activity (IRASA (15); FOOOF (12)). We applied one of these approaches (FOOOF) to speech tracking, to parametrize the periodic components underlying low-frequency speech-brain coherence, such as the center frequency, the relative height of the coherence peak, and its bandwidth (∼tuning). Commonly, when investigating neural speech tracking, these parameters are not separated from the aperiodic components of the coherence spectra. Instead, (band-averaged) contrasts over coherence spectra across several experimental conditions are computed, conflating the periodic and aperiodic components underlying speech-brain coherence. We propose that the periodic components (center frequency, relative height of the coherence peak, bandwidth) of speech-brain coherence offer a better estimate of neural speech tracking than broadband speech-brain coherence in the conventional frequency ranges. Therefore, it may be beneficial to investigate these parameters separately to better understand how neural activity tracks lower-level and higher-level information in a continuous speech stream and how this tracking is influenced by speech intelligibility.
Here, we applied this approach to two separate studies in which speech intelligibility was parametrically controlled via vocoding (3-, 7-Channels or no vocoding). Vocoding (16) is a popular technique to manipulate the intelligibility of speech that allows for high parametric control, while only moderately influencing the acoustic envelope of the signal (2). We captured the spectral dynamics of neural speech processing at cortical and subcortical levels (17) of the auditory hierarchy using magnetoencephalography (MEG). We observed that low-frequency speech brain coherence in accordance with previous results (1–4) declines with a decrease in intelligibility. However, parametrization of the coherence spectra revealed that this effect was mainly driven by the aperiodic components. The periodic components that are actually reflective of neural speech tracking (as opposed to band-limited coherence differences) were characterized by a narrower frequency tuning of the low-frequency coherence peak of vocoded speech along with an increase of its center frequency. The latter effect points to a shift of cortical tracking away from the syllabic rate towards the modulation rate of the acoustic envelope as vocoding increased. This effect is also seen for subcortical regions, although tracking is here overall dominated by the modulation rate.
RESULTS
Task performance declines with speech intelligibility
Subjects (N=55 across 2 experiments; Fig. 1B,C) listened to an audiobook (“Das Märchen”; Goethe, 1795) narrated by a female speaker whilst seated in the MEG. Parts of the audiobook presented were noise-vocoded (Fig. 1A; 7-Chan, 3-Chan). Vocoding levels were either kept constant throughout the audio presentation (Study#1; Fig. 1B) or changed intermittently (Study#2; Fig. 1C) to test the influences of vocoded speech on neural speech tracking under two different conditions. At the end of each audio presentation, subjects were presented with two nouns from which they had to pick the one they perceived in the previous sentence. The audio presentations were embedded in blocks that varied between 3.5 and 9 minutes (see Materials & Methods for a detailed account). Due to the overall low number of behavioral responses, we included an additional behavioral assessment (adjusted for each study) to investigate how vocoding influences speech comprehension. The task was similar to the one performed in the actual measurement but consisted of a larger number of shorter trials (24; see Materials & Methods for a detailed account). Due to technical difficulties, only a subset (N=39) of our subjects participated in these assessments.
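The principle of noise vocoding can be illustrated with a minimal sketch: the signal is split into a small number of frequency bands, the envelope of each band is extracted, and each envelope modulates band-limited noise. The function name `noise_vocode`, the band edges (200-7000 Hz, logarithmically spaced), the filter order and the toy input signal are illustrative assumptions and not the settings used in either study.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert


def noise_vocode(audio, fs, n_channels, f_lo=200.0, f_hi=7000.0, seed=0):
    """Replace the fine structure in each band with noise, keeping the band envelope.

    Hypothetical helper: band edges, filter order and RMS matching are assumptions.
    """
    rng = np.random.default_rng(seed)
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)  # log-spaced band edges
    out = np.zeros(audio.size, dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, audio)            # band-limited speech
        envelope = np.abs(hilbert(band))          # amplitude envelope of the band
        carrier = sosfiltfilt(sos, rng.standard_normal(audio.size))  # band-limited noise
        out += envelope * carrier
    # Match the overall level (RMS) of the input signal
    out *= np.sqrt(np.mean(audio ** 2) / np.mean(out ** 2))
    return out


# Toy input: noise with a 4 Hz amplitude modulation standing in for speech
fs = 16000
t = np.arange(0, 1.0, 1 / fs)
rng = np.random.default_rng(1)
audio = (1 + np.sin(2 * np.pi * 4 * t)) * rng.standard_normal(t.size)
vocoded = noise_vocode(audio, fs, n_channels=3)
```

Fewer channels discard more spectral detail while the slow amplitude modulation of each band is retained, which is why this manipulation reduces intelligibility without removing the envelope.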
Task performance declined in both experiments with speech intelligibility, recognizable by a decrease in the mean hit rate. A one-way repeated measures ANOVA across the three conditions revealed a main effect for Study#1 (F(2, 48) = 44.583, pggeisser = 7.35e-09, ηp2 = 0.65) and Study#2 (F(2, 26) = 24.536, p = 1e-06, ηp2 = 0.654). Comparing the different vocoding levels with each other showed higher hit rates for unvocoded stimuli than for stimuli vocoded with 7-Channels (Study#1, z(24) = 2.916, pfdr = 0.0035, d = 0.853; Study#2, z(13) = 2.566, pfdr = 0.0102, d = 1.39) or 3-Channels (Study#1, z(24) = 3.955, pfdr = 7.7e-05, d = 2.151; Study#2, z(13) = 2.720, pfdr = 0.0065, d = 2.280). In turn, stimuli vocoded with 7-Channels showed higher hit rates than stimuli vocoded with 3-Channels (Study#1, z(24) = 3.955, pfdr = 0.0002, d = 1.491; Study#2, z(13) = 2.572, pfdr = 0.0101, d = 1.265). Across all conditions hit rates differed significantly from chance (Study#1, Fig. 1B; Study#2, Fig. 1C): for unvocoded speech (Study#1, z(24) = 4.838, pfdr = 3.932e-06; Study#2, z(13) = 3.742, pfdr = 0.0005), for 7 vocoding channels (Study#1, z(24) = 4.483, pfdr = 1.103e-05; Study#2, z(13) = 3.355, pfdr = 0.0011) and for 3 vocoding channels (Study#1, z(24) = 3.625, pfdr = 0.0003; Study#2, z(13) = 3.105, pfdr = 0.0019). This shows that while speech comprehension gradually decreases with increases in vocoding, speech was still intelligible even when only 3-Channels were used to vocode the presented audio files.
Speech brain coherence declines with speech intelligibility
To investigate how a loss of speech intelligibility via noise-vocoding influences the neural dynamics of speech tracking we measured the coherence between the speech envelope and the related cortical activity (see coherence spectra in Fig. 2A). Comparisons of the coherence spectra across the three conditions (Original, 7-Channels and 3-Channels) using a cluster-corrected repeated-measures ANOVA, revealed a significant difference in the low-frequency range (averaged between 2 and 7Hz) for both Study#1 (p = 0.0004) and Study#2 (p = 9e-05). This difference was strongest in right superior temporal gyrus for both Study#1 and #2. Both in Study#1 and #2 listening to the unaltered audio resulted in the strongest speech-brain coherence, while the stimuli with a strong degradation (3-Channels) elicited the weakest coherence (Fig. 2C & 2F). Listening to the unaltered (“Original”) audio files elicited stronger speech-brain coherence than listening to speech vocoded with 7-Channels in Study#1 (t(27) = 2.519, pfdr = 0.018, d = 0.467) but not in Study#2 (t(26) = 1.425, pfdr = 0.166, d = 0.307). However, listening to the unaltered (“Original”) audio files elicited a stronger coherence than listening to speech in the 3-Channel condition (Study#1, t(27) = 6.083, pfdr = 3e-06, d = 1.623; Study#2, t(26) = 7.451, pfdr = 1.959e-07, d = 1.787). Listening to the 7-Channels condition elicited higher levels of speech-brain coherence than listening to the 3-Channels condition (Study#1, t(27) = 6.238, pfdr = 3e-06, d = 1.446; Study#2, t(26) = 7.021, pfdr = 2.802e-07, d = 1.599).
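As an illustration of the measure used here, the following sketch computes magnitude-squared coherence between a speech envelope and a single neural time series and averages it between 2 and 7 Hz. It uses SciPy's Welch-based `coherence` as a stand-in estimator; the function name `speech_brain_coherence`, the sampling rate and the simulated signals are assumptions for demonstration only and do not reproduce the source-level MEG pipeline.

```python
import numpy as np
from scipy.signal import coherence


def speech_brain_coherence(envelope, neural, fs, win_sec=4.0):
    """Welch-based magnitude-squared coherence between envelope and neural signal."""
    nperseg = int(win_sec * fs)  # 4 s windows give a 0.25 Hz frequency resolution
    freqs, coh = coherence(envelope, neural, fs=fs, nperseg=nperseg)
    return freqs, coh


# Simulated data: a 4 Hz rhythm shared by the envelope and a noisy "neural" channel
rng = np.random.default_rng(0)
fs = 100.0
t = np.arange(0, 120, 1 / fs)
envelope = 1 + 0.5 * np.sin(2 * np.pi * 4 * t) + 0.1 * rng.standard_normal(t.size)
neural = 0.5 * np.sin(2 * np.pi * 4 * t + 0.8) + rng.standard_normal(t.size)

freqs, coh = speech_brain_coherence(envelope, neural, fs)
band = (freqs >= 2) & (freqs <= 7)
low_freq_coh = coh[band].mean()               # band-averaged coherence (2-7 Hz)
peak_freq = freqs[band][np.argmax(coh[band])]  # frequency of the strongest coherence
```

The band average collapses the whole 2-7 Hz range into one number, so it cannot distinguish a shift of the peak from a change in the broadband background of the spectrum.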
In sum, these results show that both intermittent and continuous degradation similarly affect low-frequency speech brain coherence. In both experimental designs, speech brain coherence decreased as speech became less intelligible. Comparing the decrease in coherence through vocoding across studies revealed that this decrease was not different across both studies (U = 297, p = .175, r = .214). At first glance, these results are in conflict with a previous analysis of Study#1 (11). The main difference between the previous and the current analysis of Study#1 can primarily be attributed to different filter settings (lower cut-off for the high-pass filter in the current analysis) during preprocessing that affected the offset and exponent of the speech-brain coherence spectrum differently (see Discussion). In the present study, these changes were applied to allow for better modeling of the periodic and the aperiodic components of the coherence spectrum. Interestingly, further analysis of these components showed that the aperiodic components explain most of the variance (Offset/Exponent; Study#1, r2 = 0.83/0.67; Study#2, r2 = 0.36/0.32) of the averaged (2-7Hz) low-frequency speech-brain coherence in both studies (see Supplementary Materials; Fig. S2). This illustrates that analysing coherence differences in a band-limited range may be strongly influenced by aperiodic differences that do not necessarily reflect neural tracking of sound or linguistic information in the relevant frequency range. Depending on the filter settings, these aperiodic components may heavily impact the results. This observation is especially important for investigations that focus on slow and infraslow modulations and highlights the necessity to separate periodic from aperiodic contributions.
Declining speech intelligibility increases the center frequency of neural speech tracking along with a sharper tuning
Both the speech envelope and electrophysiological signals (recorded using EEG/MEG) are characterized by an overall 1/f-like spectrum (13, 14). This appears to also be evident in the coherence estimation between both signals (independent of the speech-relevant peak at low frequencies; see Fig. 2A,D). To quantify relevant aspects of the periodic components of speech tracking we extracted the most prominent peaks of the coherence spectra in the low-frequency range across all virtual channels in which we observed a significant difference across vocoding levels (see Fig. 2B, E). This was operationalized by using FOOOF (12) to first flatten the coherence spectrum and then compute Gaussian model fits to extract peaks. For each subject, the average relative magnitude of the coherence peak, the bandwidth (∼tuning) and center frequency of the extracted peaks (Fig. 4) were computed and compared within subjects and across the three conditions (Original, 7-Channels and 3-Channels) using a repeated-measure ANOVA.
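The logic of separating periodic from aperiodic components can be sketched as follows. This is a simplified stand-in for FOOOF and not the FOOOF implementation itself: it jointly fits a single aperiodic term (offset and exponent) and one Gaussian peak to a log-transformed spectrum using `scipy.optimize.curve_fit`. The function names, the single-peak restriction, the convention of reporting bandwidth as two standard deviations, and the synthetic spectrum are assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit


def aperiodic(freqs, offset, exponent):
    """1/f-like background in log10 units: offset - exponent * log10(f)."""
    return offset - exponent * np.log10(freqs)


def gaussian(freqs, height, center, width):
    """Single periodic peak on top of the aperiodic background."""
    return height * np.exp(-((freqs - center) ** 2) / (2 * width ** 2))


def full_model(freqs, offset, exponent, height, center, width):
    return aperiodic(freqs, offset, exponent) + gaussian(freqs, height, center, width)


def parametrize(freqs, spectrum):
    """Return aperiodic and periodic parameters of a spectrum with one peak."""
    log_spec = np.log10(spectrum)
    # Starting values: straight-line fit in log-log space, peak from the residual
    slope, intercept = np.polyfit(np.log10(freqs), log_spec, 1)
    flat = log_spec - aperiodic(freqs, intercept, -slope)
    p0 = [intercept, -slope, flat.max(), freqs[np.argmax(flat)], 1.0]
    params, _ = curve_fit(full_model, freqs, log_spec, p0=p0)
    offset, exponent, height, center, width = params
    return {
        "offset": offset,
        "exponent": exponent,
        "peak_height": height,           # height relative to the aperiodic fit
        "center_frequency": center,
        "bandwidth": 2 * abs(width),      # assumed convention: two standard deviations
    }


# Synthetic spectrum: 1/f background plus a peak at 5 Hz
rng = np.random.default_rng(0)
freqs = np.arange(1, 15.25, 0.25)
log_spec = full_model(freqs, -1.0, 1.0, 0.4, 5.0, 1.0) + 0.01 * rng.standard_normal(freqs.size)
fit = parametrize(freqs, 10 ** log_spec)
```

In this parametrization a change in offset or exponent alters band-averaged values without touching the peak parameters, which is the distinction the analysis relies on.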
This analysis showed that the actual magnitude of the extracted peaks did not differ across the three vocoding conditions in both studies (Study#1, F(2, 54) = 0.522, p = 0.596, ηp2 = 0.019; Study#2, F(2, 50) = 2.18, p = 0.124, ηp2 = 0.08). However, we noticed a significant difference across the center frequencies of the detected peaks over the three conditions in both studies (Study#1, F(2, 54) = 48.628, p = 8.365e-13, ηp2 = 0.643; Study#2, F(2, 50) = 5.28, p = 0.008, ηp2 = 0.175). Comparing the different vocoding levels with each other showed lower center frequencies for unvocoded stimuli than for stimuli vocoded with 7-Channels (Study#1, t(27) = −7.122, pfdr = 1.753e-07, d = −1.271; Study#2, t(25) = −2.756, pfdr= 0.016, d = −0.613) and with 3-Channels (Study#1, t(27) = −8.797, pfdr = 6.18e-09, d = −1.918; Study#2, t(25) = −2.946, pfdr = 0.0161, d = −0.7). The two vocoding conditions did differ significantly from each other in Study#1 (t(27) = −3.227, pfdr = 3.273e-03, d = −0.544) but not in Study#2 (t(25) = −0.114, pfdr = 0.91, d = −0.023) with lower center frequencies for speech vocoded with 7-Channels compared to speech vocoded with 3-Channels.
For the bandwidth of the detected peaks, differences across the three conditions were also observed both in Study#1 (F(2, 54) = 18.808, p = 6.329e-07, ηp2 = 0.411) and Study#2 (F(2, 50) = 5.444, p = 0.007, ηp2 = 0.179). In the continuous design (Study#1), the tuning bandwidth for unvocoded stimuli was broader than for stimuli vocoded with 7-Channels (t(27) = 3.219, pfdr = 0.003, d = 0.666) and with 3-Channels (t(27) = 5.196, pfdr = 5.4e-05, d = 1.422). In the intermittent design (Study#2), the direction of the effect was similar, yet only significant for the difference between unvocoded speech and speech vocoded with 3-Channels (t(25) = 3.398, pfdr = 0.007, d = 0.983) and not for the difference between unvocoded speech and speech vocoded with 7-Channels (t(25) = 0.699, pfdr = 0.491, d = 0.201). Speech vocoded with 7-Channels had a broader tuning bandwidth than speech vocoded with 3-Channels across both studies (Study#1, t(27) = 3.592, pfdr = 0.002, d = 0.758; Study#2, t(25) = 2.668, pfdr = 0.02, d = 0.774).
In sum, these results show that intermittent and continuous degradation similarly affect the periodic components of speech-brain coherence that are putatively reflective of neural speech tracking. Interestingly, the difference between speech tracking across different levels of intelligibility was not driven by the relative height of the peak in the coherence spectrum, but rather by a sharper tuning (Fig. 4; Bandwidth) combined with an increase of center frequencies of the coherence spectra (Fig. 4; Center Frequency).
Neural speech tracking shifts from syllabic to modulation rate as speech intelligibility decreases
As speech intelligibility decreased, we noted an increase of the center frequencies of speech-brain coherence. We also extracted the center frequencies of the modulation spectra from the acoustic envelopes of the audiobook for the three conditions (Original, 7-Channels, 3-Channels as in (9); see Fig. 4A) and computed the realized syllable rate of the presented audiobook (18). Although there was generally a strong overlap over the modulation spectra of the speaker across vocoding levels (see Fig. 1A), a one-way repeated measures ANOVA across the extracted center frequencies and the syllable rate of the audio signal revealed a significant main effect (F(3, 1098) = 454.104, p = 2.68e-175, ηp2 = 0.554). All conditions differed significantly from each other (see Fig. 4 & Table S1 for a related post-hoc analysis). The rate at which the syllables were produced (Mdn = 4 Hz) was lower than the center frequencies of the modulation spectra of the audio signal in the 3-Channels (Mdn = 5.16 Hz), 7-Channels (Mdn = 5.5 Hz) and clear speech (Mdn = 6.16 Hz) conditions. The increase in center frequencies of speech-brain coherence along with the differences in modulation and syllable rates suggests that the brain may be driven more by acoustic or linguistic information depending on the signal quality. This is intuitive, as with increased vocoding it also becomes more difficult to extract linguistically meaningful information such as phrase boundaries or syllables. This mainly leaves the modulation intensities of the acoustic speech envelope as an information source to the listener. The following analysis aims at addressing this point more directly.
We trained and tested an ensemble of k-nearest neighbor classifiers to test whether neural speech tracking shifts from the syllabic (linguistic information) to the modulation rate (acoustic information) as speech becomes less intelligible. This analysis was performed in a nested 5-fold cross-validation (see Methods for a detailed account) to differentiate between the center frequencies of the modulation spectra for the three conditions (Original, 7-Channels, 3-Channels) and the realized syllable rate of the speaker (see Fig. 4A). The results of the nested cross-validation procedure (Fig. 4B) show that the classifiers can predict with a high accuracy whether a given frequency in hertz can be related either to the modulation or realized syllable rate of the speaker. We then used the weights of these classifiers to predict whether the extracted center frequencies of speech brain coherence were related more closely to the realized syllable rate or the modulation rate of our speaker. This analysis showed that in the unaltered clear speech condition, neural speech tracking was closely related to the syllable rate. However, as intelligibility decreases the probability that the classifiers predict that a given center frequency is related rather to the modulation as opposed to the syllabic rate increases.
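The idea behind this decoding step can be sketched with a single k-nearest-neighbor classifier operating on one feature, the frequency in hertz. The sketch is a deliberate simplification: it uses one classifier rather than an ensemble and plain 5-fold rather than nested cross-validation. The function names, the value of k, and the simulated training frequencies (drawn around 4 Hz and 5.5 Hz) are assumptions for illustration, not the study's data.

```python
import numpy as np


def knn_modulation_probability(train_freqs, train_labels, query_freqs, k=5):
    """Fraction of the k nearest training frequencies labeled 'modulation' (1)."""
    train_freqs = np.asarray(train_freqs, dtype=float)
    train_labels = np.asarray(train_labels, dtype=int)
    query_freqs = np.atleast_1d(np.asarray(query_freqs, dtype=float))
    dist = np.abs(query_freqs[:, None] - train_freqs[None, :])
    nearest = np.argsort(dist, axis=1)[:, :k]
    return train_labels[nearest].mean(axis=1)


def cross_validated_accuracy(freqs, labels, k=5, n_folds=5, seed=0):
    """Plain k-fold cross-validation (not nested) of the classifier above."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(freqs))
    correct = 0
    for test_idx in np.array_split(idx, n_folds):
        train_idx = np.setdiff1d(idx, test_idx)
        prob = knn_modulation_probability(
            freqs[train_idx], labels[train_idx], freqs[test_idx], k
        )
        correct += np.sum((prob > 0.5) == labels[test_idx].astype(bool))
    return correct / len(freqs)


# Simulated training data: syllable rates (label 0) and modulation rates (label 1)
rng = np.random.default_rng(1)
syllable = rng.normal(4.0, 0.4, 200)
modulation = rng.normal(5.5, 0.4, 200)
freqs = np.concatenate([syllable, modulation])
labels = np.concatenate([np.zeros(200, dtype=int), np.ones(200, dtype=int)])

accuracy = cross_validated_accuracy(freqs, labels)
# Hypothetical center frequencies of speech-brain coherence peaks
p_mod = knn_modulation_probability(freqs, labels, [3.8, 6.0])
```

A center frequency near the syllable rate yields a low modulation probability, and one near the modulation rate yields a high probability, which is how a shift in center frequency translates into a shift in the classifier output.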
The results of a two-way repeated measures ANOVA revealed that there was a significant main effect for the factor tracking (Modulation/Syllable rate) in both studies (Study#1 F(1, 27) = 16.175, pggeisser = 0.0004, ηp2 = 0.375; Study#2 F(1, 25) = 18.999, pggeisser = 0.0002, ηp2 = 0.432). The probability that neural speech tracking is reflective of the syllable rate (linguistic component) was overall higher than the probability that it is reflective of the modulation rate (acoustic component). There was no significant main effect of vocoding (Original, 7-Channels, 3-Channels; Study#1 F(2, 54) = 0, p = 1; Study#2 F(2, 48) = 0, p = 1). This is intuitive, as the overall probability in each condition is 0.5 when ignoring the factor tracking (Modulation/Syllable rate). However, there was a significant interaction effect for the factors tracking (Modulation/Syllable rate) and vocoding (Original, 7-Channels, 3-Channels) across both studies (Study#1 F(2, 54) = 47.340, pggeisser = 1.387e-11, ηp2 = 0.637; Study#2 F(2, 50) = 10.235, pggeisser = 0.0006, ηp2 = 0.29). This suggests that as speech intelligibility decreases and less linguistically meaningful information is present, neural speech tracking starts to drift away from the syllabic rate towards the modulation rate of speech.
Modelling of subcortical activity reveals a predominant tracking of the modulation rate of speech
Recent studies using non-invasive electrophysiology have shown that auditory activity at putative subcortical processing stages can be measured for complex natural sounds (such as speech; 19–22). Furthermore, this subcortical activity can even be modulated by attention (19, 20, 23). Interestingly, top-down attentional modulations of auditory activity can already be detected at the hair cells in the inner ear measured as otoacoustic activity (faint sounds emitted by the outer hair cells; see 24). Other studies have shown that even subcortical nuclei on the auditory pathway are behaviorally relevant for speech recognition (medial geniculate bodies; 25). Using a recently developed modeling procedure (17), we further aimed to investigate whether differences in speech intelligibility can be already observed at putative subcortical processing stages.
We used a localizer measurement (17) to compute individualized weights (per subject; note that the localizer was only available for Study#2). These weights reflect activity along the auditory hierarchy, resulting in 100 virtual channels ranging putatively from the auditory nerve (channels 0-20) to early thalamo(-cortical) processing stages (channels 90-100). We then applied these weights (see Materials & Methods: Modelling of subcortical auditory activity) to the epoched data from Study#2 to infer activity along the auditory hierarchy (see spectral distribution in Fig. 5A). A cluster-corrected repeated-measures ANOVA across the three conditions (Original, 7-Channels and 3-Channels) and within subjects revealed a significant difference in the low frequency range (2-7Hz) between virtual channels that are reflective of subcortical activity at early stages of auditory processing (putatively auditory nerve/cochlear nucleus, p = 0.0045). Listening to the unaltered (“Original”) audio files elicited higher speech-brain coherence than listening to the 7-Channels (t(24) = 3.2, pfdr = 0.005, d = 0.798) and the 3-Channels condition (t(24) = 4.282, pfdr = 0.0008, d = 1.212). However, the two vocoding conditions did not differ significantly from each other (t(24) = 1.547, pfdr = 0.135, d = 0.488).
We further investigated the periodic components that are reflective of speech tracking by extracting peaks from the coherence spectra to analyse the corresponding magnitude of the coherence peak, the bandwidth and center frequencies. A repeated-measures ANOVA across conditions (Original, 7-Channels and 3-Channels) and within subjects revealed no significant differences for the relative magnitude of the coherence peak (F(2, 48) = 0.335, p = 0.717, ηp2 = 0.014) and the bandwidth of the extracted peaks (F(2, 48) = 0.192, p = 0.826, ηp2 = 0.008). However, significant differences were found across conditions for the center frequencies of the peaks (F(2, 48) = 3.213, p = 0.049, ηp2 = 0.118). Listening to the unaltered (“Original”) audio files was associated with significantly lower center frequencies than listening to the 3-Channel condition (t(24) = −3.062, pfdr = 0.016, d = −0.664) but not than listening to the 7-Channel condition (t(24) = −0.767, pfdr = 0.450, d = −0.204) at subcortical processing stages. The two vocoding conditions did not differ significantly from each other (t(24) = −1.528, pfdr = 0.21, d = −0.425).
We applied the pre-trained classifiers (see Fig. 4) to detect whether the tracking of speech at putatively early auditory processing stages could be related to either the modulation rate of speech or the syllabic rate. We found that modelling of the related subcortical activity reveals predominantly a tracking of the acoustic modulation rate of speech (F(2, 48) = 23.220, p = 6.6e-05, ηp2 = 0.492), contrary to the previous analysis mainly reflecting cortical effects (see Fig. 4). However, similar to the previous analysis reflecting mainly cortical activity, there was an interaction effect between tracking and vocoding: with decreasing intelligibility, the probability increases that the classifiers relate the center frequency to the modulation rate as opposed to the syllabic rate (F(2, 48) = 6.947, p = 0.00224, ηp2 = 0.224).
In sum, these results suggest that differences in speech tracking between clear and vocoded stimuli already arise at subcortical processing stages. This difference in neural speech tracking occurs at virtual channels that can be associated with subcortical activity between the auditory nerve and cochlear nucleus. The extracted peaks at this level of processing did not differ significantly from each other regarding the relative peak height of the coherence (similar to cortical observations; see Fig. 4) and their tuning width (different to cortical observations; see Fig. 4). Yet, the center frequency shift of these peaks showed a similar effect when compared to cortical processing stages. As intelligibility decreased, the probability that the center frequencies could be related to the modulation rate as opposed to the syllable rate increased steadily. However, contrary to the cortical recordings, tracking across different levels of intelligibility at subcortical processing stages was predominantly related to the modulation rate of speech rather than the syllabic rate. This shows that although tracking at a subcortical level is dominated by lower-level acoustic envelope modulations, intelligibility also influences these hierarchically early responses.
DISCUSSION
Speech tracking is modulated by the intelligibility of the sensory input. However, the pattern of that modulation (frequently operationalized by band-limited coherence effects) is not consistent across studies (see e.g. 9, 11, 26). This complicates a mechanistic understanding of how speech tracking actually supports speech comprehension. Applying a method to separate periodic from aperiodic components in the coherence spectrum, our results yield a differentiated picture, indicating that intelligibility affects tuning-width and center frequency of the periodic components in the low frequency range.
Band-limited speech-brain coherence declines with speech intelligibility
Here, we investigated the effects degraded speech has on the neural dynamics of speech tracking using data from two slightly different experimental paradigms. In Study#1 speech was presented continuously at one of three different levels of intelligibility (Original, 7-Channels, 3-Channels; ∼15s-3min). In Study#2 segments of degraded speech (7-Channels, 3-Channels; ∼6-18s) were embedded in a clear audio stream (Original; ∼1-3min). As both studies produced comparable results, they will be discussed together. We observed in accordance with previous results (1–4) that low frequency speech-brain coherence declines with a decrease in intelligibility. However, other studies have reported a variety of partly contradicting results (9–11). Our present results show that the reported band-limited coherence spectra are very strongly related to the underlying aperiodic components in the spectrum (see supplementary material). Since the field is mostly interested in neural tracking of (relatively) periodic speech features around the syllable rate, it is questionable whether band-limited coherences without consideration of the aperiodic components are a viable measure for neural speech tracking.
Neural speech tracking shifts from syllabic to modulation rate as speech intelligibility decreases
Interestingly, in the investigation of spectral power differences in electrophysiological signals, a variety of contradicting results is also commonly reported for band-limited effects. This appears to be caused by the conflation of periodic (center frequency, power, bandwidth) and aperiodic (offset, exponent) properties of the underlying signal (12). This is deemed problematic as periodic and aperiodic components of the signal can be linked to a variety of different effects (12). Both the acoustic envelope of speech and electrophysiological measurements of neural activity possess an overall (aperiodic) 1/f-like spectrum (13, 14). This 1/f-like pattern is at times also found in the low-frequency coherence/correlation spectrum between both signals (e.g. see 1, 9, 11). We therefore decomposed the speech-brain coherence spectra into their periodic and aperiodic components using FOOOF (12), to better understand the relationship between the intelligibility of speech and the related neural dynamics of speech tracking. Interestingly, these investigations revealed that the aperiodic components (offset & exponent) explained most of the variance observed in the coherence difference (at 2-7 Hz; see Fig. 2) across vocoding levels (see Fig. S2). This highlights the importance of separating periodic from aperiodic components in the speech-brain coherence spectra, as we were primarily interested in investigating peaks in the coherence spectra (periodic components) that can be related to neural speech tracking. Further investigations of the periodic components of the low frequency coherence peak (center frequency, relative magnitude, bandwidth) revealed that there was no difference across vocoding levels in the relative magnitude of the coherence peak. Instead, the differences in neural speech tracking were caused by a sharpening in the frequency tuning of the coherence peak of vocoded speech along with an increase of the center frequencies of the observed peaks (see Fig. 3).
Using a decoding analysis, we were able to link the increase of the center frequencies to a shift in tracking from higher-level linguistic to lower-level acoustic information of the speech stream. Our analysis showed that as intelligibility decreases the probability that tracking is related to the modulation (acoustic) rather than the syllabic rate (linguistic) of speech increases. This is intuitive, as with decreased intelligibility it also becomes more difficult to extract linguistically meaningful information such as phrase boundaries or syllables. This mainly leaves the modulation intensities of the acoustic speech envelope as an information source to the listener. As the acoustic modulation of speech is closely related to the production of syllables (5), investigations of neural speech tracking are typically not making a distinction between lower-level acoustic and higher-level linguistic information on the level of syllable processing. However, while the modulation rate (acoustic property) of speech appears to be exceptionally stable across languages and speaking conditions (5, 6), the syllable rate (linguistic property) of speech differs depending on the language and the speaking conditions (7, 8). This suggests that modulation rate and syllable rate are not terms that can be necessarily used interchangeably. Therefore, distinguishing these properties more clearly may be important to gain a better understanding of the neural processes separating auditory processing disorders (e.g. hearing loss) from language processing disorders (e.g. developmental dyslexia), which has been difficult based solely on neural speech tracking. This difficulty may be linked to the variety of (partly contradicting) results within and across auditory/linguistic processing disorders that relate to the neural dynamics of speech tracking. 
While a recent study was able to link hearing loss to a relative increase in speech envelope tracking (compared to (age matched) normal hearing listeners (27)), previous studies could not report enhanced envelope tracking in individuals with a hearing impairment (28, 29). Related to language proficiency, similar inconsistencies are reported, as non-native speakers appear to show an increased envelope tracking compared to native speakers (10, 30). On the other hand, individuals suffering from developmental dyslexia are reported to have lower synchronization with the speech envelope compared with neurotypical individuals (31). This range of (partly contradicting) results again highlights the complex relationship between the intelligibility of speech and the related neural dynamics of speech tracking. Using the approach proposed here of decomposing coherence spectra into their periodic and aperiodic components, it should be possible to gain a more fine-grained view on the specific characteristics underlying the neural dynamics of speech tracking. This may help in the future to better differentiate the neural signatures of individuals suffering from auditory processing or language processing disorders.
Declining speech intelligibility goes along with a sharper frequency tuning
Apart from the intelligibility-dependent changes in the center frequencies of the coherence peaks, we also noted a wider frequency tuning of speech tracking in clear as opposed to vocoded speech. The width of this frequency tuning decreased with a loss in intelligibility. As the syllabic rate of our speaker (∼4 Hz) differed from the modulation rate of her speech stream (∼5-6 Hz; see Fig. 4), the narrowing in tuning may also be related to a loss of linguistically meaningful information. This might suggest that in situations where speech is clear, both linguistic (syllable rate) and acoustic information (modulation rate) are tracked, resulting in an increased bandwidth covering all relevant frequencies. As speech becomes less intelligible and it becomes harder to extract linguistically meaningful information, the bandwidth of the coherence peak narrows around the higher frequencies of the residual acoustic modulation of speech. Furthermore, previous studies have shown that auditory selective attention effects may arise from an enhanced tuning of receptive fields of task-relevant neural populations (32, 33). Therefore, the observed narrower frequency tuning could also be related to enhanced top-down auditory attention processes (34) in situations where listening becomes more challenging.
Influences of aperiodic components on the neural dynamics of speech tracking
Investigating the parameters related to low-frequency peaks in measurements of speech-brain coherence offers a new and unique perspective for understanding the neural dynamics underlying speech tracking. However, we also noticed that band-limited differences in the speech-brain coherence spectra are strongly related to the underlying aperiodic components. This highlights the importance of separating periodic from aperiodic components, as the two can be linked to a variety of different effects (12). Commonly, aperiodic components of most signals have been considered noise and, as such, are often simply removed from the overall signal. Especially for low-frequency activity, this can be easily achieved by spectrally normalising (whitening) the signal via filtering (e.g. see 35). Different choices in filter settings, however, can also accentuate different properties of a signal. For instance, in this study we reanalysed data from a recent study (Study#1 (11)) using a larger time window for the coherence estimation (4 s instead of 2 s; to obtain a better frequency resolution for low-frequency speech tracking) and a lower cut-off for the high-pass filter (0.1 Hz instead of 1 Hz). These changes were intended to improve the model fit of FOOOF for the low-frequency coherence spectra, but also resulted in a different pattern for low-frequency speech-brain coherence (compare Fig. 2ABC with Fig. 2ab in (11)). The previous analysis of Study#1 (11) showed that neural speech tracking increases for mild decreases in intelligibility (putatively driven by an increased listening effort) and then decreases as speech becomes increasingly unintelligible. We now show that low-frequency speech tracking gradually decreases with intelligibility. This difference was mainly driven by the changes in filter settings, which accentuate different properties of the signal, putatively by differently influencing the 1/f-like pattern of low-frequency speech-brain coherence.
Similar to the analysis of power spectral densities, 1/f-like patterns in the coherence spectra also appear to play a striking role when computing statistics across experimental conditions (see S2 for a comparison of slope and offset for the data analysed in the present study). Whether or not 1/f-like patterns carry meaningful information in general is heavily debated. Nevertheless, recent studies have shown that 1/f-like patterns in electrophysiological power spectra can change depending on both trait-like factors (age (36), ADHD (37) and schizophrenia (38)) and state-like factors (e.g. differences across cognitive and perceptual states (39, 40)). This suggests a physiologically meaningful underpinning of 1/f-like neural activity. However, interpretations related to the aperiodic patterns found in low-frequency speech-brain coherence go beyond the scope of the present study, as we mainly focused on the distinction between the processing of the syllabic rate and the modulation rate of speech related to peaks in the speech-brain coherence spectra (periodic components). Perhaps aperiodic components of speech-brain coherence are modulated by slower components in the speech stream reflecting higher-level information (e.g. sentence or phrasal information) that are increasingly lost as intelligibility declines. Addressing this question should be the topic of future investigations using paradigms in which these features are parametrically controlled (41). Still, the present study illustrates that coherence analysed in a band-limited range may not reflect neural tracking of sound or linguistic information in the relevant frequency range, even though this is more or less explicitly assumed. Instead, depending on the filter settings, the aperiodic components may heavily impact the results. This is especially important for investigations that focus on slow and infraslow modulations.
Modelling of subcortical activity reveals a predominant tracking of the modulation spectrum of speech
Previous research has shown that not only cortical but also subcortical regions play an important role in language processing (42). These subcortical regions even appear to be behaviorally relevant for speech recognition (medial geniculate bodies; 25). Here, we generated individualized spatial filters reflective of subcortical auditory processing using a localizer measurement (17). In principle, these filters can be applied to a separate measurement to infer subcortical auditory activity. Using this modeling procedure, we aimed to investigate whether differences in speech intelligibility can already be observed at putative subcortical processing stages. Similar to the activity from cortical processing stages, we noticed a shift in the center frequency of the extracted peaks. As intelligibility decreased, the center frequencies of the detected peaks increased steadily. However, contrary to the cortical recordings, the applied decoding analysis showed that the center frequencies of the speech-brain coherence peaks (reflecting neural speech tracking) across different levels of intelligibility at subcortical processing stages were predominantly related to the modulation rate of speech as opposed to the syllabic rate. This shows that although tracking at a subcortical level is overall higher for the low-level acoustic envelope modulation, intelligibility also influences these hierarchically early responses (see Fig. 5D). This highlights the potentially important yet often overlooked role of subcortical nuclei in speech and language processing.
CONCLUSION
In this study, we introduce a novel way to investigate neural speech tracking by utilizing an approach recently introduced to parametrize electrophysiological power spectra (12). Our results show that cortical regions mostly track the syllable rate, whereas subcortical regions are driven by the acoustic modulation rate. Furthermore, the less intelligible speech becomes, the more dominant the tracking of the modulation rate becomes. Our study underlines the importance of making a distinction between the acoustic modulation and syllable rate of speech and provides novel possibilities to better understand differences between auditory processing and speech/language processing disorders. In general, parametrization of coherence spectra may offer a new and unique perspective to investigate the parameters that drive neural speech tracking across a variety of listening situations.
MATERIALS & METHODS
Subjects
Twenty-eight individuals participated in Study#1 (female = 17, male = 11). Mean age was 23.82 years (standard deviation, SD = 3.71) with a range between 19 and 37 years. In Study#2 twenty-seven individuals participated (female = 11, male = 16). Due to technical difficulties one subject was removed from Study#2. Mean age was 23.38 years (SD = 4.15) with a range between 19 and 38 years. Across both studies we recruited only German native speakers and people who were suitable for MEG recordings, that is, without nonremovable ferromagnetic metals in or close to the body. Participants provided informed consent and were compensated monetarily or via course credit. Participation was voluntary and in line with the declaration of Helsinki and the statutes of the University of Salzburg. The study was approved by the ethical committee of the University of Salzburg.
Stimuli
For the MEG recording, audio files were extracted from audio–visual recordings of a female speaker reading Goethe’s “Das Märchen” (“The Tale”; 1795). In Study#1, lengths of 12 stimuli varied between approximately 15 s and 3 min, with two stimuli of 15, 30, 60, 90, 120 and 150 s, and 6 of 180 s. Stimuli were presented in 3 blocks with 4 stimuli in each block. In Study#2, two or three segments of degraded speech (7-Channels, 3-Channels; 4.8-21.6 s) were embedded in 15 clear audio streams. The lengths of the 15 stimuli varied between 60 s and 3 min with two stimuli of 60, 90, 120, and 9 of 180 s. Stimuli were presented in 5 blocks with 3 stimuli in each block. In both studies, each stimulus ended with a two-syllable noun within the last four words. In order to keep participants’ attention on the stimulation, we asked participants after each stimulus to choose from two presented two-syllable nouns, the one that had occurred within the last four words of a sentence. The sequence of all the audio stimuli was randomized across participants, not following the original storyline of the audiobook. The syllable rate of the stimuli varied between 3.1 and 4.3 Hz with a median of 4 Hz (estimated using Praat (18)).
Vocoding
Noise-vocoding of all audio stimuli was done using the vocoder toolbox for MATLAB (43), and we created conditions with 7 and 3 channels (Fig. 1A). Vocoding for both studies was performed as described in (11). For the vocoding, the waveform of each audio stimulus was passed through two Butterworth analysis filters (for 7 and 3 channels) with a range of 200–7,000 Hz representing equal distances along the basilar membrane. Amplitude envelope extraction was done with half-wave rectification and low-pass filtering at 250 Hz. The envelopes were then normalized in each channel and multiplied with the carrier. Then, they were filtered in the band and the RMS of the resulting signal was adjusted to that of the original signal filtered in that same band. Auditory stimuli were presented binaurally using MEG-compatible pneumatic in-ear headphones (SOUNDPixx, VPixx technologies).
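The vocoding steps described above can be sketched in Python as follows. This is a minimal illustration and not the MATLAB toolbox (43) used in the study: the Greenwood map constants, the 4th-order zero-phase Butterworth filters, the white-noise carrier and the function names `greenwood_edges` and `noise_vocode` are assumptions made for the example, and the per-channel envelope normalization is folded into the final RMS matching.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def greenwood_edges(n_channels, f_lo=200.0, f_hi=7000.0):
    """Band edges (Hz) equally spaced along the basilar membrane."""
    # Greenwood map constants for the human cochlea (an assumption here).
    A, a, k = 165.4, 2.1, 0.88
    x_lo = np.log10(f_lo / A + k) / a
    x_hi = np.log10(f_hi / A + k) / a
    positions = np.linspace(x_lo, x_hi, n_channels + 1)
    return A * (10.0 ** (a * positions) - k)


def noise_vocode(signal, fs, n_channels, env_cutoff=250.0, seed=0):
    """Return a noise-vocoded copy of `signal` (1-D array sampled at fs Hz)."""
    rng = np.random.default_rng(seed)
    signal = np.asarray(signal, dtype=float)
    edges = greenwood_edges(n_channels)
    env_sos = butter(4, env_cutoff, btype="low", fs=fs, output="sos")
    out = np.zeros_like(signal)
    for lo, hi in zip(edges[:-1], edges[1:]):
        band_sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfiltfilt(band_sos, signal)
        # Envelope: half-wave rectification followed by low-pass filtering.
        env = sosfiltfilt(env_sos, np.maximum(band, 0.0))
        env = np.maximum(env, 0.0)
        # Modulate a noise carrier and restrict it to the analysis band.
        carrier = rng.standard_normal(signal.size)
        vocoded = sosfiltfilt(band_sos, env * carrier)
        # Match the RMS of the original signal filtered in the same band.
        rms_band = np.sqrt(np.mean(band ** 2))
        rms_voc = np.sqrt(np.mean(vocoded ** 2))
        if rms_voc > 0:
            vocoded = vocoded * (rms_band / rms_voc)
        out += vocoded
    return out


# Hypothetical test signal: a 440 Hz tone with a 4 Hz amplitude modulation.
fs = 16000
t = np.arange(fs) / fs
demo = np.sin(2 * np.pi * 440 * t) * (1 + np.sin(2 * np.pi * 4 * t))
vocoded_7 = noise_vocode(demo, fs, n_channels=7)
vocoded_3 = noise_vocode(demo, fs, n_channels=3)
```

Fewer channels leave the slow envelope largely intact while removing spectral detail, which is why the 3-channel condition is the least intelligible.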
Behavioral Assessment
Due to the low number of behavioral responses from the MEG part, we added an additional behavioral assessment. For Study#1 and Study#2, 24 audio files were created from recordings of another female native German speaker reading Antoine de Saint-Exupéry’s “The Little Prince” (1943). Each stimulus contained one sentence (length between 2 and 15 s) and was presented either unvocoded, with 7-channel vocoding or with 3-channel vocoding. For Study#1, the stimuli in the vocoding conditions were vocoded from start to end; for Study#2, the stimuli were vocoded only in the last 0.6-5 s. Comparable to both MEG experiments, the stimuli also ended with a two-syllable noun within the last four words, and participants were asked to choose the last noun they heard from two nouns presented on the screen. The sequence of all audio stimuli was randomized across participants and did not follow the storyline. In each study, the hit rates across the three vocoding conditions were compared using one-way repeated-measures ANOVAs. Post hoc analysis was performed using FDR-corrected (44) Wilcoxon signed-rank tests (as the assumptions for paired-samples t-tests were violated).
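The post hoc procedure can be illustrated with a short Python sketch. The hit rates below are simulated and hypothetical, the helper `fdr_bh` is written for this example, and the Benjamini-Hochberg step-up procedure is assumed as the FDR method; the study itself reports using the implementation in Pingouin (53).

```python
from itertools import combinations

import numpy as np
from scipy.stats import wilcoxon


def fdr_bh(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted


# Hypothetical hit rates for 27 listeners (not the study's data).
rng = np.random.default_rng(1)
hit_rates = {
    "Original": np.clip(rng.normal(0.95, 0.04, 27), 0, 1),
    "7-Channels": np.clip(rng.normal(0.85, 0.08, 27), 0, 1),
    "3-Channels": np.clip(rng.normal(0.60, 0.10, 27), 0, 1),
}

# Pairwise Wilcoxon signed-rank tests, then FDR correction across the pairs.
pairs = list(combinations(hit_rates, 2))
raw_p = [wilcoxon(hit_rates[a], hit_rates[b]).pvalue for a, b in pairs]
adj_p = fdr_bh(raw_p)
for (a, b), p in zip(pairs, adj_p):
    print(f"{a} vs {b}: FDR-adjusted p = {p:.4g}")
```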
Data Acquisition
Data acquisition and parts of the data analysis for Study#1 and #2 closely resemble, with minor exceptions, those described in two previous studies (11, 45). Magnetic brain activity was recorded using a 306-channel whole-head MEG system (TRIUX, Elekta Oy, Finland) with a sampling rate of 1 kHz for the main experiments (Study#1 and Study#2) and with a sampling rate of 10 kHz for the brainstem localizer in Study#2 (see Modeling of subcortical auditory activity for further information). The system consists of 204 planar gradiometers and 102 magnetometers. Before entering the magnetically shielded room (AK3B, Vakuumschmelze, Hanau, Germany), the head shape of each participant was acquired with >300 digitized points on the scalp, including fiducials (nasion, left and right pre-auricular points), using a Polhemus FASTRAK system (Polhemus, Vermont, USA). The auditory brainstem response was measured with a single electrode located at FpZ based on the electrode placement of the international 10–20-System (46). A ground electrode was placed on the forehead at midline and a reference electrode on the clavicle of the participants.
Data Analysis
Preprocessing
All data analysis steps for Study #1 and #2 were performed similarly and are therefore reported together. The acquired data was Maxwell-filtered using a Signal Space Separation (SSS) algorithm (47) implemented in the Maxfilter program (version 2.2.15) provided by the MEG manufacturer to remove external magnetic interference from the MEG signal and realign data to a common standard head position (-trans default Maxfilter parameter). The Maxwell-filtered and continuous data was then further analysed using the FieldTrip toolbox (48) and custom built Matlab routines. First, the data was high-pass filtered at 0.1 Hz using a finite impulse response (FIR) filter (Kaiser window). For extracting physiological artefacts from the data, 50 independent components were calculated from the filtered data. Via visual inspection, the components showing eye-movements & heartbeats were removed from the data. On average across studies, 3 components were removed per subject (SD = 1). Then, trials related to each of the three conditions (Original, 7-Channels and 3-Channels) were defined. The acoustic speech envelope was extracted and aligned with the measured MEG data (11). Afterwards data was cut into segments of 4 seconds to increase signal-to-noise ratio.
Source Analysis
Anatomical template images were warped to the individual head shape and brought into a common space by co-registering them based on the three anatomical landmarks (nasion, left and right preauricular points) with a standard brain from the Montreal Neurological Institute (MNI, Montreal, Canada) (49). Afterwards a single-shell head model (50) was computed for each participant. As a source model, a grid with 1 cm resolution and 2982 voxels based on an MNI template brain was morphed into the brain volume of each participant. This allows group-level averaging and statistical analysis as all the grid points in the warped grid belong to the same brain region across subjects. Common linearly constrained minimum variance (LCMV) beamformer spatial filters (51) were then computed on the preprocessed MEG data and applied to project the single-trial time series into source space. The number of epochs across conditions was equalized (by the lowest number of epochs across conditions within each study). We applied a frequency analysis to the 4-s segments of all three conditions (original, 7-Chan and 3-Chan) calculating multi-taper frequency transformation (dpss taper: 0–25 Hz in 0.25 Hz steps, 4 Hz smoothing, no baseline correction). For the coherence calculation between each virtual sensor and the acoustic speech envelope, 0.25-Hz frequency steps were chosen. Then, the coherence between activity at each virtual sensor and the acoustic speech envelope during acoustic stimulation in the frequency spectrum was calculated and averaged across trials. We refer to the coherence between acoustic speech envelope and brain activity as neural speech tracking. Most studies on neural speech tracking report findings of frequencies below 7 Hz; we, therefore, analysed frequencies between 2 and 7 Hz. 
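The coherence measure can be illustrated with simulated data. The sketch below is not the FieldTrip multi-taper pipeline used here: it swaps in Welch-style averaging over non-overlapping 4-s Hann-windowed segments via `scipy.signal.coherence`, and the sampling rate, the 4 Hz rhythm and the noise levels are hypothetical.

```python
import numpy as np
from scipy.signal import coherence

fs = 100                 # Hz, hypothetical sampling rate
seg_len = 4 * fs         # 4-s segments give a 0.25 Hz frequency resolution
rng = np.random.default_rng(0)
t = np.arange(240 * fs) / fs  # four minutes of simulated data

# Simulated speech envelope with a 4 Hz rhythm, and a noisy "virtual sensor"
# that follows the same rhythm with a phase lag.
envelope = np.sin(2 * np.pi * 4 * t) + 0.5 * rng.standard_normal(t.size)
source = 0.5 * np.sin(2 * np.pi * 4 * t - 0.3) + rng.standard_normal(t.size)

# Magnitude-squared coherence, averaged across the 4-s segments.
freqs, coh = coherence(envelope, source, fs=fs, window="hann",
                       nperseg=seg_len, noverlap=0)

# Restrict to the 2-7 Hz range analysed in the study.
band = (freqs >= 2) & (freqs <= 7)
peak_freq = freqs[band][np.argmax(coh[band])]
mean_tracking = coh[band].mean()
print(f"peak at {peak_freq:.2f} Hz, mean 2-7 Hz coherence {mean_tracking:.3f}")
```

In source space the same computation would be repeated for every virtual sensor, yielding one coherence spectrum per voxel and condition.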
We applied repeated-measures ANOVAs for each frequency within the range (ft_statfun_depsamplesFunivariate in FieldTrip) to test modulations of neural measures across the different intelligibility levels. To control for multiple comparisons, a nonparametric cluster-based permutation test was undertaken (52). The test statistic was recomputed 10,000 times on data shuffled across conditions, and on each repetition the largest statistical value of a coherent cluster in source space was retained. The observed clusters were compared against the distribution obtained from this randomization procedure and were considered significant when their probability was below 5%. Effects were identified in source space. All voxels within the cluster and the corresponding individual coherence and power values were extracted and averaged. Post hoc paired-samples t-tests between conditions were corrected for multiple comparisons using the FDR method (44) implemented in Pingouin (53). Slopes of the change in coherence along with changes in intelligibility were compared across studies using a Mann-Whitney U test. For visualization, source localizations were averaged across the 2–7 Hz frequency range and mapped onto inflated surfaces as implemented in FieldTrip.
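The logic of the cluster-based permutation test can be sketched in a simplified form. The example below is an illustration under stated assumptions rather than the FieldTrip implementation: clusters are formed only along one dimension (adjacent frequencies) instead of in source space, the cluster statistic is the summed F-value, the data are simulated, and all function names are hypothetical.

```python
import numpy as np
from scipy.stats import f as f_dist


def rm_anova_f(data):
    """One-way repeated-measures F per feature.

    data has shape (subjects, conditions, features).
    """
    n, k, _ = data.shape
    grand = data.mean(axis=(0, 1))
    cond_mean = data.mean(axis=0)   # (conditions, features)
    subj_mean = data.mean(axis=1)   # (subjects, features)
    ss_cond = n * ((cond_mean - grand) ** 2).sum(axis=0)
    resid = data - cond_mean[None] - subj_mean[:, None] + grand
    ss_err = (resid ** 2).sum(axis=(0, 1))
    return (ss_cond / (k - 1)) / (ss_err / ((n - 1) * (k - 1)))


def max_cluster_mass(f_vals, threshold):
    """Largest summed F over a run of adjacent supra-threshold features."""
    best = current = 0.0
    for value in f_vals:
        current = current + value if value > threshold else 0.0
        best = max(best, current)
    return best


def cluster_permutation_test(data, threshold, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    observed = max_cluster_mass(rm_anova_f(data), threshold)
    null = np.empty(n_perm)
    n_cond = data.shape[1]
    for i in range(n_perm):
        # Shuffle condition labels independently within each subject.
        shuffled = np.stack([subj[rng.permutation(n_cond)] for subj in data])
        null[i] = max_cluster_mass(rm_anova_f(shuffled), threshold)
    p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, p_value


# Simulated coherence values: 20 subjects, 3 conditions, 21 frequency bins,
# with a condition effect confined to bins 8-12.
rng = np.random.default_rng(0)
n_subj, n_cond, n_freq = 20, 3, 21
data = rng.normal(0.0, 0.5, (n_subj, n_cond, n_freq))
data[:, 1, 8:13] += 0.5
data[:, 2, 8:13] += 1.0

threshold = f_dist.ppf(0.95, n_cond - 1, (n_subj - 1) * (n_cond - 1))
observed, p_value = cluster_permutation_test(data, threshold, n_perm=500)
print(f"cluster mass = {observed:.1f}, p = {p_value:.4f}")
```

Keeping only the largest cluster statistic per permutation is what controls the family-wise error rate across all features at once.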
Peak Analysis
For further analysis of the coherence spectra in source space, we extracted the most prominent peaks in the low frequency range (2-7Hz) across all virtual channels in which we observed a significant difference across vocoding levels (579 channels for Study#1 and 417 for Study#2; see Fig. 2 B, E). This was operationalized by using FOOOF (12) to flatten the coherence spectrum at each virtual channel and compute Gaussian model fits to extract peaks. For each subject, the average peak height, bandwidth and center frequency of the extracted peaks (see Fig. 3) were computed. Peaks were only considered if they exceeded a threshold relative to the aperiodic slope of 1.5 standard deviations (peak_threshold=1.5). Bad model fits were dropped (one bad model fit in Study#2). If the residual model fits differed from the rest based on the R2 (between the input spectrum and the full model fit), or error of the full model fit by more than 2.5 SDs, they were dropped. The most prominent peak in the range between 2 and 7 Hz was extracted per virtual channel. Peak and aperiodic parameters were then averaged across all virtual channels and further analysed using repeated-measures ANOVAs and dependent-samples t-tests (as implemented in Pingouin (53)) for post-hoc analysis (corrected for multiple comparisons using the FDR method (44)).
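The idea behind this peak extraction can be illustrated with a deliberately simplified stand-in for FOOOF. The sketch below does not use the FOOOF package and does not reproduce its iterative algorithm: it fits a straight line in log-log space as the aperiodic component, subtracts it, and fits a single Gaussian to the flattened spectrum. The function names, the synthetic spectrum and the convention that bandwidth equals twice the Gaussian standard deviation are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import curve_fit


def gaussian(f, height, center, width):
    return height * np.exp(-((f - center) ** 2) / (2 * width ** 2))


def extract_peak(freqs, spectrum, peak_threshold=1.5):
    """Flatten a spectrum and fit one Gaussian to its most prominent peak."""
    log_f, log_s = np.log10(freqs), np.log10(spectrum)
    # Aperiodic component: straight line in log-log space (slope and offset).
    slope, offset = np.polyfit(log_f, log_s, 1)
    flat = log_s - (offset + slope * log_f)
    i_max = int(np.argmax(flat))
    # Keep the peak only if it exceeds the threshold, expressed in SDs
    # of the flattened spectrum.
    if flat[i_max] < peak_threshold * np.std(flat):
        return None
    p0 = [flat[i_max], freqs[i_max], 0.5]
    bounds = ([0.0, freqs[0], 0.1], [np.inf, freqs[-1], 3.0])
    (height, center, width), _ = curve_fit(gaussian, freqs, flat,
                                           p0=p0, bounds=bounds)
    return {"center_frequency": center, "height": height,
            "bandwidth": 2 * width, "slope": slope, "offset": offset}


# Synthetic coherence spectrum (2-7 Hz, 0.25 Hz steps): a 1/f-like
# background plus a peak at 4.5 Hz. Values are hypothetical.
freqs = np.arange(2.0, 7.25, 0.25)
log_spec = -1.5 - 1.0 * np.log10(freqs) + gaussian(freqs, 0.5, 4.5, 0.4)
peak = extract_peak(freqs, 10 ** log_spec)
print(peak)
```

Because the line is fitted to the whole spectrum, the peak slightly biases the aperiodic estimate here; FOOOF reduces this bias by refitting the aperiodic component after peak removal.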
Analysis of modulation- and syllable rate
We estimated the modulation and syllabic rate of all twelve audio files for each condition (Original, 7-Channels, 3-Channels). Audio files were split into segments of 6 s duration (as in (6)), resulting in 386 audio segments per condition. The modulation rates for the three different levels of intelligibility were then extracted using custom MATLAB scripts taken from (6). The center frequency of each modulation spectrum was defined as the frequency at its global maximum. The realized syllable rate of the speaker was computed using Praat (18). The center frequencies of the three conditions and the syllable rate were then compared using repeated-measures ANOVAs and dependent-samples t-tests (as implemented in Pingouin (53)) for post hoc analysis (corrected for multiple comparisons using the FDR method (44)). Afterwards, an ensemble (50 classifiers) of k-nearest neighbor (kNN) classifiers was trained in a nested 5-fold cross-validation (54) to decode whether a given frequency can be associated with either the modulation or the syllabic rate. We chose kNN classifiers because the data had only a low number of features (i.e. one center frequency per audio segment), a classification problem that is usually solved well by a kNN approach (55). The repeated nested cross-validation procedure was chosen to avoid overfitting of hyperparameters. Each external cross-validation loop was embedded in a repeated stratified k-fold procedure (RepeatedStratifiedKFold; 25 repetitions). The best number of neighbors was determined by searching the hyperparameter space for the best cross-validation (CV) score of a kNN model using the GridSearchCV function, with the area under the receiver operating characteristic curve (ROC-AUC) as the scoring metric.
Confusion matrices were then computed on a separate test set (10% of all data) that was not part of the initial inner cross-validation to avoid overfitting of hyperparameters. Confusion matrices of each inner loop were kept in memory and averaged across all repetitions (150 repetitions). The procedure was implemented using scikit-learn (56) and custom-written Python scripts. The code used for the analysis can be found in the corresponding author's GitLab repository (see Data/Code Availability). The trained classifiers were subsequently applied to the center frequencies from speech-brain coherence (see Peak Analysis) to determine whether a frequency was more likely related to the modulation rate or to the syllabic rate. The corresponding probabilities were then compared using a two-way repeated-measures ANOVA with the factors tracking (Modulation/Syllable rate) and vocoding (Original, 7-Channels, 3-Channels).
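The nested cross-validation described above can be sketched with scikit-learn. The example is a reduced illustration and not the analysis code from the repository: the center frequencies are simulated, a single kNN classifier replaces the ensemble of 50, the outer loop uses 2 instead of 25 repetitions, and the grid of neighbor counts is hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     StratifiedKFold)
from sklearn.neighbors import KNeighborsClassifier

# Simulated center frequencies (Hz), one feature per audio segment.
rng = np.random.default_rng(0)
modulation = rng.normal(5.5, 0.6, 200)   # label 1: modulation rate
syllable = rng.normal(4.0, 0.4, 200)     # label 0: syllable rate
X = np.concatenate([modulation, syllable]).reshape(-1, 1)
y = np.concatenate([np.ones(200), np.zeros(200)]).astype(int)

param_grid = {"n_neighbors": [3, 5, 11, 21]}  # hypothetical search space


def make_search():
    # Inner loop: choose the number of neighbors by ROC-AUC.
    inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    return GridSearchCV(KNeighborsClassifier(), param_grid,
                        scoring="roc_auc", cv=inner)


# Outer loop: evaluate the tuned model on folds it has never seen.
outer = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
matrices = []
for train_idx, test_idx in outer.split(X, y):
    search = make_search().fit(X[train_idx], y[train_idx])
    pred = search.predict(X[test_idx])
    matrices.append(confusion_matrix(y[test_idx], pred, normalize="true"))
mean_confusion = np.mean(matrices, axis=0)

# Apply a classifier trained on all audio segments to hypothetical
# center frequencies of speech-brain coherence peaks.
final = make_search().fit(X, y)
coherence_peaks = np.array([[4.1], [5.6]])
proba = final.predict_proba(coherence_peaks)  # columns: [syllable, modulation]
print(mean_confusion)
print(proba)
```

Tuning the number of neighbors only inside the inner folds keeps the outer test folds untouched, so the averaged confusion matrix is not inflated by hyperparameter selection.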
Modeling of subcortical auditory activity
In order to reconstruct auditory brainstem activity from the MEG data, we applied a recently developed backward modelling approach (17) to the data obtained in Study#2. As planar gradiometers are less sensitive to sources below the cortical surface than magnetometers (57), only magnetometer data was included in this analysis. The backward models were trained independently for each subject using data obtained from a localizer run designed to elicit auditory brainstem activity (see (17) for a detailed account). In brief, we used the signal captured by the MEG sensors (during the first 10 ms) as regressors for a concurrent EEG recording of an auditory brainstem response (similar to the estimation of regression-based ERPs (58)). The corresponding weights (a time-generalized representation of auditory brainstem activity) were then applied to the upsampled (10,000 Hz) single-trial time series data from Study#2. Afterwards, the data was downsampled (100 Hz) and a frequency analysis was applied to the 4-s segments of all three conditions (original, 7-Chan and 3-Chan) using a multi-taper frequency transformation (dpss taper: 0–25 Hz in 0.25 Hz steps, 4 Hz smoothing, no baseline correction) to compute the coherence between the modeled activity and the acoustic speech envelope. The remaining analysis steps were the same as for the source-space data (see the statistics reported under Source Analysis and the steps described under Peak Analysis and the decoding analysis).
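The regression at the core of this backward model can be sketched as follows. This is a schematic illustration and not the procedure of (17): the data are simulated, the ridge penalty is an assumption added for numerical stability, and the function names are hypothetical.

```python
import numpy as np


def fit_backward_model(meg, eeg, ridge=1e-3):
    """Regress an EEG brainstem response on MEG sensor data.

    meg: (n_samples, n_sensors), eeg: (n_samples,).
    Returns one weight per sensor.
    """
    meg = meg - meg.mean(axis=0)
    eeg = eeg - eeg.mean()
    gram = meg.T @ meg
    # Ridge penalty scaled to the average sensor variance (an assumption).
    penalty = ridge * np.trace(gram) / gram.shape[0]
    return np.linalg.solve(gram + penalty * np.eye(gram.shape[0]), meg.T @ eeg)


def apply_backward_model(weights, meg):
    """Project new MEG data through the spatial filter."""
    return meg @ weights


# Simulated localizer run: 102 magnetometers and a concurrent EEG channel.
rng = np.random.default_rng(0)
n_sensors = 102
true_w = rng.standard_normal(n_sensors)
loc_meg = rng.standard_normal((2000, n_sensors))
loc_eeg = loc_meg @ true_w + 0.5 * rng.standard_normal(2000)
weights = fit_backward_model(loc_meg, loc_eeg)

# Apply the trained weights to a separate (simulated) measurement.
new_meg = rng.standard_normal((4000, n_sensors))
modeled = apply_backward_model(weights, new_meg)
```

The resulting single time series plays the role of a virtual sensor, to which the coherence, peak and decoding analyses can then be applied.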
Competing Interests
The authors declare no competing financial interests.
Data/Code Availability
The data and code necessary for generating the figures and computing statistics will be shared in the corresponding author's GitLab repository (https://gitlab.com/schmidtfa). Access to raw data will be made available upon reasonable request.
Author contributions
Conceptualization: NW, AH, AK, FS, YC
Data curation: YC, FS, AH
Formal Analysis: FS, AH, YC
Funding acquisition: NW, AK, SR, RH, MS
Investigation: YC, FS, AH
Methodology: FS, NW
Project administration: NW
Software: FS, AH
Supervision: NW, RH, MS, AH, AK
Visualization: FS
Writing—original draft: FS
Writing—review & editing: AH, AK, NW, YC, RH
Acknowledgements
This research was supported by the Sivantos GmbH and an FWF Einzelprojekt (P 31230).