Delta-band Cortical Tracking of Acoustic and Linguistic Features in Natural Spoken Narratives

Speech contains rich acoustic and linguistic information. During speech comprehension, cortical activity tracks the acoustic envelope of speech. Recent studies also observe cortical tracking of higher-level linguistic units, such as words and phrases, using synthesized speech that lacks a delta-band acoustic envelope. It remains unclear, however, how cortical activity jointly encodes the acoustic and linguistic information in natural speech. Here, we investigate the neural encoding of words and demonstrate that delta-band cortical activity tracks the rhythm of multi-syllabic words when listeners naturally attend to narratives. Furthermore, by dissociating the word rhythm from the acoustic envelope, we find that cortical activity primarily tracks the word rhythm during speech comprehension. When listeners' attention is diverted, however, neural tracking of words diminishes, and delta-band activity becomes phase locked to the acoustic envelope. These results suggest that large-scale cortical dynamics in the delta band are primarily coupled to the rhythm of linguistic units during natural speech comprehension.


When listening to speech, low-frequency cortical activity in the delta (< 4 Hz) and theta (4-8 Hz) bands is phase locked to speech (Keitel et al.). …

Here, we first asked whether cortical activity could track disyllabic words composed of two monosyllabic morphemes in semantically coherent stories. The story was either naturally read or synthesized as an isochronous sequence of syllables, as in previous studies that demonstrated linguistic structure tracking (Ding et al., 2016a). We then asked whether neural tracking …

Neural tracking of words in isochronously presented narratives

We first presented semantically coherent stories that were synthesized as an isochronous sequence of syllables (Fig. 1A, left). For the stories with a metrical structure, every other syllable must be a word onset. More specifically, the odd terms in the metrical syllable sequence were always the initial syllable of a word, while the even terms were either the second syllable of a disyllabic word (77% probability) or a monosyllabic word (23% probability). In the following, the odd terms of the syllable sequence are referred to as σ1 and the even terms as σ2. Since the syllables were presented at a constant rate of 4 Hz, the neural response tracking syllables was frequency tagged at 4 Hz. Furthermore, since every other syllable in the sequence was the onset of a word, neural activity phase locked to word onsets was expected to show a regular rhythm at half of the syllabic rate, i.e., 2 Hz (Fig. 1A, right).

As a control condition, we also presented stories with a nonmetrical structure (Fig. 1B), referred to as the nonmetrical stories in the following. In these stories, the word duration was not constrained and σ1 was not always a word onset. Consequently, the word onsets in these stories did not show rhythmicity at 2 Hz, and neural activity phase locked to word onsets was not frequency tagged at 2 Hz.

Figure 1. (A) … monosyllabic words, so that the odd terms in the syllable sequence (referred to as σ1) must be the onset of a word. Here the onset syllable of each word is shown in bold. All syllables are presented at a constant rate of 4 Hz. A 500-ms gap is inserted at the position of any punctuation. The red curve illustrates cortical activity that is phase locked to word onsets. It shows a 2-Hz rhythm, which can be clearly observed in the spectrum shown on the right.
The stories are in Chinese; the English examples are shown for illustrative purposes. (B) In the nonmetrical stories, word onsets are not regularly positioned, and activity that is phase locked to word onsets does not show 2-Hz rhythmicity. (C) Natural speech. The stories are naturally read by a human speaker and the duration of syllables is not constrained. During the analysis, the response to natural speech is time warped to simulate the response to isochronous speech, and the time-warped word-tracking response is expected to show a 2-Hz rhythm. (D) Amplitude-modulated isochronous speech is constructed by amplifying either σ1 or σ2 by a factor of 4, creating a 2-Hz acoustic envelope. The red and blue curves illustrate responses that are phase locked to word onsets and to the amplified syllable, respectively. The word-tracking response is identical for σ1- and σ2-amplified speech, while the response tracking the amplified syllable shows a 180° phase shift between conditions.

All stories were presented to two groups of participants who performed different tasks. One group of participants was asked to attend to the stories and answer 3 comprehension questions after each story. These participants correctly answered 96% ± 9% and 94% ± 9% of the questions for metrical and nonmetrical stories, respectively. The other group of participants, however, was asked to watch a silent movie while listening and did not have to answer …

Figure 2A shows the responses from participants who attended to the stories.
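The frequency-tagging logic can be illustrated with a short simulation (a minimal sketch under assumed parameters, not the analysis code used in the study): a response component locked to word onsets at half the 4-Hz syllable rate produces a discrete spectral peak at exactly 2 Hz.

```python
import numpy as np

# Sketch: frequency tagging of word-rate responses. Syllables arrive at
# 4 Hz; every other syllable is a word onset, so a response phase locked
# to word onsets repeats at 2 Hz and is tagged at that frequency.
fs = 100                      # EEG sampling rate in Hz (assumed)
t = np.arange(0, 50, 1 / fs)  # 50 s of simulated response

rng = np.random.default_rng(0)
response = (np.cos(2 * np.pi * 4 * t)          # syllable-tracking component
            + 0.8 * np.cos(2 * np.pi * 2 * t)  # word-tracking component
            + rng.standard_normal(t.size))     # background noise

# DFT of the full recording: with an integer number of cycles in the
# window, the tagged frequencies fall exactly on DFT bins.
spectrum = np.abs(np.fft.rfft(response)) / t.size
freqs = np.fft.rfftfreq(t.size, 1 / fs)

peak_2hz = spectrum[np.argmin(np.abs(freqs - 2))]
neighbors = spectrum[(freqs > 2.2) & (freqs < 3.8)]
print(peak_2hz, neighbors.max())  # the 2-Hz word-rate peak stands out
```

Because the recording contains an integer number of 2-Hz cycles, no spectral leakage smears the tagged peak into neighboring bins.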

The response topography showed a centro-frontal distribution, which was maximal near channel FCz. …

Neural tracking of words in natural spoken narratives

We next asked whether cortical activity could track disyllabic words in natural speech. The same set of stories used in the isochronous speech condition was naturally read by a human speaker and presented to the participants. The participants correctly answered 95% ± 4% and 97% ± 6% of the questions for metrical and nonmetrical stories, respectively. In natural speech, syllables were not produced at a constant rate (Fig. 1C); therefore, the syllable- and …

… were referred to as the σ1-amplified and σ2-amplified conditions. When listening to amplitude-modulated speech, the participants correctly answered 94% ± 12% and 97% ± 8% of the questions in the σ1-amplified and σ2-amplified conditions, respectively.

The ERP responses evoked by σ1 and σ2 are shown separately in Fig. 4.
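The σ1- versus σ2-amplified manipulation can be sketched as follows (syllable shape, gain, and sampling rate are illustrative assumptions, not the study's stimulus code): amplifying every other 250-ms syllable imposes a 2-Hz envelope, and switching which syllable is amplified flips that envelope's 2-Hz phase by 180° while the word onsets stay fixed.

```python
import numpy as np

# Sketch: amplitude-modulated isochronous speech. Amplifying either the
# odd (σ1) or even (σ2) 250-ms syllables by a factor of 4 creates a 2-Hz
# envelope; switching the amplified syllable shifts its phase by 180°.
fs = 1000                     # audio sampling rate in Hz (assumed)
syllable = np.ones(fs // 4)   # flat 250-ms syllable amplitude (toy shape)
n_syll = 40

def envelope(amplify_odd: bool) -> np.ndarray:
    gains = [4.0 if (i % 2 == 0) == amplify_odd else 1.0
             for i in range(n_syll)]
    return np.concatenate([g * syllable for g in gains])

def phase_at_2hz(env: np.ndarray) -> float:
    spec = np.fft.rfft(env)
    freqs = np.fft.rfftfreq(env.size, 1 / fs)
    return float(np.angle(spec[np.argmin(np.abs(freqs - 2))]))

d = phase_at_2hz(envelope(True)) - phase_at_2hz(envelope(False))
wrapped = abs((d + np.pi) % (2 * np.pi) - np.pi)
print(wrapped)  # ≈ π, i.e., a 180° phase shift between conditions
```

Since the two conditions differ only by a half-period (250-ms) shift of the amplified syllable, any envelope-tracking response inherits the 180° phase flip, whereas a word-onset-tracking response does not.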

This analysis was restricted to disyllabic words, so that the responses to σ1 and σ2 reflected the responses to the first and second syllables of disyllabic words. When participants attended to speech, the ERP responses to σ1 and σ2 showed significant differences for both isochronous (Fig. 4A) and natural speech (Fig. 4B). When participants watched a silent movie, a smaller difference was also observed between the ERP responses to σ1 and σ2 (Fig. 4A). The topography of the ERP difference wave showed a centro-frontal distribution. For isochronous speech, the ERP latency could not be unambiguously interpreted since the stimulus was strictly periodic. For natural speech, the ERP responses to σ1 and σ2 differed in a time window between 300- and 500-ms latency. The results for amplitude-modulated isochronous speech are shown in Fig. 4C,D. When the participants attended to speech, a sharp ERP near the onset of the amplified syllable was observed (Fig. 4C). When the participants attended to the silent movie, the ERP was also stronger for the amplified syllable, but the waveform was smoother (Fig. 4D).

… A previous study on the neural responses to naturally spoken sentences has also shown that the initial syllable of an English word elicits a larger N1 and N200-300 than the middle syllable (Sanders and Neville, 2003). A recent study also suggests that the word onset in natural speech elicits a response at ~100-ms latency (Brodbeck et al., 2018). Language difference is a potential reason why the current study did not observe this early effect: in Chinese, syllables are generally also morphemes, while in English most syllables do not carry meaning. The 400-ms latency response observed here is consistent with the hypothesis that the N400 is related to lexical processing (Friederici, 2002; Kutas and Federmeier, 2011).
It is also possible that the N400 is weaker for the second syllable, since the second syllable in a word is more predictable than the first syllable (Lau et al., 2008).

The difference between the ERPs evoked by the first and second syllables is amplified by attention (Fig. 4A,B). Nevertheless, the ERP difference remains significant when participants attend to a silent movie, while the 2-Hz word-rate response in the power spectrum is no longer significant (Fig. 2B). In the response phase analysis, however, the 2-Hz response phase still shows consistency across participants during the movie-watching task (Fig. 3). The inter-participant phase consistency is likely to explain why the ERP, which is averaged over participants, can more reliably reflect neural …

In sum, the current study strongly suggests that cortical activity can track the word rhythm during naturalistic speech listening and delta-band EEG …

… constrained. These two kinds of stories were referred to as metrical stories (Fig. 1A) and nonmetrical stories (Fig. 1B), respectively. Ideally, the metrical stories would be constructed solely from disyllabic words, forming a constant disyllabic word rhythm. Since it was difficult to construct such materials, however, the stories were constructed with disyllabic words and pairs of monosyllabic words. In other words, whenever a monosyllabic word appeared, it appeared in a pair. After the stories were composed, the word boundaries within stories were further parsed by a Natural Language Processing (NLP) algorithm (Zhang and Shang, 2019). The parsing result confirmed that every other syllable in the story (referred to as σ1 in Fig. 1A) was the onset of a word. Of the other syllables (referred to as σ2 in Fig. 1A), 77% were the second syllable of a disyllabic word, while 23% were monosyllabic words.

… were not aware of the purpose of the study. In the natural speech, syllables were not produced at a constant rate, and the boundaries between syllables were labelled by professionals (Fig. 1C).
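Given labelled syllable boundaries, the time warping illustrated in Fig. 1C could be implemented roughly as below (a minimal sketch with an assumed sampling rate and linear interpolation; the authors' exact procedure may differ): each interval between syllable onsets in the response is resampled onto a fixed 250-ms grid, simulating a response to 4-Hz isochronous speech.

```python
import numpy as np

# Sketch: time warping a natural-speech response onto an isochronous
# grid. Each interval between labelled syllable onsets is linearly
# resampled to a fixed 250-ms duration (fs and onsets are toy values).
fs = 100  # EEG sampling rate in Hz (assumed)

def time_warp(response, onsets_s, target_interval_s=0.25):
    n_target = int(round(target_interval_s * fs))
    warped = []
    for start, stop in zip(onsets_s[:-1], onsets_s[1:]):
        seg = response[int(start * fs):int(stop * fs)]
        # map the variable-length segment onto a fixed-length grid
        warped.append(np.interp(np.linspace(0, 1, n_target),
                                np.linspace(0, 1, seg.size), seg))
    return np.concatenate(warped)

onsets = [0.0, 0.2, 0.55, 0.85, 1.2]   # irregular syllable onsets (s)
resp = np.random.default_rng(1).standard_normal(int(1.2 * fs) + 1)
warped = time_warp(resp, onsets)
print(warped.size)  # 4 intervals x 25 samples = 100
```

After warping, syllable onsets fall at exact multiples of 250 ms, so responses locked to syllables and word onsets become frequency tagged at 4 Hz and 2 Hz, respectively.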
The total duration of speech was 1122 s for the 21 metrical stories and 372 s for the 7 nonmetrical stories. …

… The next story was presented after an interval randomized between 1 and 2 s (uniform distribution) following the key press. The participants took a break between the two blocks. …

… The stories were presented ~5 minutes after the movie started, to make sure that participants were already engaged in the movie-watching task. The interval between stories was randomized between 1 and 2 s (uniform distribution). The movie was stopped after all 28 stories were presented. …

In the frequency-domain analysis, the statistical tests were performed using the bias-corrected and accelerated bootstrap (Efron and Tibshirani, 1994). In the bootstrap procedure, all participants were resampled with replacement 10000 times. To test the significance of the 2-Hz and 4-Hz peaks in the response spectrum (Fig. 2A-E), the response amplitude at the peak frequency was … To test the phase difference between conditions, the V% confidence interval of the phase difference was measured as the smallest angle that covered V% of the 10000 resampled phase differences. In the inter-trial phase coherence test and the inter-participant phase coherence test (Fig. 3), 10000 phase coherence values were generated based on the null distribution, i.e., a uniform phase distribution. If the actual phase coherence was smaller than N of the 10000 phase coherence values generated from the null distribution, its significance level was (N + 1)/10001. When multiple comparisons were performed, the p-values were adjusted using the false discovery rate (FDR) correction (Benjamini and Hochberg, 1995).
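The inter-participant phase coherence test described above can be sketched as follows (the participant count, phase distribution, and random seed are illustrative assumptions):

```python
import numpy as np

# Sketch: inter-participant phase coherence at 2 Hz tested against a
# uniform-phase null distribution, with the significance level computed
# as in the text: p = (N + 1)/10001, where N null values exceed the data.
rng = np.random.default_rng(2)

def phase_coherence(phases):
    # length of the mean resultant vector of the unit phasors
    return np.abs(np.mean(np.exp(1j * phases)))

observed_phases = rng.normal(0.5, 0.4, size=16)   # hypothetical 2-Hz phases
observed = phase_coherence(observed_phases)

# 10000 coherence values under the null (phases uniform on the circle)
null = np.array([phase_coherence(rng.uniform(-np.pi, np.pi, 16))
                 for _ in range(10000)])
n_larger = int(np.sum(null > observed))   # null values exceeding the data
p = (n_larger + 1) / 10001
print(observed, p)
```

Adding 1 to both the count and the denominator keeps the estimated p-value strictly positive even when no null sample exceeds the observation.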

In the time-domain analysis (Fig. 4), the significance of the ERP difference between conditions was determined using the cluster-based permutation test (Maris and Oostenveld, 2007). The test was performed following the steps given below:

(1) The ERPs for each participant in the two conditions were pooled into the same …

… Table 1).
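A minimal single-channel version of the cluster-based permutation test (Maris and Oostenveld, 2007) might look like the sketch below; the data shapes, t threshold, and paired sign-flip scheme are assumptions, not the study's exact implementation.

```python
import numpy as np

# Sketch of a single-channel cluster-based permutation test: cluster
# statistics are summed |t| values over contiguous supra-threshold time
# points, compared against a max-cluster null from label permutations.
rng = np.random.default_rng(3)

def max_cluster_mass(t_vals, thresh=2.0):
    # largest summed |t| over a contiguous supra-threshold run
    mass = np.abs(t_vals)
    best = run = 0.0
    for m, supra in zip(mass, mass > thresh):
        run = run + m if supra else 0.0
        best = max(best, run)
    return best

def paired_t(d):
    # one-sample t over participants at each time point
    return d.mean(0) / (d.std(0, ddof=1) / np.sqrt(d.shape[0]))

def cluster_test(a, b, n_perm=1000):
    # a, b: (participants, time) ERPs in the two conditions
    observed = max_cluster_mass(paired_t(a - b))
    null = np.empty(n_perm)
    for i in range(n_perm):
        # randomly flip each participant's condition labels
        flip = rng.choice([1.0, -1.0], size=a.shape[0])[:, None]
        null[i] = max_cluster_mass(paired_t((a - b) * flip))
    return (np.sum(null >= observed) + 1) / (n_perm + 1)

a = rng.standard_normal((16, 100))
b = rng.standard_normal((16, 100))
a[:, 40:60] += 1.0               # injected condition difference
p = cluster_test(a, b)
print(p)
```

Because only the maximum cluster mass per permutation enters the null distribution, the test controls the family-wise error rate across time points without a per-point correction.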

The power was above 0.8 for all conditions, suggesting that the sample size was large enough even for the more conservative t-test.