Abstract
Studies of speech processing investigate the relationship between temporal structure in speech stimuli and neural activity (Giraud & Poeppel, 2012; Gross et al., 2013; Park, Ince, Schyns, Thut, & Gross, 2015). The speech envelope, a representation of speech that captures amplitude fluctuations in the acoustic speech signal, has been used to probe this temporal relationship. The envelope is most commonly understood to reflect the succession of syllables at ∼5 Hz (Ding et al., 2017; Greenberg, Carvey, Hitchcock, & Chang, 2003; Räsänen, Doyle, & Frank, 2018; Varnet, Ortiz-Barajas, Erra, Gervain, & Lorenzi, 2017). Despite clear evidence for auditory tracking of the envelope at lower frequencies (Gross et al., 2013; Park et al., 2015), it is not well understood what linguistic information is captured in the speech envelope beyond the syllable level. Here, we harness linguistic theory to focus on Intonation Units (IUs), while analyzing their temporal structure as captured in the speech envelope.
IUs are prosodic units (Chafe, 1994; Du Bois, Cumming, Schuetze-Coburn, & Paolino, 1992; Himmelmann, Sandler, Strunk, & Unterladstetter, 2018) which are defined by a specific pattern of syllable delivery, together with resets in pitch and articulatory force. Linguistic studies of spontaneous speech indicate that this prosodic segmentation paces new information in language use across numerous and diverse languages. Therefore, IUs provide an important structural cue for the cognitive dynamics of producing and comprehending speech universally.
We study the relation between IUs and the periodic components of the speech envelope in six languages. We apply analysis methods from investigations of neural synchronization (Vinck, van Wingerden, Womelsdorf, Fries, & Pennartz, 2010) to recordings of over 100 speakers in everyday speech contexts. We find that sequences of IUs form a consistent 1 Hz rhythm. Our results demonstrate that IUs form a significant periodic cue within the speech envelope that could be utilized by the cognitive and neural systems when tracking speech.
1. Introduction
Speech processing is commonly investigated by the measurement of brain activity as it relates to the acoustic speech stimulus. Such research has revealed that neural activity tracks amplitude modulations present in speech. It is generally agreed that a dominant element in neural tracking of speech is a 5 Hz rhythmic component, which corresponds to the rate of syllables in speech. The speech stimulus is also tracked at lower frequencies (<4 Hz), but the functional role of these fluctuations is not fully understood. They are assumed to relate to the “musical” elements of speech above the word level, collectively termed prosody. However, in the neuroscience literature, prosody is rarely investigated for its structure and function in cognition.
In linguistics, on the other hand, prosody is investigated for its structural and functional roles in speech and cognition. This line of research has identified prosodic segmentation cues that are common to all languages, and that characterize what is termed Intonation Units (IUs; Figure 1a). Importantly, in addition to providing a systematic segmentation to ongoing naturalistic speech, IUs capture the pacing of information, parceling a maximum of one new idea per IU. Thus, IUs provide a valuable construct for quantifying how ongoing speech serves cognition in individual and interpersonal contexts. The first goal of our study is to introduce this understanding of prosodic segmentation from linguistic theory to the neuroscientific community. The second goal is to put forth a temporal characterization of IUs, and hence offer a precise, theoretically driven interpretation of the low-frequency auditory tracking and its relevance to cognition.
(a) An example intonation unit sequence from a conversation in Du Bois et al. (2005). (b) Illustration of one of the characteristics contributing to the delimitation of IUs: the fast-slow dynamic of syllables. A succession of short syllables is followed by comparatively longer ones; new units are cued by the resumed rush in syllable rate following the lengthening (syllable duration measured in ms). (c) Illustration of the phase-consistency analysis: 2-second windows of the speech envelope (green) were extracted around each IU onset (gray vertical line), decomposed and compared for consistency of phase angle within each frequency. (d) Illustration of the IU-onset permutation in time, which was used to compute the randomization distribution of phase consistency spectra (see Materials and Methods).
Regardless of the phonological, morphosyntactic and so-called rhythmic structure of a language, when speakers talk, they produce their utterances in chunks with a specific prosodic profile. The prosodic profile combines rhythmic, melodic, and articulatory characteristics (Chafe, 1994; Du Bois et al., 1992; Himmelmann et al., 2018). Rhythmically, chunks may be delimited by pauses, but more importantly, by a fast-slow dynamic of syllables (Figure 1b). Melodically, chunks have a continuous pitch contour, which is typically sharply reset at the onset of a new unit. In terms of articulation, the degree of contact between articulators is strongest at the onset of an IU (Keating, Cho, Fougeron, & Hsu, 2003).
Across languages, these auditorily-defined units are also functionally comparable in that they pace the flow of information in the course of speech (Chafe, 1987, 1994; Du Bois, 1987; Pawley & Syder, 2000). For example, when speakers develop a narrative, they do so gradually, introducing the setting, participants and the course of events in sequences of IUs, where no more than one new piece of information relative to the preceding discourse is added per IU (Box 1). This has been demonstrated both by means of qualitative discourse analysis (e.g., Chafe, 1987, 1994; Ono & Thompson, 1995), and by quantifying the average number of content items per IU. Specifically, the number of content items per IU was found to be very similar across languages despite strikingly different grammatical profiles (Himmelmann et al., 2018). Another example of the shared role of IUs in different languages pertains to the way speakers plan their (speech) actions. When speakers coordinate a transition during a turn-taking sequence, they rely on prosodic segmentation (i.e., IUs): semantic/syntactic phrasal completion is not a sufficient cue for predicting when a transition will take place, and IU design is found to serve a crucial role in timing the next turn-taking transition (Bögels & Torreira, 2015; Ford & Thompson, 1996; Gravano & Hirschberg, 2011).
Here we use recordings of spontaneous speech in natural settings to characterize the temporal structure of sequences of IUs in six languages. The sample includes well-studied languages from the Eurasian macro area, spoken by large or relatively large speech communities (English, Russian and Hebrew), as well as much lesser-known and lesser-studied languages, spoken in the Indonesian-governed part of Papua by smaller speech communities. Importantly, our results generalize across this linguistic diversity, despite the substantial differences in socio-cultural settings and all aspects of grammar, including other prosodic characteristics (Himmelmann et al., 2018). In contrast to previous research, we estimate the temporal structure of IUs using direct time measurements rather than estimations based on word count or syllable length (Box 2). In addition, we quantify the temporal structure of IUs in relation to the speech envelope, an acoustic representation relevant to neural processing of speech. We find that sequences of IUs form a consistent 1 Hz rhythm in the six sample languages, and relate this finding to recent neuroscientific accounts of the roles of slow rhythms in speech processing.
Information is temporally structured in social interaction
Time is crucial for organizing information in interaction, not only via prosodic cues but also through body conduct, such as gaze direction, head and hand gestures, leg movements and body torques (Mondada, 2018). This is especially evident in task-oriented interaction, for example, in direction-giving sequences during navigation, or instruction sequences more broadly, where the many deictic words (e.g., this, there) can only be interpreted correctly when accompanied by a timely gesture. Following is such a fragment from a judo-instruction class (Du Bois et al., 2005). The prosody-based segmentation into IUs is represented by line breaks. To facilitate reading, and considering the current focus on function rather than form, transcription conventions were simplified (the original transcription can be retrieved with the sound file from the linked corpus). Speaker overlap is marked by square brackets ([]), minimally audible to medium pauses (up to 0.6 seconds) are marked by sequences of dots (…), and punctuation marks represent different classes of pitch movements, indicating roughly the degree of continuity between one unit and the next (comma – continuing; period – final; double dash – cut short).
What is a word?
The notion of a word has been argued to be untenable for both language-specific and cross-linguistic analyses (e.g., Haspelmath, 2011). We demonstrate why this is so for cross-linguistic comparison with the following example in Seneca, a member of the Northern Iroquoian branch of the Iroquoian language family spoken in northeastern North America (Chafe, 2015, p. 191).
The first line includes a word in the language, that is, a unit of meaning whose unit-ness is defined by morphosyntactic processes in the language. The second line includes a breakdown into meaning components (separated by hyphens and tabs), obtained through comparative evidence from related languages. Due to extensive sound changes over the years, Seneca shows a high degree of fusion between meaning components, that is, the boundaries between them are obscured and not necessarily available to speakers. Note also that these meaning components cannot normally appear as independent words, that is, without the neighboring meaning components. The third line includes a gloss per meaning component, differentiating between those with grammatical meaning (part of a grammatical paradigm; in small caps) and content items. The fourth line includes the corresponding English translation. As evident from this example, a noteworthy distinction between Seneca and some better-known languages of the world, such as English, is that Seneca regularly packages an event, its participants and other meaning components within a single morphosyntactic word. Consequently, for one and the same message, Seneca IUs would contain fewer words compared to English IUs.
2. Materials and Methods
2.1 Data
We studied the temporal structure of IUs using six corpora of conversations and narratives that were transcribed and segmented into IUs according to the unified criteria devised by Chafe, Du Bois and colleagues (Chafe, 1994; Du Bois et al., 1992). Three of the corpora were segmented by specialist teams working on their native language: The Santa Barbara Corpus of Spoken American English (Du Bois et al., 2005), The Haifa Corpus of Spoken Hebrew (Maschler et al., 2017), and The Prosodically Annotated Corpus of Spoken Russian (Kibrik & Podlesskaya, 2014). The other three corpora were segmented by teams with varying degrees of familiarity with the languages, as part of a project studying the ability to identify IUs in unfamiliar languages: The DoBeS Summits-PAGE Collection of Papuan Malay (Himmelmann & Riesberg, 2016), The DoBeS Wooi Documentation (Kirihio et al., 2015), and The DoBeS Yali Documentation (Riesberg, Walianggen, & Zöllner, 2016). Further information regarding the sample may be found in Table 1. Appendix A includes additional information regarding the construction of the sample and the coding and processing of IUs. We also analyzed, in the same manner, a corpus of an additional language that was transcribed and segmented semi-automatically into prosodic units following slightly different rhythmic and melodic criteria: Rhapsodie: A Prosodic-Syntactic Treebank for Spoken French (Lacheret et al., 2014). Further information regarding this additional analysis can be found in Appendix B. From all language samples, we extracted IU onset times, noting which speaker produced a given IU. Additionally, we computed the speech envelope for each sound file following standard procedure (Chandrasekaran, Trubanova, Stillittano, Caplier, & Ghazanfar, 2009; Gross et al., 2013; Figure 1c and Appendix A).
Summary information on the sample of speech segments used in the study.
2.2 Phase-consistency analysis
We analyzed the relation between IU onsets and the speech envelope using a point-field synchronization measure, adopted from the study of rhythmic synchronization of neural spiking activity and Local Field Potentials (Vinck et al., 2010). In this analysis, the rhythmicity of IU sequences is measured through the phase consistency of IU onsets with the periodic components of the speech envelope (Figure 1c). The speech envelope is typically modulated at a rate of 2-7 Hz (Chandrasekaran et al., 2009; Ding et al., 2017; Varnet et al., 2017), and is understood to reflect predominantly the sequence of syllables in speech. The vocalic nuclei of syllables are the main source of envelope peaks, while syllable boundaries are the main source of envelope troughs (Figure 1b). IU onsets can thus be expected to coincide with troughs in the envelope, since each IU onset is necessarily also a syllable boundary. Therefore, one can expect a high phase consistency between IU onsets and the frequency component of the speech envelope corresponding to the rhythm of syllables, at ∼5 Hz.
A less trivial finding would be a high phase consistency between IU onsets and other periodic components in the speech envelope. Specifically, since IUs typically include more than one syllable, such an effect would pertain to frequency components below ∼5 Hz. In this analysis we hypothesized that the syllable organization within IUs gives rise to slow periodic components in the speech envelope. If low-frequency components are negligible in the speech envelope, estimating the phase of the low-frequency components at the time of IU onsets would lead to random phase angles, a result that would translate to low phase consistency. In another scenario, if the speech envelope captures slow rhythmicity in language other than that arising from IUs, different IUs would occur in different phases of the lower frequency components, translating again to low phase consistency. Contrary to these scenarios, finding phase consistency at a degree higher than expected under the null hypothesis would indicate that the speech envelope captures the rhythmic characteristics of IUs, and would also characterize the period of this rhythmicity.
We extracted 2-second windows of the speech envelope centered on each IU onset, and decomposed them using an FFT with a single Hann window, no padding, and after demeaning. This yielded phase estimations for frequency components at a resolution of 0.5 Hz. Then, we measured the consistency in phase of each FFT frequency component across speech segments using the pairwise phase consistency metric (PPC; Vinck et al., 2010), yielding a consistency spectrum. We calculated consistency spectra separately for each speaker that produced > 5 IUs and averaged the spectra within each language. Note that the PPC measure is unbiased by the number of 2-second envelope windows entering the analysis (Vinck et al., 2010), and that in a turn-taking sequence, some of the 2-second envelope windows inevitably capture speech by more than one participant. The results remain unchanged when creating a single spectrum per conversation without separating speakers, or when including all speakers that produced more than two IUs (not shown).
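For concreteness, the following MATLAB sketch outlines the window extraction, FFT decomposition and PPC computation for a single speaker. It assumes a wideband envelope env sampled at 1000 Hz (see Appendix A) and a vector iu_onsets of that speaker's IU-onset times in seconds; the variable names, the restriction to frequencies up to 10 Hz, and the closed-form PPC expression (algebraically equivalent to the pairwise average of Vinck et al., 2010) are illustrative choices rather than details of the original analysis scripts.

% Phase-consistency sketch for one speaker (assumed inputs: env, iu_onsets)
fs     = 1000;                           % envelope sampling rate (Hz)
winlen = 2 * fs;                         % 2-second window -> 0.5 Hz resolution
freqs  = (0:winlen-1) * fs / winlen;
keep   = freqs <= 10;                    % frequencies of interest (assumed cut-off)
nIU    = numel(iu_onsets);
phases = nan(nIU, sum(keep));            % IU windows x frequency bins
for k = 1:nIU
    c   = round(iu_onsets(k) * fs);      % onset sample, used as window center
    idx = (c - winlen/2 + 1):(c + winlen/2);
    if idx(1) < 1 || idx(end) > numel(env), continue; end
    seg = env(idx);
    seg = seg(:) - mean(seg);            % demean
    spec = fft(seg .* hann(winlen));     % single Hann taper, no padding
    phases(k, :) = angle(spec(keep));    % phase estimate per frequency bin
end
phases = phases(~any(isnan(phases), 2), :);   % drop windows falling outside the recording

% Pairwise phase consistency (PPC) per frequency bin:
% PPC = (|sum_j exp(i*theta_j)|^2 - N) / (N*(N-1)),
% the unbiased mean cosine of all pairwise phase differences.
N   = size(phases, 1);
ppc = (abs(sum(exp(1i * phases), 1)).^2 - N) ./ (N * (N - 1));

Per-speaker spectra obtained in this way would then be averaged within each language, as described above.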
2.3 Statistical assessment
We assessed the statistical significance of peaks in the average consistency spectra using a randomization procedure (Figure 1d). Per language, we created a randomization distribution of consistency estimates with 1000 sets of average surrogate spectra. These surrogate spectra were calculated using the speech envelope as before, but with temporally permuted IU onsets that maintained the association with envelope troughs. Troughs were defined by a magnitude threshold of 0.01 (on a scale of 0-1) and a minimal interval between troughs of 200 ms, as would be expected from syllables on average. By constraining the temporal permutation of IU onsets, we account for the fact that each IU onset is necessarily a syllable onset and is therefore expected to align with a trough in the envelope. We then calculated, for each frequency, the proportion of consistency estimates (in the 1000 surrogate spectra) that were greater than the consistency estimate obtained for the observed IU sequences. We corrected p-values for multiple comparisons across frequency bins, ensuring that, on average, the false discovery rate (FDR) does not exceed 1% (Genovese, Lazar, & Nichols, 2002).
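The randomization itself can be sketched as follows, continuing the example above. Here obsPPC stands for an observed consistency spectrum, and ppc_spectrum(env, onsets) is a hypothetical helper wrapping the window extraction and PPC computation shown in the previous sketch. The trough-detection details (findpeaks on the inverted envelope, reading the 0.01 threshold as an upper bound on the envelope value at a trough) and the Benjamini-Hochberg implementation of the FDR step are our assumptions about one reasonable realization.

% Trough-constrained randomization (assumed inputs: env, fs, iu_onsets, obsPPC)
[~, troughIdx] = findpeaks(-env(:), ...
    'MinPeakHeight',  -0.01, ...      % envelope value below 0.01 (assumed reading)
    'MinPeakDistance', 0.2 * fs);     % troughs at least 200 ms apart
troughTimes = troughIdx / fs;         % candidate onset positions, in seconds

nPerm     = 1000;
nIU       = numel(iu_onsets);         % assumes fewer IUs than detected troughs
surrogate = nan(nPerm, numel(obsPPC));
for p = 1:nPerm
    pick = troughTimes(randperm(numel(troughTimes), nIU));   % random trough subset
    surrogate(p, :) = ppc_spectrum(env, pick);                % surrogate spectrum
end

% One-sided p-value per frequency bin, then Benjamini-Hochberg FDR at q = 0.01
pvals = mean(surrogate >= obsPPC, 1);
[ps, order] = sort(pvals);
crit  = (1:numel(ps)) / numel(ps) * 0.01;
kmax  = find(ps <= crit, 1, 'last');
sig   = false(size(pvals));
if ~isempty(kmax), sig(order(1:kmax)) = true; end             % significant bins

In the actual procedure, surrogate spectra were averaged across the speakers of a language within each permutation before the comparison, as described above; the sketch shows the per-speaker case for brevity.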
3. Results
We studied the temporal structure of IU sequences through their alignment with the periodic components of the speech envelope, using a phase-consistency analysis. This analysis builds on a cognitively-oriented linguistic theory which is supported by empirical speech analysis with wide cross-linguistic validity. We hypothesized that one of the characteristics of IUs – the fast-slow dynamic of syllables – would give rise to slow periodic modulations in the speech envelope. Figure 2 displays the observed phase consistency spectra in the six sample languages. IU onsets appear at significantly consistent phases of the low-frequency components of the speech envelope, indicating that their rhythm is captured in the speech envelope, hierarchically above the syllabic rhythm at 5 Hz. We find the consistent phase relation in all six languages (English, Hebrew, Russian, Papuan Malay, Wooi and Yali) at approximately 1 Hz.
Phase-consistency analysis results. Shaded regions denote ±SEM across speakers. Significance is denoted by a horizontal line above the spectra, after correction for multiple comparisons across neighboring frequency bins using an FDR procedure. Inset: Probability distribution of IU durations within each language corpus, calculated for 50 ms bins and pooled across speakers.
We sought to confirm that this effect was not a result of an amplitude transient at the beginning of IU sequences. To this end, we repeated the analysis, submitting only IUs that followed an inter-IU interval below 1 s, which corresponded to 65% of the IUs in Wooi, and ∼80% of the IUs in the other five languages (Table 1). The consistency estimates at 1 Hz were still larger than expected under the null hypothesis that IUs lack a definite rhythmic structure (Figure S2).
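In the terms of the sketches given in Materials and Methods, this control amounts to re-running the analysis on a subset of onsets. The snippet below assumes iu_onsets and iu_offsets hold one speaker's IU boundary times in seconds, together with the hypothetical ppc_spectrum helper; taking the inter-IU interval as the gap from the previous IU's offset to the current onset is our reading of the text.

% Transient control: keep only IUs preceded by a gap shorter than 1 s
on   = iu_onsets(:);  off = iu_offsets(:);
gap  = [Inf; on(2:end) - off(1:end-1)];        % gap preceding each IU (first IU excluded)
ctrl = ppc_spectrum(env, on(gap < 1));          % re-run the analysis on the subset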
Our results are consistent with preliminary characterizations of the temporal structure of IUs (Chafe, 1987, 2018; Jun, 2005). The direct time measurements we used obviate the pitfalls of length measurements in word count or syllable count (e.g., Chafe, 1994; Himmelmann et al., 2018; Jun, 2005; Pawley & Syder, 2000). The temporal structure of IUs cannot be inferred from reported word counts, because what constitutes a word varies greatly across languages (Box 2). Syllable count per IU may provide an indirect estimation of IU length, especially if variation in syllable duration is taken into account (e.g., Silber-Varod & Levy, 2014), but it does not capture information about the temporal structure of IU sequences.
4. Discussion
Neuroscientific studies suggest that neural oscillations participate in segmenting the auditory signal and encoding linguistic units during speech perception. Many studies focus on the role of oscillations in the theta range (3-5 Hz) and in the gamma range (>40 Hz). The levels of segmentation attributed to these ranges are the syllable level and the fine-grain encoding of phonetic detail, respectively (Giraud & Poeppel, 2012). Studies have also identified slower oscillations, in the delta range (<2 Hz), and attributed them to segmentation at the level of phrases, both prosodic and semantic/syntactic (Bonhage, Meyer, Gruber, Friederici, & Mueller, 2017; Bourguignon et al., 2013; Ding, Melloni, Zhang, Tian, & Poeppel, 2016; Gross et al., 2013; Keitel, Ince, Gross, & Kayser, 2017; Meyer, Henry, Gaston, Schmuck, & Friederici, 2017; Park et al., 2015). Consistent with this, other findings demonstrated a decrease in high-frequency neural activity at points of semantic/syntactic completion (Nelson et al., 2017) or natural pauses between phrases (Hamilton, Edwards, & Chang, 2018). This pattern of results yields a slow modulation aligned to phrase structure. We harness linguistic theory to offer a conceptual framework for such slow modulations. We quantify the phase consistency between the slow modulations and IU onsets, and for the first time demonstrate that prosodic units with established functions in cognition give rise to a low-frequency rhythm in the auditory signal available to listeners.
Previous research has proposed to dissociate delta activity that represents acoustically-driven segmentation following prosodic phrases from delta activity that represents knowledge-based segmentation of semantic/syntactic phrases (Meyer, 2017). From the perspective of studying the temporal structure of spontaneous speech, we suggest that the distinction maintained between semantic/syntactic and prosodic phrasing might be superficial. This is because semantic/syntactic building blocks always appear within prosodic phrases in natural language use (Auer, Couper-Kuhlen, & Müller, 1999; Hopper, 1987; Kreiner & Eviatar, 2014; Mithun, 2009). Studies investigating semantic/syntactic building blocks often compare the temporal dynamics of intact grammatical structure to word lists or grammatical structure in an unfamiliar language (e.g., Bonhage et al., 2017; Ding et al., 2016; Nelson et al., 2017). We argue that such studies need to consider the possibility that ongoing processing dynamics reflect perceptual chunking, owing to the ubiquity of prosodic segmentation cues in natural language experience. This possibility is further supported by the fact that theoretically-defined semantic/syntactic boundaries are known to enhance the perception of prosodic boundaries. In a study that investigated the role of syntactic structure in guiding the perception of prosody in naturalistic speech (Cole, Mo, & Baek, 2010), syntactic structure was found to make an independent contribution to the perception of prosodic grouping. Another study equated prosodic boundary strength experimentally (parametrically controlling word duration, pitch contour, and following-pause duration), and found the same result: semantic/syntactic completion contributed to boundary perception (Buxó-Lugo & Watson, 2016). Even studies that use visual serial word presentation paradigms rather than auditory stimuli are not immune to an interpretation of prosodically-guided perceptual chunking, which is known to affect silent reading (for a review, see Breen, 2014; Fodor, 1998).
Independent of whether delta activity in the brain of the listener represents acoustic landmarks, abstract knowledge or the prosodically-mediated embodiment of abstract knowledge (Kreiner & Eviatar, 2014), our results point to another role for slow rhythmic brain activity (1 Hz). We find that speakers express their developing ideas at a rate of approximately 1 Hz, regardless of divergent grammatical systems, speakers and speech modes. Previous studies have shown that in the brains of listeners, a wide network interacts with low-frequency auditory-tracking activity, suggesting an interface of prediction- and attention-related processes, memory and the language system (Kayser, Ince, Gross, & Kayser, 2015; Keitel et al., 2017; Park et al., 2015; Piai et al., 2016; Schroeder & Lakatos, 2009). We expect that via such low-frequency interactions, this same network constrains spontaneous speech production, orchestrating the management and communication of conceptual foci (Chafe, 1994).
Acknowledgements
MI is supported by the Humanities Fund PhD program in Linguistics and the Jack, Joseph and Morton Mandel School for Advanced Studies in the Humanities. Our work would not have been possible without the tremendous efforts carried out by the creators of the corpora and their teams. We deeply thank them and the many people they recorded all over the world.
Appendix A: Supplementary Material
Data
We analyzed data from six corpora of spontaneous speech that fall into two groups.
The first group consists of corpora in English, Hebrew and Russian, transcribed and segmented by professional teams, each working on their native language: The Santa Barbara Corpus of Spoken American English (Du Bois et al., 2005), The Haifa Corpus of Spoken Hebrew (Maschler et al., 2017), and The Prosodically Annotated Corpus of Spoken Russian (Kibrik & Podlesskaya, 2014). We constructed a subsample from each of the three corpora according to the following protocol: 10 recordings of conversations were sampled randomly from the English and Hebrew corpora. The Russian corpus includes narratives rather than conversations, so 20 recordings were sampled randomly to cover a comparably large number of speakers. The English and Hebrew corpora include transcriptions segmented into intonation units (IUs) that are not timestamped to the audio file. Author MI manually measured the IU-onset and -offset times in the first 30 seconds of each recording, continuing until the next speaker change. Pauses between IUs were not considered as part of the neighboring IUs’ duration, in deviation from the transcription guidelines of Du Bois et al. (1992). Spectrogram analyses in Praat 6.0.23 (Boersma & Weenink, 2016) accompanied this procedure, including pitch and intensity contours produced by Praat’s default settings. Measurements in milliseconds were entered into ELAN 4.9.4 (Wittenburg, Brugman, Russel, Klassman, & Sloetjes, 2006), on separate tiers for different speakers. The Russian corpus includes the onset and offset times of IUs, measured in a similar fashion. The 20 sampled recordings were analyzed in full. Finally, in English and Hebrew, additional information was recorded for each IU but was not further analyzed: the intonation contour attributed by the corpus constructors, the position within the turn, and a categorization into IU type – Fragmentary, Regulatory or Substantive (Chafe, 1994).
The second group consists of corpora in Papuan Malay, Wooi and Yali: The DoBeS Summits-PAGE Collection of Papuan Malay (Himmelmann & Riesberg, 2016), The DoBeS Wooi Documentation (Kirihio et al., 2015), and The DoBeS Yali Documentation (Riesberg et al., 2016). Papuan Malay (ISO 639-3 code: pmy) is the lingua franca of West Papua, the western half of the island of New Guinea governed by Indonesia. Wooi (ISO 639-3 code: wbw) and Yali (ISO 639-3 code: yac) belong to different language families, have very different grammatical profiles, and are spoken in different regions in West Papua – on the coast and in the highlands, respectively. All three corpora include retellings of the plot of a silent film by a narrator to an interlocutor. More data on these languages and corpora can be found in Himmelmann et al. (2018), a study dedicated to the ability to identify IUs in unfamiliar languages. Researchers with varying degrees of familiarity with the languages transcribed them, and the agreement between them was quantified and found to be as high as their agreement when segmenting familiar languages. Nikolaus Himmelmann kindly provided us with the consensus transcriptions that the team constructed, which were already entered into ELAN (Wittenburg et al., 2006) and time-aligned to the recordings. We retrieved the respective audio files from the DoBeS archive. The Eastern Indonesia data in Himmelmann et al. (2018) included an additional language that we did not attempt to analyze because it comprised only two recordings. Additionally, we could not retrieve the audio file of one of the original 6 narratives from the Yali consensus set, so only 5 Yali recordings were analyzed.
Data were extracted from all the ELAN files, with the aid of custom-written scripts in R 3.3.3 (R Core Team, 2017), into data frames containing the IU onset and offset times, the computed duration, and the speaker of each IU. The remaining analyses were performed in MATLAB (version 9.2.0.5, R2017a), using custom-written scripts and the FieldTrip toolbox (Oostenveld, Fries, Maris, & Schoffelen, 2011).
Speech envelope computation
The amplitude envelope was computed for the audio file of each speech segment following the methods used in several studies demonstrating neural entrainment to the envelope (e.g., Gross et al., 2013), methods that partly follow Chandrasekaran et al. (2009).
Speech segments that were recorded in stereo were converted to mono by averaging the channels. Recordings with a sampling rate above 20 kHz were downsampled to 20 kHz. Speech segments were band-pass filtered into 10 bands between 200 Hz and half the audio file’s sampling frequency, with cut-off points designed to be equidistant on the human cochlear map. Amplitude envelopes for each band (the narrowband envelopes) were computed as absolute values of the Hilbert transform. These narrowband envelopes were downsampled to 1000 Hz and subsequently averaged, yielding the wideband envelope. The wideband envelope was smoothed using a 50 ms sliding Gaussian filter and divided by its maximal value to be on a scale of 0-1.
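A possible MATLAB realization of this pipeline is sketched below, assuming a mono audio vector x already downsampled to 20 kHz. The Greenwood cochlear map used to place the band edges, the 4th-order Butterworth filters, and the width and shape of the Gaussian smoothing kernel are our assumptions; the original scripts may have used different band-edge formulas, filters or kernel parameters.

% Wideband envelope sketch (assumed input: mono signal x at fsAudio = 20 kHz)
fsAudio = 20000;
fsEnv   = 1000;                                   % target envelope rate
nBands  = 10;
fLow    = 200;
fHigh   = 0.999 * fsAudio / 2;                    % just below the Nyquist frequency

gw    = @(d) 165.4 * (10.^(2.1 * d) - 0.88);      % Greenwood map (assumed), Hz
gwInv = @(f) log10(f / 165.4 + 0.88) / 2.1;
edges = gw(linspace(gwInv(fLow), gwInv(fHigh), nBands + 1));   % cochlear-spaced edges

nbLen    = ceil(numel(x) * fsEnv / fsAudio);
envBands = zeros(nBands, nbLen);
for b = 1:nBands
    [bf, af] = butter(4, edges(b:b+1) / (fsAudio / 2), 'bandpass');
    nb = abs(hilbert(filtfilt(bf, af, x(:))));    % narrowband amplitude envelope
    envBands(b, :) = resample(nb, fsEnv, fsAudio);% downsample to 1000 Hz
end
env = mean(envBands, 1).';                        % wideband envelope (column vector)

g   = gausswin(round(0.05 * fsEnv));              % ~50 ms Gaussian kernel (assumed shape)
env = conv(env, g / sum(g), 'same');              % smooth
env = env / max(env);                             % scale to 0-1

The resulting env variable is the wideband envelope assumed as input in the phase-consistency sketches of Appendix A and Materials and Methods.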
Appendix B: Additional Analyses
Characterization of the temporal structure of another prosodic unit
We conducted an identical analysis for prosodic units in an additional corpus, Rhapsodie: A Prosodic-Syntactic Treebank for Spoken French (Lacheret et al., 2014). This corpus includes a semi-automatic, timestamped segmentation into prosodic units. The segmentation used a different combination of rhythmic and melodic cues, most notably relying on the notions of prosodic prominence and disfluency, and assuming that these define different degrees of prosodic cohesion. The different degrees of prosodic cohesion yield a hierarchy of prosodic units, from which we extracted the level of rhythmic groups, delimited by weak or strong prominences on phonological word-final syllables that are non-disfluent. Below rhythmic groups in the hierarchy of prosodic units are metrical feet (delimited by all non-disfluent prominences, whether or not word-final, and hence likely not corresponding to IU-final boundaries). Above rhythmic groups in the hierarchy of prosodic units are intonation packages (delimited by strong prominences). The gradient prominence annotation is based on unquantified auditory inspection. It is likely based on pitch, volume, and duration variations, the latter of which might be compatible with the notion of word-final lengthening used in the segmentation of IUs. We selected the level of rhythmic groups to impose the minimal requirement that the prominence be phonological word-final, noting that the prominence annotation does not entail that all phonological words end in prominence. Information regarding the French sample may be found in Table 2.
Summary information on the sample of speech segments used in the analysis of the French corpus.
The analysis of the French corpus provided a unique opportunity to test the notion that prosodic boundaries arise, among other criteria, from the fast-slow dynamic of syllables. Unlike the six IU-segmented corpora used in the main analysis, the French corpus includes a time-aligned annotation of syllables. Therefore, we could use actual syllable onset times for the temporal permutation of IU onsets instead of indirectly estimating syllable onsets based on troughs in the envelope. The results of this analysis are presented in Figure S1.
Phase-consistency analysis results. Shaded regions (red) denote ±SEM across speakers. Significance was assessed for the 1 Hz frequency component and is denoted by a horizontal line above the peak (uncorrected). Inset: Probability distribution of IU durations within the language corpus, calculated for 50 ms bins and pooled across speakers.
(item corresponding to Figure 2). Phase-consistency analysis results. Shaded regions denote ±SEM across speakers. Significance was assessed for the 1 Hz frequency component of each spectrum and is denoted by a horizontal line above the peak (uncorrected). Inset: Probability distribution of all inter-IU-interval durations within each language corpus, calculated for 50 ms bins and pooled across speakers.