Hearing Research

Volume 305, November 2013, Pages 113-134

Review
Syllabic (∼2–5 Hz) and fluctuation (∼1–10 Hz) ranges in speech and auditory processing

https://doi.org/10.1016/j.heares.2013.08.017

Abstract

Given recent interest in syllabic rates (∼2–5 Hz) for speech processing, we review the perception of “fluctuation” range (∼1–10 Hz) modulations during listening to speech and technical auditory stimuli (AM and FM tones and noises, and ripple sounds). We find evidence that the temporal modulation transfer function (TMTF) of human auditory perception is not simply low-pass in nature, but rather exhibits a peak in sensitivity in the syllabic range (∼2–5 Hz). We also address human and animal neurophysiological evidence, and argue that this bandpass tuning arises at the thalamocortical level and is more associated with non-primary regions than primary regions of cortex. The bandpass rather than low-pass TMTF has implications for modeling auditory central physiology and speech processing: this implicates temporal contrast rather than simple temporal integration, with contrast enhancement for dynamic stimuli in the fluctuation range.

This article is part of a Special Issue entitled “Communication Sounds and the Brain: New Directions and Perspectives”.

Introduction

A theme of this special issue is the role of vocalizations as stimuli in auditory neuroscience. Vocalizations can be considered as part of a larger class of communication signals used by other species and man-made devices, which by necessity exhibit modulations. As Picinbono (1997) states: “Let us remember that a purely monochromatic signal such as a cos(ωt + ϕ) cannot transmit any information. For this purpose, a modulation is required, …” Likewise, unmodulated noise cannot transmit any information, so we can expect on a priori grounds a link between AM/FM (amplitude/frequency modulation) studies and speech studies (Rosen, 1992). In fact, the same auditory regions involved in speech processing are strongly activated by AM/FM sounds. For example, the non-primary cortical areas most activated for AM/FM processing in the syllabic (∼2–5 Hz) range are also implicated in pathways for intelligible speech (Scott et al., 2006, Hall, 2012) (Section 4). Thus, we have chosen as our contribution to “Communication Sounds in the Brain” a new consideration of AM/FM processing with relevance to speech. A premise of this review is that careful study of AM/FM results will lead to insights for speech processing.
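
For readers who want these stimuli in hand, the following minimal sketch synthesizes an AM tone and an FM tone with a syllabic-range (4 Hz) modulator; the carrier frequency, modulation depth, and frequency deviation are illustrative choices of ours, not values taken from the studies reviewed here.

```python
# Minimal AM/FM stimulus sketch (illustrative parameters, NumPy only).
import numpy as np

fs = 16000                        # sample rate (Hz)
t = np.arange(0, 1.0, 1 / fs)     # 1 s of time
fc, fmod = 1000.0, 4.0            # carrier and syllabic-range modulator (Hz)

# AM tone: carrier scaled by a DC-offset sinusoidal envelope (depth m = 1).
m = 1.0
am_tone = (1 + m * np.sin(2 * np.pi * fmod * t)) * np.sin(2 * np.pi * fc * t)

# FM tone: phase chosen so the instantaneous frequency is
# fc + df * sin(2*pi*fmod*t), i.e. a 4 Hz vibrato around the carrier.
df = 50.0                         # frequency deviation (Hz)
fm_tone = np.sin(2 * np.pi * fc * t
                 - (df / fmod) * np.cos(2 * np.pi * fmod * t))
```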

Recent reviews (Joris et al., 2004, Malone and Schreiner, 2010) cover well the periodicity pitch range surrounding voice fundamental frequency (F0, ∼50–500 Hz), and there is a well-established speech processing literature on extracting F0. The roughness range (∼25–125 Hz) has also been studied extensively and treated well in recent reviews. However, the slower ranges of AM/FM (to be termed the ‘fluctuation’ range, ∼1–10 Hz) are traditionally understudied. We will also find that the neural systems most strongly implicated in fluctuation perception – the ‘belt’ and ‘parabelt’ regions of the CNS – are far less studied than ‘core’ regions (as commented by Goldstein and Knight, 1980, Hall, 2005). In parallel, the slower aspects of speech – syllabic time scales, prosody, stress, intonation, emotional aspects, etc. – are understudied relative to the spectrotemporally-detailed aspects for phonetic purposes. In further parallel, algorithmic approaches to speech processing have only rarely (and more recently) focused on longer time scales. Given the recent interest in syllabic time scales (∼2–5 Hz) for speech perception, human neurophysiology, and computer speech processing (Hall, 2005, Greenberg, 2006, Ghitza and Greenberg, 2009, Giraud and Poeppel, 2012, Obleser et al., 2012, Peelle and Davis, 2012), we have chosen to review these time scales in more basic studies of auditory perception and physiology. This is not a comprehensive review of AM/FM sounds, rather a focus on the fluctuation (∼1–10 Hz) range and the corresponding time scales of speech. Before embarking on our review, we offer our thoughts on the theme for this special issue.

In the exploratory phase of empirical data gathering, speech is a useful stimulus because, amongst other things, it elicits robust activations throughout the auditory nervous system. These yield overall observations directly concerning the stimulus set of interest, which our eventual models must explain. However, the empirical observations available to us – a variety of auditory stations in various species under various anesthetics, using various particular synthetic or natural speech sounds for a given study – do not allow us to easily perceive the essential patterns to be included in the model building exercise. Even a complete catalog of each auditory station responding to each possible phoneme or speech sound would likely remain inadequate. On the other hand, technical stimuli (AM/FM) can be arrayed systematically according to a single parameter (modulation frequency) and related directly to communication theory and signals/systems theory. This obviously accelerates the model building exercise during the difficult early phases, when even the overall layout and essential features of the models are still in question. However, we find that speech is an essential stimulus again in the final stages of model building – the final selection of model structure and specification of model parameters. Since speech is taken to be the stimulus set of interest, the final least-squares or other fit should be determined by the use of speech stimuli whenever possible. We note in this context that speech is usually ‘sufficiently exciting’, which is a mathematical requirement in system identification (Ljung, 1999), and essentially means that speech is sufficiently rich in spectrotemporal features to cover the signal space of interest. In some contexts, where the modeler has already chosen a certain model structure – for example, the spectro-temporal receptive field (STRF) – one can usefully skip straight to the use of speech as a stimulus for a final least-squares fit of model parameters. But as we seek more realistic models of CNS function, with new aspects to exploit for speech processing applications, we may require ongoing use of technical stimuli.
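
As a concrete (and deliberately simplified) illustration of that final least-squares fitting step, the sketch below estimates an STRF by ridge regression of a response trace onto a time-lagged stimulus spectrogram. The function name, array shapes, and regularization value are our own assumptions for illustration, not the procedure of any particular study cited here.

```python
# Illustrative ridge (regularized least-squares) fit of an STRF from a
# stimulus spectrogram and a measured response trace.
import numpy as np

def fit_strf(spec, resp, n_lags, lam=1.0):
    """spec: (n_times, n_freqs) spectrogram; resp: (n_times,) response."""
    n_times, n_freqs = spec.shape
    # Design matrix of time-lagged spectrogram frames (zero-padded history).
    X = np.zeros((n_times, n_lags * n_freqs))
    for lag in range(n_lags):
        X[lag:, lag * n_freqs:(lag + 1) * n_freqs] = spec[:n_times - lag]
    # Ridge solution: w = (X'X + lam*I)^(-1) X'y
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ resp)
    return w.reshape(n_lags, n_freqs)  # STRF indexed as (lag, frequency)
```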

We are at a point in history where we have good models of the auditory periphery for most speech processing purposes. By a ‘good’ model is meant one with broad explanatory power, accurate predictions for arbitrary inputs, and as few parameters as possible (the principle of parsimony). Adequate models of the cochlear nucleus appear to be arriving or nearly on the horizon, but this still places us some distance from a complete computational model of the auditory CNS. Before arriving at a full physiological model, we can hope to arrive at simpler models which are considerably abstracted from actual physiological details (yet including as much physiological insight as possible). In order to approach the modeling problem for auditory CNS, certain simplifications are useful or necessary at the early stages. First, we can ignore binaural/spatial aspects in a first model for speech processing purposes (other than the multispeaker situation, where binaural cues are essential, Cherry, 1953, Bregman, 1990, Schimmel et al., 2008). However, more severe simplifications appear to be required in order to relate psychophysics, human neuroscience, animal neurophysiology, and computer speech processing together in a comprehensible way.

We suggest that studies of modulated (AM/FM) sounds may serve as an intermediate stage, before final model specifications, in the long-term goals of speech neurophysiology and modeling. Extensive bodies of work are already available concerning AM/FM sounds in communication theory, signals and systems theory, human psychophysics, and animal neurophysiology. For a given modulation type, there is a systematic space of signals controlled by a single parameter (modulation frequency), allowing unambiguous mapping across research domains into a single orderly framework. While this is still not a sufficiently complicated space to understand all aspects of speech, it goes a long way in the right direction compared to clicks and tones, the traditional technical stimuli. The fact that many workers have adopted modulation filter banks (Kay and Matthews, 1972, Dau et al., 1997) or related approaches (Greenberg and Kingsbury, 1997), i.e. adding to the spectral and temporal dimensions a modulation dimension (Atlas and Shamma, 2003, Singh and Theunissen, 2003), indicates the utility of having a stimulus set which can be systematically ordered along the modulation frequency axis (as opposed to various random stimuli).
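
To make the added modulation dimension concrete, here is a rough sketch of a joint acoustic/modulation-frequency analysis in the spirit of modulation filter banks: split the signal into acoustic bands, extract each band's Hilbert envelope, and take the spectrum of that envelope. The band edges and filter orders are our illustrative assumptions, not those of Dau et al. (1997) or Atlas and Shamma (2003).

```python
# Rough joint acoustic/modulation-frequency analysis (assumed band edges;
# requires fs > 12.8 kHz for the highest band shown here).
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def modulation_spectrum(x, fs, bands=((100, 400), (400, 1600), (1600, 6400))):
    rows = []
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype='bandpass', fs=fs, output='sos')
        env = np.abs(hilbert(sosfiltfilt(sos, x)))     # band envelope
        rows.append(np.abs(np.fft.rfft(env - env.mean())))
    mod_freqs = np.fft.rfftfreq(len(x), 1 / fs)        # modulation axis (Hz)
    return mod_freqs, np.array(rows)                   # (band, mod-frequency)
```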

AM/FM sounds also have the advantage of lacking spectral structure. We noted that severe simplification is often required at early model building stages, such as ignoring binaural/spatial processing. We can also ignore spectral-domain pitch processing for AM/FM sounds below the range where periodicity pitch is elicited (below ∼50 Hz), where the resulting spectral structure is not resolvable by the ear. Thus, we can ignore two-tone interaction, lateral inhibition, and other complexities of cross-spectral processing. Results discussed below (Sections 2 Basic auditory percepts for AM and FM sounds, 4 Human neurophysiology) indicate that spectro-temporal processing is to a first approximation separable, such that spectral and temporal processing studied separately can be recombined to predict spectro-temporal results. Before auditory CNS models become available for arbitrary signals, preliminary models to account for temporal stimuli are likely to appear. Since speech can be understood by temporal cues alone (Shannon et al., 1995), this further suggests that study of temporal processing in isolation from complex spectral structure may serve as a first approximation for preliminary models. However, as we argued above, these models should then be tested for parameter specification by use of natural speech signals when possible.
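
The Shannon et al. (1995) demonstration that speech can be understood by temporal cues alone is straightforward to emulate with a noise vocoder, which discards spectral fine structure and keeps only band envelopes. A minimal sketch follows, with band edges and envelope smoothing chosen by us for illustration only.

```python
# Minimal noise vocoder in the spirit of Shannon et al. (1995): keep only
# the temporal envelope in a few broad bands and re-impose it on noise.
# Band edges and the 16 Hz envelope cutoff are illustrative assumptions.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(x, fs, edges=(100, 800, 1500, 2500, 4000)):
    rng = np.random.default_rng(0)
    env_sos = butter(2, 16.0, btype='lowpass', fs=fs, output='sos')
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype='bandpass', fs=fs, output='sos')
        env = sosfiltfilt(env_sos, np.abs(hilbert(sosfiltfilt(sos, x))))
        carrier = sosfiltfilt(sos, rng.standard_normal(len(x)))
        out += np.clip(env, 0.0, None) * carrier   # envelope-modulated noise
    return out
```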

Finally, there is abundant evidence that the same basic auditory percepts experienced during listening to modulated sounds (‘fluctuation’, ‘roughness’, ‘periodicity pitch’) are also experienced when the same modulation frequencies are present in the speech signal. Voice fundamental frequency (F0), from glottal pulse rate, elicits the same basic pitch sensation as periodic clicks or AM/FM stimuli of the same frequency. Glottal shimmer (AM) and jitter (FM) result in roughness range (∼25–125 Hz) modulations, and correspondingly elicit a perception of roughness in the voice (Wendhal, 1966a, Wendhal, 1966b, Coleman, 1971). However, tremolo (AM) and vibrato (FM) in the voice occur below the roughness range (∼2–20 Hz, usually ∼7 Hz), and generally sound pleasing and form part of musical technique (Seashore, 1936, Potter et al., 1947). Thus, perception of AM/FM sounds directly predicts perception of vocalizations, insofar as the same modulation rates are present. In Section 2 we consider these basic percepts for AM/FM sounds, where it should be kept in mind that these are the same basic percepts experienced during listening to vocalization stimuli.

Section snippets

Basic auditory percepts for AM and FM sounds

Before narrowing our focus to the fluctuation range (∼1–10 Hz), we set up the context of the full range of AM/FM percepts.

AM detectability

Having established the overall percepts for AM and FM sounds, we focus in this section on the fluctuation range (∼1–10 Hz). Specifically, we will survey a body of evidence in support of the central claim of this review – that human auditory perception exhibits a tuning to modulations occurring within the fluctuation range, peaking broadly at ∼2–5 Hz. This is similar to the typical syllabic rate of speech, to be discussed in Section 3.3.

To preview this claim, we make another plot in the form of …
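
One computational reading of the bandpass claim is that the envelope is passed through a modulation-domain band-pass filter (temporal contrast) rather than a low-pass filter (temporal integration). The sketch below contrasts the two on an envelope sampled at 100 Hz; the 2–5 Hz passband and filter orders are illustrative assumptions on our part, not a fitted model of the TMTF.

```python
# Low-pass (temporal integration) vs. band-pass (temporal contrast)
# filtering of an amplitude envelope; the 2-5 Hz passband is an
# illustrative stand-in for the ~2-5 Hz perceptual peak, not a fit.
from scipy.signal import butter, sosfilt

FS_ENV = 100  # envelope sample rate (Hz), an assumption for this sketch
LOWPASS = butter(1, 10.0, btype='lowpass', fs=FS_ENV, output='sos')
BANDPASS = butter(1, [2.0, 5.0], btype='bandpass', fs=FS_ENV, output='sos')

def integrate_vs_contrast(env):
    """env: amplitude envelope sampled at FS_ENV."""
    integrated = sosfilt(LOWPASS, env)    # smooths; passes DC and slow drift
    contrasted = sosfilt(BANDPASS, env)   # rejects DC, emphasizes 2-5 Hz swings
    return integrated, contrasted
```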

Human neurophysiology

Human auditory EEG, MEG, PET, and fMRI studies of modulated sounds focus overwhelmingly on the periodicity pitch and roughness ranges. Scalp EEG and MEG studies in the 1980s and 1990s, later followed by fMRI studies, focused on the responses to 40-Hz repetitive or AM stimuli. This was driven initially by clinical and basic research interest (Galambos et al., 1981, Sheer, 1989), and then by the post-Singer (1992) interest in synchrony and 40-Hz. Overwhelmingly, these studies only included AM …

Animal neurophysiology

Based on the human fMRI evidence, two expectations for (unanesthetized) primate cortex are: (1) primary regions exhibit some ∼2–5 Hz AM/FM tuning, but also tuning up to ∼20 Hz or more, with the peak at ∼5–10 Hz; and (2) non-primary regions, particularly those lying lateral to Heschl's gyrus (HG), are generally slower and appear to express the ∼2–5 Hz peak in AM/FM tuning more strongly.

Several excellent reviews exist for general AM/FM results in animal neurophysiology (Kay, 1982, Langner, 1992, Joris et al., 2004, Wang …)

Signal processing significance

We conclude by briefly considering the signal processing significance of the observed fluctuation range (∼1–10 Hz) tuning. In speech processing for automatic speech recognition (ASR), the task of separating syllabic or phonemic units from the continuous speech stream is known as automatic segmentation, and usually relies on measures of spectral change (Sakai and Doshita, 1963, Tappert, 1972) or AM maxima/minima (Mermelstein, 1975, Reddy, 1976, Zwicker et al., 1979). Note that these measures are applied to the output of the …
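
A minimal sketch of the AM-minima style of automatic segmentation mentioned above (in the spirit of Mermelstein, 1975): smooth the wideband envelope down to the fluctuation range, then take pronounced local minima as candidate syllable boundaries. The smoothing cutoff, minimum spacing, and depth threshold here are our illustrative assumptions, not parameters from the cited work.

```python
# AM-minima segmentation sketch: the 10 Hz smoothing cutoff, 50 ms
# minimum spacing, and depth threshold are illustrative assumptions.
import numpy as np
from scipy.signal import argrelextrema, butter, sosfiltfilt, hilbert

def syllable_boundaries(x, fs, cutoff=10.0):
    env = np.abs(hilbert(x))                            # wideband envelope
    sos = butter(2, cutoff, btype='lowpass', fs=fs, output='sos')
    smooth = sosfiltfilt(sos, env)                      # fluctuation-range envelope
    minima = argrelextrema(smooth, np.less, order=int(0.05 * fs))[0]
    deep = minima[smooth[minima] < 0.5 * smooth.mean()] # keep deep dips only
    return deep / fs                                    # boundary times (s)
```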

Acknowledgments

We thank Brian Malone for helpful discussion and comments on the manuscript, and Dora Hermes for helpful discussion at early stages. This work was supported by NINDS fellowship F32-NS061616 (EE) and NIH grants R00-NS065120, R01-DC012379, DP2 OD008627 (EC).

References (205)

  • D.M. Green. Minimum integration time.
  • J.H. Kaas et al. Auditory processing in primate cerebral cortex. Curr. Opin. Neurobiol. (1999).
  • H.W. Ades. Central auditory mechanisms.
  • S.J. Aiken et al. Human cortical responses to the speech envelope. Ear Hear. (2008).
  • L.M. Aitkin. The Auditory Midbrain: Structure and Function in the Central Auditory Pathway (1986).
  • L.M. Aitkin et al. Frequency representation in auditory cortex of the common marmoset (Callithrix jacchus jacchus). J. Comp. Neurol. (1986).
  • R.A. Andersen et al. The thalamocortical and corticothalamic connections of AI, AII, and the anterior auditory field (AAF) in the cat: evidence for two largely segregated systems of connections. J. Comp. Neurol. (1980).
  • T. Arai et al. Intelligibility of speech with filtered time trajectories of spectral envelopes (1996).
  • T. Arai et al. Syllable intelligibility for temporally filtered LPC cepstral trajectories. J. Acoust. Soc. Am. (1999).
  • B.S. Atal. Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. J. Acoust. Soc. Am. (1974).
  • L.E. Atlas et al. Joint acoustic and modulation frequency. EURASIP J. Adv. Signal Process. (2003).
  • S.P. Bacon et al. Temporal modulation transfer functions in normal-hearing and hearing-impaired listeners. Audiology (1985).
  • S.H. Bartley. Some factors in brightness discrimination. Psychol. Rev. (1939).
  • S. Baumann et al. Orthogonal representation of sound dimensions in the primate midbrain. Nat. Neurosci. (2011).
  • S. Baumann et al. A unified framework for the organization of the primate auditory cortex. Front. Syst. Neurosci. (2013).
  • D. Bendor et al. Neural response properties of primary, rostral, and rostrotemporal core fields in the auditory cortex of marmoset monkeys. J. Neurophysiol. (2008).
  • H. Berger. Über das Elektrenkephalogramm des Menschen. II. J. Psychol. Neurol. (1930).
  • A. Bieser et al. Auditory responsive cortex in the squirrel monkey: neural responses to amplitude-modulated sounds. Exp. Brain Res. (1996).
  • F.A. Bilsen et al. Repetition pitch mediated by temporal fine structure at dominant spectral regions. Acustica (1967).
  • F.A. Bilsen et al. Atonal periodicity sensation for comb filtered noise signals.
  • J.R. Binder et al. Functional magnetic resonance imaging of human auditory cortex. Ann. Neurol. (1994).
  • A. Boemio et al. Hierarchical and asymmetric temporal sensitivity in human auditory cortices. Nat. Neurosci. (2005).
  • A.S. Bregman. Auditory Scene Analysis: the Perceptual Organization of Sound (1990).
  • F. Bremer. Analyse oscillographique des réponses sensorielles des écorces cérébrales et cérébelleuse. Rev. Neurol. (1952).
  • E.M. Burns et al. Nonspectral pitch. J. Acoust. Soc. Am. (1976).
  • E.M. Burns et al. Played-again SAM: further observations on the pitch of amplitude-modulated noise. J. Acoust. Soc. Am. (1981).
  • M.B. Calford et al. Ascending projections to the medial geniculate body of the cat: evidence for multiple, parallel auditory pathways through thalamus. J. Neurosci. (1983).
  • E.F. Chang et al. Development of spectral and temporal response selectivity in the auditory cortex. Proc. Natl. Acad. Sci. U. S. A. (2005).
  • H.-T. Chang. Some observations on the excitability changes of cortical and subcortical neurons and their possible significance in the process of conditioning. Electroencephalogr. Clin. Neurophysiol. Suppl. (1960).
  • E.C. Cherry. Some experiments on the recognition of speech, with one and with two ears. J. Acoust. Soc. Am. (1953).
  • T.-S. Chi et al. Spectro-temporal modulation transfer functions and speech intelligibility. J. Acoust. Soc. Am. (1999).
  • P. Clark et al. Time-frequency coherent modulation filtering of nonstationary signals. IEEE Trans. Signal Process. (2009).
  • R.F. Coleman. Effect of waveform changes upon roughness perception. Folia Phoniatr. (1971).
  • N.E. Crone et al. Functional mapping of human sensorimotor cortex with electrocorticographic spectral analysis. II. Event-related synchronization in the gamma band. Brain (1998).
  • G. Curio. Linking 600-Hz “spikelike” EEG/MEG wavelets (“sigma-bursts”) to cellular substrates: concepts and caveats. J. Clin. Neurophysiol. (2000).
  • C.W. Darrow et al. Relation of electroencephalogram to photometrically observed vasomotor changes in the brain. J. Neurophysiol. (1945).
  • T. Dau et al. Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers. J. Acoust. Soc. Am. (1997).
  • H. Davis et al. Changes in human brain potentials during the onset of sleep. Science (1937).
  • D.A. Depireux et al. Spectro-temporal response field characterization with dynamic ripples in ferret primary auditory cortex. J. Neurophysiol. (2001).
  • R. Drullman et al. Effect of reducing slow temporal modulations on speech reception. J. Acoust. Soc. Am. (1994).