Abstract
Language and music are two human-unique capacities whose relationship remains debated. Some have argued for overlap in processing mechanisms, especially for structure processing. Such claims often concern the inferior frontal component of the language system located within ‘Broca’s area’. However, others have failed to find overlap. Using a robust individual-subject fMRI approach, we examined the responses of language brain regions to music stimuli, and probed the musical abilities of individuals with severe aphasia. Across four experiments, we obtained a clear answer: music perception does not engage the language system, and judgments about music structure are possible even in the presence of severe damage to the language network. In particular, the language regions’ responses to music are generally low, often below the fixation baseline, and never exceed responses elicited by non-music auditory conditions, like animal sounds. Further, the language regions are not sensitive to music structure: they show low responses to intact and structure-scrambled music, and to melodies with vs. without structural violations. Finally, in line with past patient investigations, individuals with aphasia who cannot judge sentence grammaticality perform well on melody well-formedness judgments. Thus the mechanisms that process structure in language do not appear to process music, including music syntax.
Introduction
To interpret language or appreciate music, we must understand how different elements—words in language, notes and chords in music—relate to each other. Parallels between the structural properties of language and music have been drawn for over a century (e.g., Riemann 1877, as cited in Swain 1995; Lindblom and Sundberg 1969; Fay 1971; Boiles 1973; Cooper 1973; Bernstein 1976; Sundberg and Lindblom 1976; Lerdahl and Jackendoff 1977, 1983; Roads and Wieneke 1979; Krumhansl and Keil 1982; Baroni et al. 1983; Swain 1995; cf. Jackendoff 2009; Temperley 2022). However, the question of whether music processing relies on the same mechanisms as those that support language processing continues to spark debate.
The empirical landscape is complex. A large number of studies have argued for overlap in structural processing based on behavioral (e.g., Fedorenko et al. 2009; Slevc et al. 2009; Hoch et al. 2011; Van de Cavey and Hartsuiker 2016; Kunert et al. 2016), ERP (e.g., Janata 1995; Patel et al. 1998; Koelsch et al. 2000), MEG (e.g., Maess et al. 2001), fMRI (e.g., Koelsch et al. 2002; Levitin and Menon 2003; Tillmann et al. 2003; Koelsch 2006; Kunert et al. 2015; Musso et al. 2015), and ECoG (e.g., Sammler et al. 2009, 2013; Rietmolen et al. 2022) evidence (see Tillman 2012; Kunert and Slevc 2015; LaCroix et al. 2016, for reviews). However, we would argue that no prior study has compellingly established reliance on shared syntactic processing mechanisms in language and music.
First, evidence from behavioral, ERP, and, to a large extent, MEG studies is indirect because they do not allow to unambiguously determine where neural responses originate (in ERP and MEG, this is due to the ‘inverse problem’; Tarantola 2004; Baillet et al. 2014).
Second, the bulk of the evidence comes from structure-violation paradigms. In such paradigms, responses to the critical condition—which contains an element that violates the rules of tonal music—are contrasted with responses to the control condition, where stimuli obey the rules of tonal music. (For language, syntactic violations, like violations of number agreement, are often used.) Because structural violations (across domains) constitute unexpected events, a brain region that responds more strongly to the structure-violation condition than the control (no violation) condition may support structure processing in music, but it may also reflect domain-general processes, like attention or error detection/correction (e.g., Bigand et al. 2001; Poulin-Charronat et al. 2005; Tillmann et al. 2006; Hoch et al. 2011; Perruchet and Poulin-Charronnat 2013) or low-level sensory effects (e.g., Bigand et al. 2014; Collins et al. 2014; cf. Koelsch et al. 2007). In order to argue that a brain region that shows a structure-violation > no violation effect supports structure processing in music, one would need to establish that this brain region i) is selective for structural violations and does not respond to unexpected non-structural (but similarly salient) events in music or other domains, and ii) responds to music stimuli even when no violation is present. This latter point is (surprisingly) not often discussed but is deeply important: if a brain region supports the processing of music structure, it should be engaged whenever music is processed (similar to how language areas respond robustly to well-formed sentences, in addition to showing sensitivity to violated linguistic expectations; e.g., Fedorenko et al. 2020). After all, in order to detect a structural violation, a brain region needs to process the structure of the preceding context, which implies that it should be working whenever a music stimulus is present. No previous study has established both of the properties above—selectivity for structural relative to non-structural violations and robust responses to music stimuli with no violations—for the brain regions that have been argued to support structure processing in music (and to overlap with regions that support structure processing in language). In fact, some studies that have compared unexpected structural and non-structural events in music (e.g., a timbre change) have reported similar neural responses in fMRI (e.g., Koelsch et al. 2002; cf. some differences in EEG effects – e.g., Koelsch et al. 2001). Relatedly, and in support of the idea that effects of music structure violations largely reflect domain-general attentional effects, meta-analyses of neural responses to unexpected events across domains (e.g., Corbetta and Shulman 2002; Fouragnan et al. 2018; Corlett et al. 2021) have identified regions that grossly resemble those reported in studies of music structure violations (see Fedorenko and Varley 2016 for discussion).
Third, most prior fMRI (and MEG) investigations have relied on comparisons of group-level activation maps. Such analyses suffer from low functional resolution (e.g., Nieto-Castañón and Fedorenko 2012; Fedorenko 2021), especially in cases where the precise locations of functional regions vary across individuals, as in the association cortex (Fischl et al. 2008; Frost and Goebel 2012; Tahmasebi et al. 2012; Vazquez-Rodriguez et al. 2019). Thus, observing activation overlap at the group level does not unequivocally support shared mechanisms. Indeed, studies that have used individual-subject-level analyses have reported a low or no response to music in the language-responsive regions (Fedorenko et al. 2011; Rogalsky et al. 2011; Deen et al. 2015).
Fourth, the interpretation of some of the observed effects has relied on the so-called ‘reverse inference’ (Poldrack 2006, 2011; Fedorenko 2021), where function is inferred from a coarse anatomical location: for example, some music-structure-related effects observed in or around ‘Broca’s area’ have been interpreted as reflecting the engagement of linguistic-structure-processing mechanisms (e.g., Maess et al. 2001; Koelsch et al. 2002) given the long-standing association between ‘Broca’s area’ and language, including syntactic processing specifically (e.g., Caramazza and Zurif 1976; Friederici et al. 2006). However, this reasoning is not valid: Broca’s area is a heterogeneous region, which houses components of at least two functionally distinct brain networks (Fedorenko et al. 2012; Fedorenko and Blank 2020): the language-selective network, which responds during language processing, visual or auditory, but does not respond to diverse non-linguistic stimuli (Fedorenko et al. 2011; Monti et al. 2009, 2012; see Fedorenko and Varley 2016 for a review) and the domain-general executive control or ‘multiple demand (MD)’ network, which responds to any demanding cognitive task and is robustly modulated by task difficulty (Duncan 2010, 2013; Fedorenko et al. 2013; Assem et al. 2020). As a result, here and more generally, functional interpretation based on coarse anatomical localization is not justified.
Fifth, many prior fMRI investigations have not reported the magnitudes of response to the relevant conditions and only examined statistical significance maps for the contrast of interest (e.g., a whole brain map showing voxels that respond reliably more strongly to melodies with vs. without a structural violation, and to sentences with vs. without a structural violation). Response magnitudes of experimental conditions relative to a low-level baseline and to each other are critical for interpreting a functional profile of a brain region (see e.g., Chen et al. 2017, for discussion). For example, a reliable violation > no violation effect in music (similar arguments apply to language) could be observed when both conditions elicit above-baseline responses, and the violation condition elicits a stronger response (Figure 1A left bar graph)—a reasonable profile for a brain region that supports music processing and is sensitive to the target structural manipulation. However, a reliable violation > no violation effect could also be observed when both conditions elicit below-baseline responses, and the violation condition elicits a less negative response (Figure 1A middle bar graph), or when both conditions elicit low responses—in the presence of a strong response to stimuli in other domains—and the between-condition difference is small (Figure 1A right bar graph; note that with sufficient power even very small effects can be highly reliable, but this does not make them theoretically meaningful; e.g., Cumming 2012; Sullivan and Feinn, 2012). The two latter profiles, where a brain region is more active during silence than when listening to music, or when the response is overall low and the effect of interest is minuscule, would be harder to reconcile with a role of this brain region in music processing (see also the second point above).
Similarly, with respect to the music-language overlap question, a reliable violation > no violation effect for both language and music could be observed in a brain region where sentences and melodies with violations elicit similarly strong responses, and those without violations elicit lower responses (Figure 1B left bar graph); but it could also arise in a brain region where sentences with violations elicit a strong response, sentences without violations elicit a lower response, but melodies elicit an overall low response, with the violation condition eliciting a higher response than the no-violation condition (Figure 1B right bar graph). Whereas in the first case, it may be reasonable to argue that the brain region in question supports some computation that is necessary to process structure violations in both domains, such interpretation would not be straightforward in the second case. In particular, given the large main effect of language > music, any account of possible computations supported by such a brain region would need to explain this difference instead of simply focusing on the presence of a reliable effect of violation in both domains. In summary, without examining the magnitudes of response, it is not possible to distinguish among many, potentially very different, functional profiles, without which formulating hypotheses about a brain region’s computations is precarious.
Aside from the limitations above, to the best of our knowledge, all prior brain imaging studies have used a single manipulation in one set of materials and one set of participants. To compellingly argue that a brain region supports (some aspects of) structural processing in both language and music, it is important to establish both the robustness of the key effect by replicating it with a new set of experimental materials and/or in a new group of participants, and its generalizability to other contrasts between conditions that engage the hypothesized computation and ones that do not. For example, to argue that a brain region houses a core syntactic mechanism needed to process hierarchical relations and/or recursion in both language and music (e.g., Patel 2003; Fadiga et al. 2009; Roberts 2012; Koelsch et al. 2013; Fitch and Martins 2014), one would need to demonstrate that this region i) responds robustly to diverse structured linguistic and musical stimuli (which all invoke the hypothesized shared computation), ii) shows replicable responses across materials and participants, and iii) is sensitive to more than a single manipulation targeting the hypothesized computations specifically, as needed to rule out paradigm-/task-specific accounts (e.g., structured vs. unstructured stimuli, stimuli with vs. without structural violations, stimuli that are more vs. less structurally complex—e.g., with long-distance vs. local dependencies, adaptation to structure vs. some other aspect of the stimulus, etc.).
Finally, the neuropsychological patient evidence is at odds with the idea of shared mechanisms for processing language and music. If language and music relied on the same syntactic processing mechanism, individuals impaired in their processing of linguistic syntax should also exhibit impairments in musical syntax. Although some prior studies report subtle musical deficits in patients with aphasia (Patel et al. 2008a; Sammler et al. 2011), the evidence is equivocal, and many aphasic patients appear to have little or no difficulties with music, including the processing of music structure (Luria et al. 1965; Brust 1980; Marin 1982; Basso and Capitani 1985; Polk and Kertesz 1993; Slevc et al. 2016; Faroqi-Shah et al. 2020; Chiapetta et al. 2022; cf. Omigie and Samson 2014 and Sihvonen et al. 2017 for discussions of evidence that musical training may lead to better outcomes following brain damage/resection). Similarly, children with Specific Language Impairment (now called Developmental Language Disorder)—a developmental disorder that affects several aspects of linguistic and cognitive processing, including syntactic processing (e.g., Bortolini et al. 1998; Bishop and Norbury 2002)—show no impairments in musical processing (Fancourt 2013; cf. Jentschke et al. 2008). In an attempt to reconcile the evidence from acquired and developmental disorders with claims about structure-processing overlap based on behavioral and neural evidence from neurotypical participants, Patel (2003, 2008, 2012; see Slevc and Okada 2015, Patel and Morgan 2017, and Asano et al. 2021 for related proposals) put forward a hypothesis whereby the representations that mediate language and music are stored in distinct brain areas, but the mechanisms that perform online computations on those representations are partially overlapping. We return to this idea in the Discussion.
To bring clarity to this ongoing debate, we conducted three fMRI experiments with neurotypical adults, and a behavioral study with individuals with severe aphasia. For the fMRI experiments, we took an approach where we focused on the ‘language network’—a well-characterized set of left frontal and temporal brain areas that selectively support linguistic processing (e.g., Fedorenko et al. 2011) and asked whether any parts of this network show responses to music and sensitivity to music structure. In each experiment, we used an extensively validated language ‘localizer’ task based on the reading of sentences and nonword sequences (Fedorenko et al. 2010; see Scott et al. 2017 and Malik-Moraleda, Ayyash et al. 2022 for evidence that this localizer is modality-independent) to identify language-responsive areas in each participant individually. Importantly, these areas have been shown, across dozens of brain imaging studies, to be robustly sensitive to linguistic syntactic processing demands in diverse manipulations (e.g., Keller et al. 2001; Röder et al. 2002; Friederici 2011; Pallier et al. 2011; Bautista and Wilson 2016, among many others)—including when defined with the same localizer as the one used here (e.g., Fedorenko et al. 2010, 2012a, 2020; Blank et al. 2016; Mollica et al. 2020; Shain, Blank et al. 2020; Shain et al. 2022)—and their damage leads to linguistic, including syntactic, deficits (e.g., Caplan et al. 1996; Dick et al. 2001; Wilson and Saygin 2004; Tyler et al. 2011; Wilson et al. 2012; Mesulam et al. 2014; Ding et al. 2020; Matchin and Hickok 2020, among many others). To address the critical research question, we examined the responses of these language areas to music, and their necessity for processing music structure. In Experiment 1, we included several types of music stimuli including orchestral music, single-instrument music, synthetic drum music, and synthetic melodies, a minimal comparison between songs and spoken lyrics, and a set of non-music auditory control conditions. We additionally examined sensitivity to structure in music across two structure-scrambling manipulations. In Experiment 2, we further probed sensitivity to structure in music using the most common manipulation, contrasting responses to well-formed melodies vs. melodies containing a note that does not obey the constraints of Western tonal music. And in Experiment 3, we examined the ability to discriminate between well-formed melodies and melodies containing a structural violation in three profoundly aphasic individuals across two tasks. Finally, in Experiment 4, we examined the responses of the language regions to yet another set of music stimuli in a new set of participants. Further, the participants were all native speakers of Mandarin, a tonal language, which allowed us to evaluate the hypothesis that language regions may play a greater role in music processing in individuals with higher sensitivity to linguistic pitch (e.g., Deutsch et al. 2006, 2009; Bidelman et al. 2011; Creel et al. 2018; Ngo et al. 2016; Liu et al. 2021).
Materials and methods
Participants
Experiments 1, 2, and 4 (fMRI)
48 individuals (age 18-51, mean 24.3; 28 female, 20 male) from the Cambridge/Boston, MA community participated for payment across three fMRI experiments (n=18 in Experiment 1; n=20 in Experiment 2; n=18 in Experiment 4; 8 participants overlapped between Experiments 1 and 2). 33 participants were right-handed and four left-handed, as determined by the Edinburgh handedness inventory (Oldfield 1971), or self-report (see Willems et al. 2014, for arguments for including left-handers in cognitive neuroscience research); the handedness data for the remaining 11 participants (one in Experiment 2 and 10 in Experiment 4) were not collected. All but one participant (with no handedness information) in Experiment 4 showed typical left-lateralized language activations in the language localizer task described below (as assessed by numbers of voxels falling within the language parcels in the left vs. right hemisphere (LH vs. RH), using the following formula: (LH-RH)/(LH+RH); e.g., Jouravlev et al. 2020; individuals with values of 0.25 or greater were considered to have a left-lateralized language system). For the participant with right-lateralized language activations (with a lateralization value at or below -0.25), we used right-hemisphere language regions for the analyses (see SI-3 for analyses where the LH language regions were used for this participant and when this participant is excluded; the critical results were not affected). Participants in Experiments 1 and 2 were native English speakers; participants in Experiment 4 were native Mandarin speakers and proficient speakers of English (none had any knowledge of Russian, which was used in the unfamiliar foreign-language condition in Experiment 4). Detailed information on the participants’ music background was, unfortunately, not collected, except for ensuring that the participants were not professional musicians. All participants gave informed written consent in accordance with the requirements of MIT’s Committee on the Use of Humans as Experimental Subjects (COUHES).
Experiment 3 (behavioral)
Individuals with aphasia
Three participants with severe and chronic aphasia were recruited to the study (SA, PR, and PP). All participants gave informed consent in accordance with the requirements of UCL’s Institutional Review Board. Background information on each participant is presented in Table 1. Anatomical scans are shown in Figure 2A and extensive perisylvian damage in the left hemisphere, encompassing areas where language activity is observed in neurotypical individuals, is illustrated in Figure 2B.
Control participants
We used Amazon.com’s Mechanical Turk platform to recruit normative samples for the music tasks and a subset of the language tasks that are most critical to linguistic syntactic comprehension. Ample evidence now shows that online experiments yield data that closely mirror the data patterns in experiments conducted in a lab setting (e.g., Crump et al. 2013). Data from participants with IP addresses in the US who self-reported being native English speakers were included in the analyses. 50 participants performed the critical music task, and the Scale task from the MBEA (Peretz et al. 2003), as detailed below. Data from participants who responded incorrectly to the catch trial in the MBEA Scale task (n=5) were excluded from the analyses, for a final sample of 45 control participants for the music tasks. A separate sample of 50 participants performed the Comprehension of spoken reversible sentences task. Data from one participant who completed fewer than 75% of the questions and another participant who did not report being a native English speaker were excluded for a final sample of 48 control participants. Finally, a third sample of 50 participants performed the Spoken grammaticality judgment task. Data from one participant who did not report being a native English speaker were excluded for a final sample of 49 control participants.
Design, materials, and procedure
Experiments 1, 2, and 4 (fMRI)
Each participant completed a language localizer task (Fedorenko et al. 2010) and one or more of the critical music perception experiments, along with one or more tasks for unrelated studies. The scanning sessions lasted approximately two hours.
Language localizer
This task is described in detail in Fedorenko et al. (2010) and subsequent studies from the Fedorenko lab (e.g., Fedorenko et al. 2011; Blank et al. 2014; Blank et al. 2016; Pritchett et al. 2018; Paunov et al. 2019; Fedorenko et al. 2020; Shain, Blank et al. 2020, among others) and is available for download from https://evlab.mit.edu/funcloc/). Briefly, participants read sentences and lists of unconnected, pronounceable nonwords in a blocked design. Stimuli were presented one word/nonword at a time at the rate of 450ms per word/nonword. Participants read the materials passively and performed a simple button-press task at the end of each trial (included in order to help participants remain alert). Each participant completed two ∼6-minute runs. This localizer task has been extensively validated and shown to be robust to changes in the materials, modality of presentation (visual vs. auditory), and task (e.g., Fedorenko et al. 2010; Fedorenko 2014; Scott et al. 2017; Diachek, Blank, Siegelman et al. 2020; Malik-Moraleda, Ayyash et al. 2022; Lipkin et al. 2022; see the results of Experiments 1 and 4 for additional replications of modality robustness). Further, a network that corresponds closely to the localizer contrast (sentences > nonwords) emerges robustly from whole-brain task-free data—voxel fluctuations during rest (e.g., Braga et al. 2020; see Braga 2021 for a general discussion of how well-validated localizers tend to show tight correspondence with intrinsic networks recovered in a data-driven way). The fact that different regions of the language network show strong correlations in their activity during naturalistic cognition (see also Blank et al. 2014; Paunov et al. 2019; Malik-Moraleda, Ayyash et al. 2022) provides support for the idea that this network constitutes a ‘natural kind’ in the brain (a subset of the brain that is strongly internally integrated and robustly dissociable from the rest of the brain) and thus a meaningful unit of analysis. However, we also examine individual regions of this network, to paint a more complete picture, given that many past claims about language-music overlap have concerned the inferior frontal component of the language network.
Experiment 1
Participants passively listened to diverse stimuli across 18 conditions in a long-event-related design. The materials for this and all other experiments are available at OSF: https://osf.io/68y7c/. All stimuli were 9 s in length. The conditions were selected to probe responses to music, to examine sensitivity to structure scrambling in music, to compare responses to songs vs. spoken lyrics, and to compare responses to music stimuli vs. other auditory stimuli.
The four non-vocal music conditions (all Western tonal music) included orchestral music, single-instrument music, synthetic drum music, and synthetic melodies (see SI-5 for a summary of the acoustic properties of these and other conditions, as quantified with the MIR toolbox; Lartillot and Toiviainen 2007; Lartillot and Grandjean 2019). The orchestral music condition consisted of 12 stimuli (SI-Table 4a) selected from classical orchestras or jazz bands. The single-instrument music condition consisted of 12 stimuli (SI-Table 4b) that were played on one of the following instruments: cello (n=1), flute (n=1), guitar (n=4), piano (n=4), sax (n=1), or violin (n=1). The synthetic drum music condition consisted of 12 stimuli synthesized using percussion patches from MIDI files taken from freely available online collections. The stimuli were synthesized using the MIDI toolbox for MATLAB (writemidi). The synthetic melodies condition consisted of 12 stimuli transcribed from folk tunes obtained from freely available online collections. Each melody was defined by a sequence of notes with corresponding pitches and durations. Each note was composed of harmonics 1 through 10 of the fundamental presented in equal amplitude, with no gap in-between notes. Phase discontinuities between notes were avoided by ensuring that the starting phase of the next note was equal to the ending phase of the previous note.
The synthetic drum music and the synthetic melodies conditions had scrambled counterparts to probe sensitivity to music structure. This intact > scrambled contrast has been used in some past studies of structure processing in music (e.g., Levitin and Menon 2003) and is parallel to the sentences > word-list contrast in language, which has been often used to probe sensitivity for combinatorial processing (e.g., Fedorenko et al. 2010). The scrambled drum music condition was created by jittering the inter-note-interval (INI). The amount of jitter was sampled from a uniform distribution (from -0.5 to 0.5 beats). The scrambled INIs were truncated to be no smaller than 5% of the distribution of INIs from the intact drum track. The total distribution of INIs was then scaled up or down to ensure that the total duration remained unchanged. The scrambled melodies condition was created by scrambling both pitch and rhythm information. Pitch information was scrambled by randomly re-ordering the sequence of pitches and then adding jitter to disrupt the key. The amount of jitter for each note was sampled from a uniform distribution centered on the note’s pitch after shuffling (from -3 to +3 semitones). The duration of each note was also jittered (from -0.2 to 0.2 beats). To ensure the total duration was unaffected by jitter, N/2 positive jitter values were sampled, where N is the number of notes, and then a negative jitter was added with the same magnitude for each of the positive samples, such that the sum of all jitters equaled 0. To ensure the duration of each note remained positive, the smallest jitters were added to the notes with the smallest durations. Specifically, the note durations and sampled jitters were sorted by their magnitude, summed, and then the jittered durations were randomly re-ordered.
To allow for a direct comparison between music and linguistic conditions within the same experiment, we included auditory sentences and auditory nonword sequences. The sentence condition consisted of 24 lab-constructed stimuli (half recorded by a male, and half by a female). Each stimulus consisted of a short story (each three sentences long) describing common, everyday events. Any given participant heard 12 of the stimuli (6 male, 6 female). The nonword sequence condition consisted of 12 stimuli (recorded by a male).
We also included two other linguistic conditions: songs and spoken lyrics. These conditions were included to test whether the addition of a melodic contour to speech (in songs) would increase the responses of the language regions. Such a pattern might be expected of a brain region that responds to both linguistic content and music structure. The songs and the lyrics conditions each consisted of 24 stimuli. We selected songs with a tune that was easy to sing without accompaniment. These materials were recorded by four male singers: each recorded between 2 and 11 song-lyrics pairs. The singers were actively performing musicians (e.g., in a cappella groups) but were not professionals. Any given participant heard either the song or the lyrics version of an item for 12 stimuli in each condition.
Finally, to assess the specificity of potential responses to music, we included three non-music conditions: animal sounds and two kinds of environmental sounds (pitched and unpitched), which all share some low-level acoustic properties with music (see SI-5). The animal sounds condition and the environmental sounds conditions each consisted of 12 stimuli taken from in-lab collections. If individual recordings were shorter than 9 s, then several recordings of the same type of sound were concatenated together (100 ms gap in between). We included the pitch manipulation in order to test for general responsiveness to pitch—a key component of music—in the language regions.
(The remaining five conditions were not directly relevant to the current study or redundant with other conditions for our research questions and therefore not included in the analyses. These included three distorted speech conditions—lowpass-filtered speech, speech with a flattened pitch contour, and lowpass-filtered speech with a flattened pitch contour—and two additional low-level controls for the synthetic melody stimuli. The speech conditions were included to probe sensitivity to linguistic prosody for an unrelated study. The additional synthetic music control conditions were included to allow for a more rigorous interpretation of the intact > scrambled synthetic melodies effect had we observed such an effect. For completeness, on the OSF page, https://osf.io/68y7c/, we provide a data table that includes responses to these five conditions.)
For each participant, stimuli were randomly divided into six sets (corresponding to runs) with each set containing two stimuli from each condition. The order of the conditions for each run was selected from four predefined palindromic orders, which were constructed so that conditions targeting similar mental processes (e.g., orchestral music and single-instrument music) were separated by other conditions (e.g., speech or animal sounds). Each run contained three 10 s fixation periods: at the beginning, in the middle, and at the end. Otherwise, the stimuli were separated by 3 s fixation periods, for a total run duration of 456 s (7 min 36 s). All but two of the 18 participants completed all six runs (and thus got a total of 12 experimental events per condition); the remaining two completed four runs (and thus got 8 events per condition).
Because, as noted above, we have previously established that the language localizer is robust to presentation modality, we used the visual localizer to define the language regions. However, in SI-2 we show that the critical results are similar when auditory contrasts (sentences > nonwords in Experiment 1, or Mandarin sentences > foreign in Experiment 4) are instead used to define the language regions.
Experiment 2
Participants listened to well-formed melodies (adapted and expanded from Fedorenko et al. 2009) and melodies with a structural violation in a long-event-related design and judged the well-formedness of the melodies. As discussed in the Introduction, this type of manipulation is commonly used to probe sensitivity to music structure, including in studies examining language-music overlap (e.g., Patel et al. 1998; Koelsch et al. 2000, 2002; Maess et al. 2001; Tillmann et al. 2003; Fedorenko et al. 2009; Slevc et al. 2009; Kunert et al. 2015; Musso et al. 2015). The melodies were between 11 and 14 notes. The well-formed condition consisted of 90 melodies, which were tonal and ended in a tonic note with an authentic cadence in the implied harmony. All melodies were isochronous, consisting of quarter notes except for the final half note. The first five notes established a strong sense of key. Each melody was then altered to create a version with a “sour” note: the pitch of one note (from among the last four notes in a melody) was altered up or down by one or two semitones, so as to result in a non-diatonic note while keeping the melodic contour (the up-down pattern) the same. The structural position of the note that underwent this change varied among the tonic, the fifth, and the major third. The full set of 180 melodies was distributed across two lists following a Latin Square design. Any given participant heard stimuli from one list.
For each participant, stimuli were randomly divided into two sets (corresponding to runs) with each set containing 45 melodies (22 or 23 per condition). The order of the conditions, and the distribution of inter-trial fixation periods, was determined by the optseq2 algorithm (Dale et al. 1999). The order was selected from among four predefined orders, with no more than four trials of the same condition in a row. In each trial, participants were presented with a melody for three seconds followed by a question, presented visually on the screen, about the well-formedness of the melody (“Is the melody well-formed?”). To respond, participants had to press one of two buttons on a button box within two seconds. When participants answered, the question was replaced by a blank screen for the remainder of the two-second window; if no response was made within the two-second window, the experiment advanced to the next trial. Responses received within one second after the end of the previous trial were still recorded to account for the possible slow responses. The screen was blank during the presentation of the melodies. Each run contained 151 s of fixation interleaved among the trials, for a total run duration of 376 s (6 min 16 s). Fourteen of the 20 participants completed both runs, four participants completed one run, and the two remaining participants completed two runs but we only included their first run because, due to experimenter error, the second run came from a different experimental list and thus included some of the melodies from the first run in the other condition (the data pattern was qualitatively and quantitatively the same if both runs were included for these participants). Finally, due to a script error, participants only heard the first 12 notes of each melody during the three seconds of stimulus presentation. Therefore, we only analyzed the 80 of the 90 pairs (160 of the 180 total melodies) where the contrastive note appeared within the first 12 notes.
Experiment 4
Participants passively listened to single-instrument music, environmental sounds, sentences in their native language (Mandarin), and sentences in an unfamiliar foreign language (Russian) in a blocked design. All stimuli were 5-5.95s in length. The conditions were selected to probe responses to music, and to compare responses to music stimuli vs. other auditory stimuli. The critical music condition consisted of 60 stimuli selected from classical pieces by J.S. Bach played on cello, flute, or violin (n=15 each) and jazz music played on saxophone (n=15). The environmental sounds condition consisted of 60 stimuli selected from in-lab collections and included both pitched and unpitched stimuli. The foreign language condition consisted of 60 stimuli selected from Russian audiobooks (short stories by Paustovsky and “Fathers and Sons” by Turgenev). The foreign language condition was included because creating a ‘nonwords’ condition (the baseline condition we typically use for defining the language regions; Fedorenko et al. 2010) is challenging in Mandarin given that most words are monosyllabic, thus most syllables carry some meaning. As a result, sequences of syllables are more akin to lists of words. Therefore, we included the unfamiliar foreign language condition, which also works well as a baseline for language processing (Malik-Moraleda, Ayyash et al. 2022). The Mandarin sentence condition consisted of 120 lab-constructed sentences, each recorded by a male and a female native speaker. (The experiment also included five conditions that were not relevant to the current study and therefore not included in the analyses. These included three conditions probing responses to the participants’ second language (English) and two control conditions for Mandarin sentences. For completeness, on the OSF page, https://osf.io/68y7c/, we provide a data table that includes responses to these five conditions.)
Stimuli were grouped into blocks with each block consisting of three stimuli and lasting 18s (stimuli were padded with silence to make each trial exactly six seconds long). For each participant, blocks were divided into 10 sets (corresponding to runs), with each set containing two blocks from each condition. The order of the conditions for each run was selected from eight predefined palindromic orders. Each run contained three 14 s fixation periods: at the beginning, in the middle, and at the end, for a total run duration of 366 s (6 min 6 s). Five participants completed eight of the 10 runs (and thus got 16 blocks per condition; the remaining thirteen completed six runs (and thus got 12 blocks per condition). (We had created enough materials for 10 runs, but based on observing robust effects for several key contrasts in the first few participants who completed six to eight runs, we administered 6-8 runs to the remaining participants.)
Because we have previously found that an English localizer works well in native speakers of diverse languages, including Mandarin, as long as they are proficient in English (Malik-Moraleda, Ayyash et al. 2022), we used the same localizer in Experiment 4 as the one used in Experiments 1 and 2, for consistency. However, in SI-2 (SI-Figure 2c, SI-Table 2c) we show that the critical results are similar when the Mandarin sentences > foreign contrast is instead used to define the language regions.
Experiment 3 (behavioral)
Language assessments. Participants with aphasia were assessed for the integrity of lexical processing using word-to-picture matching tasks in both spoken and written modalities (ADA Spoken and Written Word-Picture Matching; Franklin et al. 1992). Productive vocabulary was assessed through picture naming. In the spoken modality, the Boston Naming Test was employed (Kaplan et al. 2001), and in writing, the PALPA Written Picture Naming subtest (Kay et al. 1992). Sentence processing was evaluated in both spoken and written modalities through comprehension (sentence-to-picture matching) of reversible sentences in active and passive voice. In a reversible sentence, the heads of both noun phrases are plausible agents, and therefore, word order, function words, and functional morphology are the only cues to who is doing what to whom. Participants also completed spoken and written grammaticality judgment tasks, where they made a yes/no decision as to the grammaticality of a word string. The task employed a subset of sentences from Linebarger et al. (1983).
All three participants exhibited severe language impairments that disrupted both comprehension and production (Table 2). For lexical-semantic tasks, all three participants displayed residual comprehension ability for high imageability/picturable vocabulary, although more difficulty was evident on the synonym matching test, which included abstract words. They were all severely anomic in speech and writing. Sentence production was severely impaired with output limited to single words, social speech (expressions, like “How are you?”), and other formulaic expressions (e.g., “and so forth”). Critically, all three performed at or close to chance level on spoken and written comprehension of reversible sentences and grammaticality judgments; each patient’s scores were lower than all of the healthy controls (Table 2 and Figure 2C).
Critical music task
Participants judged the well-formedness of the melodies from Experiment 2. Judgments were intended to reflect the detection of the key violation in the sour versions of the melodies. The full set of 180 melodies was distributed across two lists following a Latin Square design. All participants heard all 180 melodies. The control participants heard the melodies from one list, followed by the melodies from the other list, with the order of lists counter-balanced across participants. For the participants with aphasia, each list was further divided in half, and each participant was tested across four sessions, with 45 melodies per session, to minimize fatigue.
Montreal Battery for the Evaluation of Amusia
To obtain another measure of music competence/sensitivity to music structure, we administered the Montreal Battery for the Evaluation of Amusia (MBEA) (Peretz et al. 2003). The battery consists of six tasks that assess musical processing components described by Peretz and Coltheart (2003): three target melodic processing, two target rhythmic processing, and one assesses memory for melodies. Each task consists of 30 experimental trials (and uses the same set of 30 base melodies) and is preceded by practice examples. Some of the tasks additionally include a catch trial, as described below. For the purposes of the current investigation, the critical task is the “Scale” task. Participants are presented with pairs of melodies that they have to judge as identical or not. On half of the trials, one of the melodies is altered by modifying the pitch of one of the tones to be out of scale. Like our critical music task, this task aims to test participants’ ability to represent and use tonal structure in Western music, except that instead of making judgments on each individual melody, participants compare two melodies on each trial. This task thus serves as a conceptual replication (Schmidt 2009). One trial contains stimuli designed to be easy, intended as a catch trial to ensure that participants are paying attention. In this trial, the comparison melody has all its pitches set at random. This trial is excluded when computing the scores.
Control participants performed just the Scale task. Participants with aphasia performed all six tasks, distributed across three testing sessions to minimize fatigue.
fMRI data acquisition, preprocessing, and first-level modeling (for Experiments 1, 2, and 4)
Data acquisition
Whole-brain structural and functional data were collected on a whole-body 3 Tesla Siemens Trio scanner with a 32-channel head coil at the Athinoula A. Martinos Imaging Center at the McGovern Institute for Brain Research at MIT. T1-weighted structural images were collected in 176 axial slices with 1 mm isotropic voxels (repetition time (TR) = 2,530 ms; echo time (TE) = 3.48 ms). Functional, blood oxygenation level-dependent (BOLD) data were acquired using an EPI sequence with a 90° flip angle and using GRAPPA with an acceleration factor of 2; the following parameters were used: thirty-one 4.4 mm thick near-axial slices acquired in an interleaved order (with 10% distance factor), with an in-plane resolution of 2.1 mm × 2.1 mm, FoV in the phase encoding (A >> P) direction 200 mm and matrix size 96 × 96 voxels, TR = 2000 ms and TE = 30 ms. The first 10 s of each run were excluded to allow for steady state magnetization (see OSF https://osf.io/68y7c/ for the pdf of the scanning protocols). (Note that we opted to use a regular, continuous, scanning sequence in spite of investigating responses to auditory conditions. However, effects of scanner noise are unlikely to be detrimental given that all the stimuli are clearly perceptible, as also confirmed by examining responses in the auditory areas.)
Preprocessing
fMRI data were analyzed using SPM12 (release 7487), CONN EvLab module (release 19b), and other custom MATLAB scripts. Each participant’s functional and structural data were converted from DICOM to NIFTI format. All functional scans were coregistered and resampled using B-spline interpolation to the first scan of the first session (Friston et al. 1995). Potential outlier scans were identified from the resulting subject-motion estimates as well as from BOLD signal indicators using default thresholds in CONN preprocessing pipeline (5 standard deviations above the mean in global BOLD signal change, or framewise displacement values above 0.9 mm; Nieto-Castañón 2020). Functional and structural data were independently normalized into a common space (the Montreal Neurological Institute [MNI] template; IXI549Space) using SPM12 unified segmentation and normalization procedure (Ashburner and Friston 2005) with a reference functional image computed as the mean functional data after realignment across all timepoints omitting outlier scans. The output data were resampled to a common bounding box between MNI-space coordinates (-90, -126, -72) and (90, 90, 108), using 2mm isotropic voxels and 4th order spline interpolation for the functional data, and 1mm isotropic voxels and trilinear interpolation for the structural data. Last, the functional data were smoothed spatially using spatial convolution with a 4 mm FWHM Gaussian kernel.
First-level modeling
For both the language localizer task and the critical experiments, effects were estimated using a General Linear Model (GLM) in which each experimental condition was modeled with a boxcar function convolved with the canonical hemodynamic response function (HRF) (fixation was modeled implicitly, such that all timepoints that did not correspond to one of the conditions were assumed to correspond to a fixation period). Temporal autocorrelations in the BOLD signal timeseries were accounted for by a combination of high-pass filtering with a 128 seconds cutoff, and whitening using an AR(0.2) model (first-order autoregressive model linearized around the coefficient a=0.2) to approximate the observed covariance of the functional data in the context of Restricted Maximum Likelihood estimation (ReML). In addition to experimental condition effects, the GLM design included first-order temporal derivatives for each condition (included to model variability in the HRF delays), as well as nuisance regressors to control for the effect of slow linear drifts, subject-motion parameters, and potential outlier scans on the BOLD signal.
Definition of the language functional regions of interest (fROIs) (for Experiments 1, 2, and 4)
For each critical experiment, we defined a set of language functional regions of interest (fROIs) using group-constrained, subject-specific localization (Fedorenko et al. 2010). In particular, each individual map for the sentences > nonwords contrast from the language localizer was intersected with a set of five binary masks. These masks (Figure 3; available at OSF: https://osf.io/68y7c/) were derived from a probabilistic activation overlap map for the same contrast in a large independent set of participants (n=220) using watershed parcellation, as described in Fedorenko et al. (2010) for a smaller set of participants. These masks covered the fronto-temporal language network in the left hemisphere. Within each mask, a participant-specific language fROI was defined as the top 10% of voxels with the highest t-values for the localizer contrast.
Validation of the language fROIs
To ensure that the language fROIs behave as expected (i.e., show a reliably greater response to the sentences condition compared to the nonwords condition), we used an across-runs cross-validation procedure (e.g., Nieto-Castañón and Fedorenko 2012). In this analysis, the first run of the localizer was used to define the fROIs, and the second run to estimate the responses (in percent BOLD signal change, PSC) to the localizer conditions, ensuring independence (e.g., Kriegeskorte et al. 2009); then the second run was used to define the fROIs, and the first run to estimate the responses; finally, the extracted magnitudes were averaged across the two runs to derive a single response magnitude for each of the localizer conditions. Statistical analyses were performed on these extracted PSC values. Consistent with much previous work (e.g., Fedorenko et al. 2010; Mahowald and Fedorenko 2016; Diachek, Blank, Siegelman et al. 2020), each of the language fROIs showed a robust sentences > nonwords effect (all ps < 0.001).
Statistical analyses for the fMRI experiments
All analyses were performed with linear mixed-effects models using the “lme4” package in R with p-value approximation performed by the “lmerTest” package (Bates et al. 2015; Kuznetsova et al. 2017). Effect size (Cohen’s d) was calculated using the method from Westfall et al. (2014) and Brysbaert and Stevens (2018).
Sanity check analyses and results
To estimate the responses in the language fROIs to the conditions of the critical experiments here and in the critical analyses, the data from all the runs of the language localizer were used to define the fROIs, and the responses to each condition were then estimated in these regions. Statistical analyses were then performed on these extracted PSC values. (For Experiments 1 and 4, we repeated the analyses using alternative language localizer contrasts to define the language fROIs (auditory sentences > nonwords in Experiment 1, and Mandarin sentences > foreign in Experiment 4), which yielded quantitatively and qualitatively similar responses (see SI-2).)
We conducted two sets of sanity check analyses. First, to ensure that auditory conditions that contain meaningful linguistic content elicit strong responses in the language regions relative to perceptually similar conditions with no discernible linguistic content, we compared the auditory sentences condition with the auditory nonwords condition (Experiment 1) or with the foreign language condition (Experiment 4). Indeed, as expected, the auditory sentence condition elicited a stronger response than the auditory nonwords condition (Experiment 1) or the foreign language condition (Experiment 4). These effects were robust at the network level (ps < 0.001; SI-Table 1a). Further, the sentences > nonwords effect was significant in all but one language fROI in Experiment 1, and the sentences > foreign effect was significant in all language fROIs in Experiment 4 (ps < 0.05; SI-Table 1a).
And second, to ensure that the music conditions elicit strong responses in auditory cortex, we extracted the responses from a bilateral anatomically defined auditory cortical region (area Te1.2 from the Morosan et al. 2001 cytoarchitectonic probabilistic atlas) to the six critical music conditions: orchestral music, single instrument music, synthetic drum music, and synthetic melodies in Experiment 1; well-formed melodies in Experiment 2; and the music condition in Experiment 4. Statistical analyses, comparing each condition to the fixation baseline, were performed on these extracted PSC values. As expected, all music conditions elicited strong responses in a primary auditory area bilaterally (all ps ≅ 0.001; SI-Table 1b; SI-Figure 1).
Critical analyses
To characterize the responses in the language network to music perception, we asked three questions. First, we asked whether music conditions elicit strong responses in the language regions. Second, we investigated whether the language network is sensitive to structure in music, as would be evidenced by stronger responses to intact than scrambled music, and stronger responses to melodies with structural violations compared to the no-violation control condition. And third, we asked whether music conditions elicit strong responses in the language regions of individuals with high sensitivity to linguistic pitch—native speakers of a tonal language (Mandarin).
For each contrast (the contrasts relevant to the three research questions are detailed below), we used two types of linear mixed-effect regression models:
the language network model, which examined the language network as a whole; and
the individual language fROI models, which examined each language fROI separately.
As alluded to in the Introduction, treating the language network as an integrated system is reasonable given that the regions of this network a) show similar functional profiles, both with respect to selectivity for language over non-linguistic processes (e.g., Fedorenko et al. 2011; Pritchett et al. 2018; Jouravlev et al. 2019; Ivanova et al. 2020, 2021) and with respect to their role in lexico-semantic and syntactic processing (e.g., Fedorenko et al. 2012b; Blank et al. 2016; Fedorenko et al. 2020); and b) exhibit strong inter-region correlations in both their activity during naturalistic cognition paradigms (e.g., Blank et al. 2014; Braga et al. 2020; Paunov et al. 2019; Malik-Moraleda, Ayyash et al. 2022) and key functional markers, like the strength or extent of activation in response to language stimuli (e.g., Mahowald and Fedorenko 2016; Mineroff, Blank et al. 2018). However, to allow for the possibility that language regions differ in their response to music and to examine the region on which most claims about language-music overlap have focused (the region that falls within ‘Broca’s area’), we supplement the network-wise analyses with the analyses of the five language fROIs separately.
For each network-wise analysis, we fit a linear mixed-effect regression model predicting the level of BOLD response in the language fROIs in the contrasted conditions. The model included a fixed effect for condition (the relevant contrasts are detailed below for each analysis) and random intercepts for fROIs and participants. Here and elsewhere, the p-value was estimated by applying the Satterthwaite’s method-of-moment approximation to obtain the degrees of freedom (Giesbrecht and Burns 1985; Fai and Cornelius 1996; as described in Kuznetsova et al. 2017). For the comparison against the fixation baseline, the random intercept for participants was removed because it is no longer applicable. For each fROI-wise analysis, we fit a linear mixed-effect regression model predicting the level of BOLD response in each of the five language fROIs in the contrasted conditions. The model included a fixed effect for condition and random intercepts for participants. For each analysis, the results were FDR-corrected for the five fROIs. For the comparison against the fixation baseline, the random intercept for participants was removed because it is no longer applicable.
Results
Does music elicit a response in the language network?
As discussed in the Introduction, a brain region that supports (some aspect of) music processing, including structure processing, should show a strong response to music stimuli. To test whether language regions respond to music, we used four contrasts using data from Experiments 1 and 2. First, we compared the responses to each of the music conditions (orchestral music, single instrument music, synthetic drum music, and synthetic melodies in Experiment 1; well-formed melodies in Experiment 2) against the fixation baseline—the most liberal baseline. Second, we compared the responses to the music conditions against the response to the nonword strings condition—an unstructured and meaningless linguistic stimulus (in Experiment 1, we used the auditory nonwords condition, and in Experiment 2, we used the visual nonwords condition from the language localizer). Third, in Experiment 1, we additionally compared the responses to the music conditions against the response to non-linguistic, non-music stimuli (animal and environmental sounds). A brain region that supports music processing should elicit a strong positive response relative to the fixation baseline and the nonwords condition (our baseline for the language regions); further, if the response is selective, it should be stronger than the response elicited by non-music auditory stimuli. Finally, in Experiment 1, we also directly compared the responses to songs vs. lyrics. A brain region that responds to music should respond more strongly to songs given that they contain a melodic contour in addition to the linguistic content.
None of the music conditions elicited a strong response in the language network (Figure 3; Table 3). The responses to music (i) fell at or below the fixation baseline (except for the well-formed melodies condition in Experiment 2, which elicited a small positive response in some regions), (ii) were lower than the response elicited by auditory nonwords (except for the LMFG language fROI, where the responses to music and nonwords were similarly low), and (iii) did not significantly differ from the responses elicited by non-linguistic, non-music conditions. Finally, the response to songs, which contain both linguistic content and a melodic contour, was not significantly higher than the response elicited by the linguistic content alone (lyrics); in fact, at the network level, the response to songs was reliably lower than to lyrics.
Is the language network sensitive to structure in music?
Experiments 1 and 2 (fMRI)
Because most prior claims about the overlap between language and music concern the processing of structure—given the parallels that can be drawn between the syntactic structure of language and the tonal and rhythmic structure in music (e.g., Lerdahl and Jackendoff 1977, 1983; cf. Jackendoff 2009)—we used three contrasts to test whether language regions are sensitive to music structure. First and second, in Experiment 1, we compared the responses to synthetic melodies vs. their scrambled counterparts, and to synthetic drum music vs. the scrambled drum music condition. The former targets both tonal and rhythmic structure, and the latter selectively targets rhythmic structure. The reason to examine rhythmic structure is that some patient studies have argued that pitch contour processing relies on the right hemisphere, and rhythm processing draws on the left hemisphere (e.g., Zatorre 1984; Peretz 1990; Alcock et al. 2000; cf. Boebinger 2021 for fMRI evidence of bilateral responses in high-level auditory areas to both tonal and rhythmic structure processing and for lack of spatial segregation between the two), so although most prior work examining the language-music relationship has focused on tonal structure, rhythmic structure may a priori be more likely to overlap with linguistic syntactic structure given their alleged co-lateralization based on the patient literature. And third, in Experiment 2, we compared the responses to well-formed melodies vs. melodies with a sour note. A brain region that responds to structure in music should respond more strongly to intact than scrambled music (similar to how language regions respond more strongly to sentences than lists of words; e.g., Fedorenko et al. 2010; Diachek, Blank, Siegelman et al. 2020), and also exhibit sensitivity to structure violations (similar to how language regions respond more strongly to sentences that contain grammatical errors: e.g., Embick et al. 2000; Newman et al. 2001; Kuperberg et al. 2003; Cooke et al. 2006; Friederici et al. 2010; Herrmann et al. 2012; Fedorenko et al. 2020). Note that given the lack of a strong and consistent response to music in the language regions (Figure 3 and Table 3), the answer to this narrower question is somewhat of a foregone conclusion: even if one or more of the language regions showed a reliable effect in these music-structure-probing contrasts, such effects would be difficult to interpret as reflecting music structure processing given that structured music stimuli elicit a response approximately at the level of the fixation baseline in the language areas. Nevertheless, we report the results for these three contrasts for completeness, and because most prior studies have focused on such contrasts.
The language regions did not show consistent sensitivity to structural manipulations in music (Figure 4; Table 4). In Experiment 1, the responses to synthetic melodies did not significantly differ from (or were weaker than) the responses to the scrambled counterparts, and the responses to synthetic drum music did not significantly differ from the responses to scrambled drum music. In Experiment 2, at the network level, we observed a small and weakly significant (p<0.05) effect of sour-note > well-formed melodies. This effect was not significant in any of the five individual fROIs (even prior to the FDR correction).
Experiment 3 (behavioral)
In Experiment 3, we further asked whether individuals with severe deficits in processing linguistic syntax also exhibit difficulties in processing music structure. To do so, we assessed participants’ ability to discriminate well-formed (“good”) melodies from melodies with a sour note (“bad”), while controlling for their response bias (how likely they are overall to say that something is well-formed) by computing d’ for each participant (Green and Swets 1966), in addition to proportion correct. We then compared the d’ values of each individual with aphasia to the distribution of d’ values of healthy control participants using a Bayesian test for single case assessment (Crawford and Garthwaite 2007) as implemented in the psycho package in R (Makowski 2018). (Note that for the linguistic syntax tasks, it was not necessary to conduct statistical tests comparing the performance of each individual with aphasia to the control distribution because the performance of each individual with aphasia was lower than 100% of the control participants’ performances.) We similarly compared the proportion correct on the MBEA scale task of each individual with aphasia to the distribution of accuracies of healthy controls. If linguistic and music syntax draw on the same resources, then individuals with linguistic syntactic impairments should also exhibit deficits on tasks requiring the processing of music syntax.
In the critical music task, where participants were asked to judge the well-formedness of musical structure, neurotypical control participants responded correctly, on average, on 87.1% of trials, suggesting that the task was sufficiently difficult to preclude ceiling effects. Patients with severe aphasia showed intact sensitivity to music structure. The three patients had accuracies of 89.4% (PR), 94.4% (SA), and 97.8% (PP), falling on the higher end of the controls’ performance range (Figure 5; Table 5). Crucially, none of the three aphasic participants’ d’ scores were lower than the average control participants’ d’ scores (M = 2.75, SD = 0.75). In fact, the patients’ d’ scores were high: SA’s d’ was 3.51, higher than 83.91% (95% Credible Interval (CI) [75.20, 92.03]) of the control population, PR’s d’ was 3.09, higher than 67.26% (95% CI [56.60, 78.03]) of the control population, and PP’s d’ was 3.99, higher than 94.55% (95% CI [89.40, 98.57]) of the control population. None of the three aphasic participants’ bias/criterion c scores (Green and Swets 1966) differed reliably from the control participants’ c scores (M = -0.40, SD = 0.40). SA’s c was -0.53, lower than 62.34% (95% CI [50.40, 71.67]) of the control population, PR’s c was -0.74, lower than 79.48% (95% CI [69.58, 88.44]) of the control population, and PP’s c was-0.29, higher than 60.88% (95% CI [50.08, 70.04]) of the control population. In the Scale task from the Montreal Battery for the Evaluation of Aphasia, the control participants’ performance showed a similar distribution to that reported in Peretz et al. (2003). All participants with aphasia performed within the normal range, with two participants making no errors. PR and PP’s score was higher than 85.24% (95% CI [76.94, 93.06]) of the control population, providing a conceptual replication of the results from the well-formed/sour-note melody discrimination task. SA’s score was higher than 30.57% (95% CI [20.00, 41.50]) of the control population.
Does music elicit a response in the language network of native speakers of a tonal language?
The above analyses focus on the language network’s responses to music stimuli and its sensitivity to music structure in English native speakers. However, some have argued that responses to music may differ in speakers of languages that use pitch to make lexical or grammatical distinctions (e.g., Deutsch et al. 2006, 2009; Bidelman et al. 2011; Creel et al. 2018; Ngo et al. 2016, Liu et al. 2021). In Experiment 4, we therefore tested whether language regions of Mandarin native speakers respond to music. Similar to Experiment 1, we compared the response to the music condition against a) the fixation baseline, b) the foreign language condition, and c) a non-linguistic, non-music condition (environmental sounds). A brain region that supports music processing should respond more strongly to music than the fixation baseline and the foreign condition; if the response is further selective, it should be stronger than the response elicited by environmental sounds.
Results from Mandarin native speakers replicated the results from Experiment 1: the music condition did not elicit a strong response in the language network (Figure 6; Table 6). Although the response to music was above the fixation baseline at the network level and in some fROIs, the response did not differ from (or was lower than) the responses elicited by an unfamiliar foreign language (Russian) and environmental sounds.
Discussion
We here tackled a much investigated but still debated question: do the brain regions of the language network support the processing of music, especially music structure? Across three fMRI experiments, we obtained a clear answer: the brain regions of the language network, which support the processing of linguistic syntax (e.g., Fedorenko et al. 2010, 2020; Pallier et al. 2011; Bautista and Wilson 2016; Blank et al. 2016), do not support music processing (see Table 7 for a summary of the results). We found overall low responses to music (including orchestral pieces, solo pieces played on different instruments, synthetic music, and vocal music) in the language brain regions (Figure 3; see Sueoka et al. 2022, for complementary evidence from the inter-subject correlation approach applied to a rich naturalistic music stimulus), including in speakers of a tonal language (Figure 6), and no consistent sensitivity to manipulations of music structure (Figure 4). We further found that the ability to make well-formedness judgments about the tonal structure of music was preserved in patients with severe aphasia who cannot make grammaticality judgments for sentences (Figure 5), although we acknowledge the possibility that general ability to detect unexpected events may have contributed to performance on the critical music-structure tasks (e.g., Bigand et al. 2014; Collins et al. 2014) and that additional controls would be needed to conclusively determine whether these patients have preserved music-structure processing abilities. Nevertheless, given the brain imaging results (summarized in Table 7), a critical role of the language system in music structure processing is unlikely.
Our findings align with a) prior neuropsychological patient evidence of language/music dissociations (e.g., Luria et al. 1965; Brust 1980; Marin 1982; Basso and Capitani 1985; Polk and Kertesz 1993; Peretz et al. 1994, 1997; Piccirilli et al. 2000; Peretz and Coltheart 2003; Slevc et al. 2016; Faroqi-Shah et al. 2020; Chiapetta et al. 2022) and with b) prior evidence that music is processed by music-selective areas in the auditory cortex (Norman-Haignere et al. (2015; see also Boebinger et al. 2021; see Peretz et al. 2015, for review and discussion). The latter, music-selective areas are strongly sensitive to the scrambling of music structure in stimuli like those used here in Experiment 1 (see also Fedorenko et al. 2012c, Boebinger 2021; see Mehr et al. 2019 for a priori reasons to expect the effects of tonal structure manipulations in music-selective brain regions). (We provide the responses of music-responsive areas to the conditions of Experiments 1 and 2 at: https://osf.io/68y7c/).) In contrast, our findings stand in sharp contrast to numerous reports arguing for shared structure processing mechanisms in the two domains, including specifically in the inferior frontal cortex, within ‘Broca’s area’ (e.g., Patel et al. 1998; Koelsch et al. 2000; Maess et al. 2001; Koelsch et al. 2002; Levitin and Menon 2003; see Kunert and Slevc 2015; LaCroix et al. 2016; Vuust et al. 2022 for reviews).
Below, we discuss several issues that are relevant for interpreting the current results and/or that these results inform, and outline some limitations of scope of our study.
1. Theoretical considerations about the language-music relationship
Why might we a priori think that the language network, or some of its components, may be important for processing music in general, or for processing music structure specifically? Similarities between language and music have long been noted and discussed. For example, as summarized in Jackendoff (2009; see also Patel 2008), both capacities are human-specific, involve the production of sound (though this is not always the case for language: cf. sign languages, or written language in literate societies), and have multiple culture-specific variants. Furthermore, language and music are intertwined in songs, which appear to be a cultural universal (e.g., Brown 1991; Nettl 2015; see Mehr et al. 2019 for empirical support; see Norman-Haignere et al. 2021 for evidence of neural selectivity for songs in the auditory cortex). However, Jackendoff (2009) notes that i) most cognitive capacities / mechanisms that have been argued to be common to language and music are not uniquely shared by language and music, and ii) language and music differ in several critical ways, and these differences are important to consider alongside potential similarities when theorizing about possible shared representations and computations.
To elaborate on the first point: the cognitive capacity that has perhaps received the most attention in discussions of cognitive and neural mechanisms that may be shared by language and music is the combinatorial capacity of the two domains (e.g., Riemann 1877, as cited in Swain 1995; Lindblom and Sundberg 1969; Fay 1971; Sundberg and Lindblom 1976; Lerdahl and Jackendoff 1977, 1983; Roads 1979; Krumhansl and Keil 1982). In particular, in language, words can be combined into complex hierarchical structures to form novel phrases and sentences, and in music, notes and chords can similarly be combined to form novel melodies. Further, in both domains, the combinatorial process is constrained by a set of conventions. However, this capacity can be observed, in some form, in many other domains, from visual processing, to math, to social cognition, to motor planning, to general reasoning. Similarly, other cognitive capacities that are necessary to process language and music—including a large long-term memory store for previously encountered elements and patterns, a working memory capacity needed to integrate information as it comes in, an ability to form expectations about upcoming elements, and an ability to engage in joint action—are important for information processing in other domains. An observation that some mental capacity is necessary for multiple domains is compatible with at least two architectures: one where the relevant capacity is implemented (perhaps in a similar way) in each relevant set of domain-specific circuits, and another where the relevant capacity is implemented in a centralized mechanism that all domains draw on (e.g., Fedorenko and Shain 2021). Those arguing for overlap between language and music processing advocate a version of the latter. Critically, any shared mechanism that language and music would draw on should also support information processing in other domains that require the relevant computation (see Section 3 below for arguments against this kind of architecture). (A possible exception, according to Jackendoff (2009), may be the fine-scale vocal motor control that is needed for speech and vocal music production (cf. sign language or instrumental music), but not any other behaviors, but this kind of ability is implemented outside of the core high-level language system, in the network of brain areas that support articulation (e.g., Basilakos et al. 2015; Guenter 2016).)
More importantly, aside from the similarities that have been noted between language and music, numerous differences characterize the two domains. Most notable are their different functions. Language enables humans to express propositional meanings, and thus to share thoughts with one another. The function of music has long been debated (e.g., Darwin 1871; Pinker 1994; see e.g., McDermott 2008 and Mehr et al. 2020, for a summary of key ideas), but most proposed functions have to do with emotional or affective processing, often with a social component1 (Jackendoff 2009; Savage et al. 2020). If function drives the organization of the brain (and biological systems more generally; e.g., Rueffler et al. 2012) by imposing particular computational demands on each domain (e.g., Mehr et al. 2020), these fundamentally different functions of language and music provide a theoretical reason to expect cognitive and neural separation between them. Besides, even the components of language and music that appear similar on the surface (e.g., combinatorial processing) differ in deep and important ways (e.g., Patel 2008; Jackendoff 2009; Slevc 2009; Temperley 2022).
2. Functional selectivity of the language network
The current results add to the growing body of evidence that the left-lateralized fronto-temporal brain network that supports language processing is highly selective for linguistic input (e.g., Fedorenko et al. 2011; Monti et al. 2009, 2012; Deen et al. 2015; Pritchett et al. 2018; Jouravlev et al. 2019; Ivanova et al. 2020, 2021; Benn, Ivanova et al. 2021; Liu et al. 2020; Deen and Freiwald 2021; Paunov et al. 2022; Sueoka et al. 2022; see Fedorenko and Blank 2020 for a review) and not critically needed for many forms of complex cognition (e.g., Lecours and Joanette 1980; Varley and Siegal 2000; Varley et al. 2005; Apperly et al. 2006; Woolgar et al. 2018; Ivanova et al. 2021; see Fedorenko and Varley 2016 for a review). Importantly, this selectivity holds across all components of the language network, including the parts that fall within ‘Broca’s area’ in the left inferior frontal gyrus. As discussed in the introduction, many claims about shared structure processing in language and music have focused specifically on Broca’s area (e.g., Patel 2003; Fadiga et al. 2009; Fitch and Martins 2014). The evidence presented here shows that the language-responsive parts of Broca’s area, which are robustly sensitive to linguistic syntactic manipulations (e.g., Just et al. 1996; Stromswold et al. 1996; Ben-Shachar et al. 2003; Caplan et al. 2008; Peelle et al. 2010; Blank et al. 2016; see e.g., Friederici 2011 and Hagoort and Indefrey 2014 for meta-analyses), do not respond when we listen to music and are not sensitive to structure in music. These results rule out the hypothesis that language and music processing rely on the same mechanism housed in Broca’s area.
It is also worth noting that the very premise of the latter hypothesis—of a special relationship between Broca’s area and the processing of linguistic syntax (e.g., Caramazza and Zurif 1976; Friederici 2018)—has been questioned and overturned. First, syntactic processing does not appear to be carried out focally, but is instead distributed across the entire language network, with all of its regions showing sensitivity to syntactic manipulations (e.g., Fedorenko et al. 2010, 2020; Pallier et al. 2011; Blank et al. 2016; Shain, Blank et al. 2020; Shain et al. 2022), and with damage to different components leading to similar syntactic comprehension deficits (e.g., Caplan et al. 1996; Dick et al. 2001; Wilson and Saygin 2004; Mesulam et al. 2014; Mesulam et al. 2015). And second, the language-responsive part of Broca’s area, like other parts of the language network, is sensitive to both syntactic processing and word meanings, and even sub-lexical structure (Fedorenko et al. 2010, 2012b, 2020; Regev et al. 2021; Shain et al. 2021). The lack of segregation between syntactic and lexico-semantic processing is in line with the idea of ‘lexicalized syntax’ where the conventions for how words can combine with one another are highly dependent on the particular lexical items (e.g., Goldberg 2002; Jackendoff 2002, 2007; Sag et al. 2003; Levin and Rappaport-Hovav 2005; Bybee 2010; Jackendoff and Audring 2020), and is contra the idea of combinatorial rules that are blind to the content/meaning of the to-be-combined elements (e.g., Chomsky 1965, 1995; Fodor 1983; Pinker and Prince 1988; Pinker 1991, 1999; Pallier et al. 2011).
3. Overlap in structure processing in language and music outside of the core language network?
We have here focused on the core fronto-temporal language network. Could structure processing in language and music draw on shared resources elsewhere in the brain? The prime candidate is the domain-general executive control, or Multiple Demand (MD), network (e.g., Duncan and Owen 2000; Duncan 2001, 2010; Assem et al. 2020), which supports functions like working memory and inhibitory control. Indeed, according to Patel’s Shared Structural Integration Resource Hypothesis (SSIRH; 2003, 2008, 2012), language and music draw on separate representations, stored in distinct cortical areas, but rely on the same working memory store to integrate incoming elements into evolving structures. Relatedly, Slevc et al. (2013; see Asano et al. 2021 for a related proposal) have argued that another executive resource—inhibitory control—may be required for structure processing in both language and music. Although it is certainly possible that some aspects of linguistic and/or musical processing would require domain-general executive resources, based on the available evidence from the domain of language, we would argue that any such engagement does not reflect the engagement of computations like syntactic structure building. In particular, Blank and Fedorenko (2017) found that activity in the brain regions of the domain-general MD network does not closely ‘track’ linguistic stimuli, as evidenced by low inter-subject correlations during the processing of linguistic input (see Paunov et al. 2021 and Sueoka et al. 2022 for replications). Further, Diachek, Blank, Siegelman et al. (2020) showed in a large-scale fMRI investigation that the MD network is not engaged during language processing in the absence of secondary task demands (cf. the core language network, which is relatively insensitive to task demands and responds robustly even during passive listening/reading). And Shain, Blank et al. (2020; also, Shain et al. 2022) have shown that the language network, but not the MD network, is sensitive to linguistic surprisal and working-memory integration costs (see also Wehbe et al. 2021 for evidence that activity in the language, but not the MD, network reflects general incremental processing difficulty).
In tandem, this evidence argues against the role of executive resources in core linguistic computations like those related to lexical access and combinatorial processing, including syntactic parsing and semantic composition (see also Hasson et al. 2015 and Dasgupta and Gershman 2021 for general arguments against the separation between memory and computation in the brain). Thus, although the contribution of executive resources to music processing deserves further investigation (cf. https://osf.io/68y7c/ for evidence of low responses of the MD network to the music conditions in the current study), any overlap within the executive system between linguistic and music processing cannot reflect core linguistic computations, as those seem to be carried out by the language network (see Fedorenko and Shain 2021, for a review). Functionally identifying the MD network in individual participants (e.g., Fedorenko et al. 2013; Shashidhara et al. 2019) is a powerful way to help interpret the observed effects of music manipulations as reflecting general executive demands (see Saxe et al. 2006, Blank et al. 2017 and Fedorenko 2021, for general discussions of greater interpretability of fMRI results obtained from the functional localization approach). Importantly, given the ubiquitous sensitivity of the MD network to cognitive demands, it is / will be important to rule out task demands, rather than stimulus processing, as the source of overlap between music and language processing in interpreting past studies and designing future ones.
4. Overlap between music processing and other aspects of speech / language
The current study investigated the role of the language network—which supports ‘high-level’ comprehension and production—in music processing. As a result, the claims we make are restricted to those aspects of language that are supported by this network. These include the processing of word meanings and combinatorial (syntactic and semantic) processing, but exclude speech perception, prosodic processing, higher-level discourse structure building, and at least some aspects of pragmatic reasoning. Some of these components of language (e.g., pragmatic reasoning) seem a priori unlikely to share resources with music. Others (e.g., speech perception) have been shown to robustly dissociate from music (Norman-Haignere et al. 2015; Overath et al. 2015; Kell et al. 2018; Boebinger et al. 2021). However, some components of speech and language may, and some do, draw on the same resources as aspects of music. For example, aspects of pitch perception have been argued to overlap between speech and music based on behavioral and neuropsychological evidence (e.g., Wong and Perrachione 2007; Perrachione et al. 2013; Patel et al. 2008b). Indeed, brain regions that selectively respond to different kinds of pitched sounds have been previously reported (Patterson et al. 2002; Penagos et al. 2004; Norman-Haignere et al. 2013, 2015). Some studies have also suggested that music training may improve general rapid auditory processing and pitch encoding that are important for speech perception and language comprehension (e.g., Overy 2003; Tallal and Gaab 2006; Wong et al. 2007), although at least some of these effects likely originate in the brainstem and subcortical auditory regions (e.g., Wong et al. 2007). Other aspects of high-level auditory perception, including aspects of rhythm, may turn out to overlap as well, and deserve further investigation (see Patel 2008, for a review).
We also have focused on Western tonal instrumental music here. In the future, it would be useful to extend these findings to more diverse kinds of music. That said, given that individuals are most sensitive to structure in music with which they have experience (e.g., Cuddy et al. 1981; Cohen 1982; Curtis and Barucha 2009), it seems unlikely that music from less familiar traditions would elicit a strong response in the language areas (see Boebinger 2021, for evidence that music-selective areas of the auditory cortex respond to culturally diverse music styles). Further, given that evolutionarily early forms of music were likely vocal (e.g., Trehub 2003; Mehr 2017), it would be useful to examine the responses of the language regions to vocal music without linguistic content, like humming or whistling. Based on preliminary unpublished data from our lab (available upon request), responses to such stimuli in the language areas appear low.
In conclusion, we have here provided extensive evidence against the role of the language network in music perception, including the processing of music structure. Although the relationship between music and aspects of speech and language will likely continue to generate interest in the research community, and aspects of speech and language other than those implemented in the core fronto-temporal language-selective network (Fedorenko et al. 2011; Fedorenko and Thompson-Schill 2014) may indeed share some processing resources with (aspects of) music, we hope that the current study helps bring clarity to the debate about structure processing in language and music.
Data availability
The datasets generated during and/or analyzed during the current study are available in the OSF repository: https://osf.io/68y7c/.
Code availability
Scripts for statistical analysis are available at: https://osf.io/68y7c/.
Author contributions
Conflict of interest
The authors declare no competing financial interests.
Acknowledgements
We would like to acknowledge the Athinoula A. Martinos Imaging Center at the McGovern Institute for Brain Research at MIT, and its support team (Steve Shannon and Atsushi Takahashi). We thank former and current EvLab members for their help with fMRI data collection (especially Meilin Zhan for help with Experiment 4). We thank Josh McDermott for input on many aspects of this work, Jason Rosenberg for composing the melodies used in Experiments 2 and 3, and Zuzanna Balewski for help with creating the final materials used in Experiments 2 and 3. For Experiment 3, we thank Vitor Zimmerer for help with creating the grammaticality judgment task, Ted Gibson for help with collecting the control data, and Anya Ivanova for help with Figure 2. For Experiment 4, we thank Anne Cutler, Peter Graff, Morris Alper, Xiaoming Wang, Taibo Li, Terri Scott, Jeanne Gallée, and Lauren Clemens for help with constructing and/or recording and/or editing the language materials, and Fatemeh Khalilifar, Caitlyn Hoeflin, and Walid Bendris for help with selecting the music materials and with the experimental script. Finally, we thank the audience at the Society for Neuroscience conference (2014), the Neurobiology of Language conference (virtual edition, 2020), Ray Jackendoff, Dana Boebinger, and members of the Fedorenko and Gibson labs for helpful comments and discussions. RR was supported by NIH award F32-DC-015163. SNH was supported by a graduate NSF award, as well as postdoctoral awards from the HHMI / Life Sciences research foundation and a K99/R00 award from the NIH (1K99DC018051-01A1). SMM was supported by a La Caixa fellowship LCF/BQ/AA17/11610043. RV was supported by Alzheimer’s Society and The Stroke Association. EF was supported by the R00 award HD057522, R01 awards DC016607, DC016950, and NS121471, by the Paul and Lilah Newton Brain Science Award, and funds from the Brain and Cognitive Sciences department, the McGovern Institute for Brain Research, and the Simons Center for the Social Brain.
Footnotes
↵† Co-senior authors
Clarified a few ideas in the introduction and discussion and added a few additional supplementary analyses.
↵1 Although some have discussed the notions of ‘meaning’ in music (e.g., Meyer 1961; Raffman 1993; Cross and Tolbert 2009; Koelsch 2001), it is uncontroversial that music cannot be used to express propositional thought (for discussion, see Patel 2008; Jackendoff 2009; Slevc 2009).