The different brain areas occupied for integrating information of hierarchical linguistic units: a study based on EEG and TMS

Human linguistic units are hierarchical, and our brain responds differently when processing linguistic units during sentence comprehension, especially when the modality of the received signal is different (auditory, visual, or audio-visual). However, it is unclear how the brain processes and integrates language information at different linguistic units (words, phrases, and sentences) provided simultaneously in audio and visual modalities. To address this issue, we presented participants with sequences of short Chinese sentences through auditory, visual, or combined audio-visual modalities, while electroencephalographic responses were recorded. With a frequency tagging approach, we analyzed the neural representations of basic linguistic units (i.e., characters/monosyllabic words) and higher-level linguistic structures (i.e., phrases and sentences) across the three modalities separately. We found that audio-visual integration occurs at all linguistic units, and the brain areas involved in the integration varied across different linguistic levels. In particular, the integration of sentences activated the local left prefrontal area. Therefore, we used continuous theta-burst stimulation (cTBS) to verify that the left prefrontal cortex plays a vital role in the audio-visual integration of sentence information. Our findings suggest an advantage of bimodal language comprehension at hierarchical stages in language-related information processing and provide evidence for the causal role of the left prefrontal regions in processing information of audio-visual sentences.


Introduction
Language allows us to communicate ideas, feelings, and needs and is the primary mark that distinguishes us from other species [1,2]. Since human language is generally multisensory, it is a significant challenge to clarify how continuous language is perceived and integrated to build the meaning of sentences. Language comprehension involves integrating multisensory information (typically from auditory and visual modalities) to access meaning. Multisensory integration, defined as brain reactivity in response to the combination of signals from different modalities, is dynamic and context-dependent [3,4]. To understand continuous speech, listeners must construct linguistic structures at different hierarchical levels, including words, phrases, and sentences. Understanding naturally connected sentences depends on interconnections between word, phrase, and sentence processing. The three levels of linguistic units differ in terms of their functions in communication. It is generally believed that the 'sentence' is the basic unit of speech communication, while the 'phrase' and the 'word' are standby units of language communication [5,6]. However, there is no neural processing evidence to support this theory. Thus, an intriguing question is whether combined audio-visual presentation can enhance the information processing of different linguistic units in naturally connected sentences.
An increasing body of studies has explored the mechanisms underlying audio-visual integration of letters and speech sounds [7][8][9] and consistently demonstrated the superiority of audio-visual integration over processing letters and speech sounds separately [10,11]. For example, a magnetoencephalography (MEG) study showed enhanced brain activity predominantly in the right temporal-occipital-parietal junction and the left and right superior temporal sulci for audio-visual integration of phonemes and graphemes [12]. In that study, the integration resulted in a reduced response to audio-visual (AV) stimuli in comparison with the summed responses to unimodal auditory (A) and visual (V) stimuli (i.e., AV < sumAV), which was interpreted as a suppressive interaction. Thus, audio-visual integration can also be examined by comparing brain responses evoked by audio-visual stimuli with the sum of responses to unimodal stimuli. If synchronous auditory and visual stimuli were processed independently, then the neural responses induced by an audio-visual stimulus should be approximately equal to the sum of the responses to unimodal stimuli presented separately. However, if the bimodal response differs in a supra-additive or sub-additive manner from the sum of the two unimodal responses, this is attributed to an interaction between the two modalities that integrates the information. Atteveldt et al. used functional magnetic resonance imaging (fMRI) to investigate the integration of letters and speech and found that the simultaneous presentation of auditory and visual stimuli modulated related activity in the superior temporal sulcus [7].
Since most studies concerning audio-visual integration have focused on single letter-speech sound mapping in alphabetic languages, audio-visual integration at the sentence level remains unexplored. In essence, language information is mostly conveyed at the sentence level, and the brain uses a more complicated scheme to process language-related information at the sentence level than at the two lower levels. This research modified the experimental paradigm of Ding et al. [13] to study the audio-visual integration mechanism of different language units.
To understand connected language, however, humans have to learn to construct a hierarchy of linguistic structures, including words/syllables, phrases, and sentences [14]. Cortical activity is synchronized with the acoustic features of speech approximately at the syllabic rate, which provides an initial timescale for speech processing, as well as for possibilities to explore its potential mechanism [15][16][17][18][19].
Sheng et al. compared neural activity synchronized with syllabic, phrasal, and sentential linguistic units in the frequency domain [20]. The superior temporal gyrus was found to be involved in the processing of the three linguistic units, while the activity in the motor cortex was associated with the processing of the rhythm of monosyllabic words, and both the left anterior temporal cortex and left inferior frontal gyrus were involved in the processing of phrases or sentences [13,20].
Chinese is a logographic language comprising both auditory (syllabic) and visual (graphemic) characters. When perceiving Chinese characters, Chinese speakers usually integrate multisensorial information, that is, visual and auditory features, and construct a hierarchy of different linguistic structures, including words, phrases, and sentences [12,13,20]. Yet, it remains an open question whether audiovisual integration is superior over single auditory or visual processing when the brain simultaneously handles the three linguistic structures at different timescales [21][22][23][24].
Therefore, in the present study, we hypothesized that audio-visual integration outperforms unimodal processing in logographic languages. To test this hypothesis, electroencephalographic (EEG) responses at different timescales [25,26] were collected in a hierarchical linguistic sequence paradigm (Fig 1). To identify the potential differences, we further compared participants' behavior and multiple aspects of brain responses, including the spectral (frequency) [13,20] and time (event-related potentials (ERPs)) domains, as well as the brain networks involved, between the auditory, visual, and audio-visual conditions (Fig 2). Based on these results, the brain areas specifically involved in language information integration were identified. To validate the role of those brain areas in information integration, we enrolled another independent group in a second experiment consisting of sham stimulation and cTBS (Fig 3). Unlike the actual stimulation exerted on the brain site by cTBS, sham stimulation only places the transcranial magnetic stimulation (TMS) device on the brain site and does not deliver effective stimulation. After the stimulation, the participants performed the task following the protocol of Experiment 1 while EEG and behavioral responses were recorded, aiming to probe whether the participants' capability to understand language is influenced when the critical brain areas for language information integration are modulated.

Power spectrum
The monosyllabic words, phrases, and sentences were presented every 0.25 s, 0.5 s, and 1 s, respectively. As illustrated in Fig 5, the frequency of brain responses to each linguistic unit was synchronously tagged, that is, 4 Hz for words, 2 Hz for phrases, and 1 Hz for sentences. Concerning the power spectrum at the electrode Cz,
The grand-averaged power spectrum at the electrode Cz in the auditory, visual, and audio-visual conditions. Neural tracking of syllabic, phrasal and sentential rhythms was reflected by spectral peaks at corresponding frequencies.
As seen in Fig 6, there were significant scalp topography differences in the power for the three hierarchical units among auditory, visual, and audio-visual conditions. For word processing, significant differences between the audio-visual and visual conditions (Fig 6A, left panel)      showing reduced P200 during audio-visual integration (t = -2.72, p < 0.012). In terms of the scalp distribution of AV minus sumAV difference for the P200, in the left frontal area and central scalp region, the amplitude of P200 was smaller under AV than sumAV (see the right panel of Fig 8).

Patterns of brain network in different conditions
The identified differences of the network architectures between the audio-visual condition and the unimodal visual condition (p < 0.01, paired t-tests; Fig 9) showed enhanced linkages in the audio-visual condition over widely distributed scalp areas.
The linkage enhancements were left-hemisphere dominant, especially in relation to the unimodal auditory condition. No decreased linkages in the audio-visual condition compared with the unimodal condition and no significant linkage differences between the visual and auditory conditions were observed.
The difference network topologies between the bimodal and unimodal conditions. The blue lines denote the edges with statistically stronger (p < 0.01; paired t-tests) linkages in the audio-visual condition than in the visual or auditory condition.

Experiment 2
Experiment 1 identified two nodes (i.e., AF7 and FC5, revealed by power spectrum analysis) that played a critical role in the audio-visual integration of sentential structure. To test whether behavioral and electrophysiological responses would change when these hub nodes were modulated, we conducted Experiment 2, in which TMS was employed to regulate the activity at the nodes concerned. Previous evidence has shown that when TMS is applied to AF7, the stimulation modulates the activity of the left prefrontal region [39]. There is also research reporting that the function of the precentral gyrus was affected by applying TMS to FC5 [40]. Therefore, we expected that changes in both behavior and EEG responses to audio-visual stimuli would be observed when the activity of the two nodes was suppressed by TMS, and that these changes would be mainly attributable to the disturbance of audio-visual information integration for sentential structure.

Behavioral differences
In terms of ACC, the interaction between TMS condition and modality condition was significant (F (2,28)

ERPs results
The ANOVA on P200 amplitudes showed a marginally significant interaction between TMS and modality conditions (F (2,28)

Discussion
Is the combination of auditory and visual inputs more conducive to language processing than unimodal inputs alone? In the present study, we addressed this question in an experiment combining auditory (syllabic) and visual (graphemic) presentations of different hierarchical linguistic units of Chinese, given that understanding how the different levels of linguistic units are represented in the brain is the key to clarifying the neural basis of language comprehension [13]. To this end, the present study first investigated the possible advantages of audio-visual integration for language-related information processing and probed the specific brain areas involved in this integration for the different hierarchical linguistic units, in terms of behavioral performance, EEG-based power spectrum, ERPs, and functional brain networks. Second, TMS was used to suppress the activity of two hub nodes (AF7 and FC5) identified in the power spectrum analysis, aiming to validate the role of these two brain sites during the audio-visual integration of sentential structures.
In Experiment 1, participants responded more accurately and faster to audio-visual stimuli than they did in either of the unimodal conditions, indicating that spoken syllables and their orthographic information were successfully integrated and facilitated linguistic processing in the audio-visual condition. Motivated by the close relationship between electrophysiological activity and behavior in previous studies [25,41], we further probed how syllabic and glyphic information is integrated in the brain.
We observed power spectrum peaks at 1 Hz, 2 Hz, and 4 Hz, consistent with previous studies [13,20] and corresponding to the rates of the sentence, phrase, and monosyllabic word presentations, respectively. While previous studies found information integration of language for monosyllabic words and phrases [21], audio-visual integration at the sentence level has remained unexplored. The current results provide evidence in this respect, showing the most robust responses at 1 Hz, 2 Hz, and 4 Hz in the audio-visual condition, which suggests that audio-visual integration exists not only for syllables (monosyllabic words) and phrases but also for higher, sentence-level linguistic processing.
Though information integration was observed for all three linguistic units, the scalp topography results showed that the different linguistic units involved different brain areas for the information integration. Specifically, in the audio-visual condition as compared to the unimodal conditions, the processing of words led to significantly stronger activation in the parietal areas. This is in line with a previous MEG study in Finnish school children that emphasizes the crucial role of the parieto-temporal cortex in the early phase of reading [9]. The parietal areas may be involved in early audio-visual integration [12,42,43]. For phrasal processing, audio-visual integration recruited the left prefrontal and bilateral parietal areas. The literature has shown that the left prefrontal area is primarily engaged in the processing of basic syntactic/semantic combinations [43]. Our result further supports the involvement of syntactic processing in phrases under the audio-visual modality. The topographic activation for sentences showed stronger responses in the left prefrontal area under the audio-visual relative to the unimodal conditions. This region has not previously been found to be engaged in audio-visual integration of other types of linguistic units, including letter-speech bimodal integration, and it may therefore be a specific brain area for audio-visual integration of language information [44][45][46][47][48]. The area with increased activation is part of Broca's area. This is also consistent with the previous finding that the syntactic and semantic processing of sentences is associated with Broca's area [23,[49][50][51]. Taken together, while audio-visual integration occurs at all linguistic levels, the brain areas involved in the integration differ across the levels.
Intriguingly, the processing of words and phrases showed some overlap in terms of brain activation, while sentential processing is rather different compared with them.
The results also indicate different degrees of hemispheric lateralization when integrating information in different hierarchical linguistic units. In detail, the integration of basic syllabic processing at the pre/post-central areas, potentially a motor-sensory network, showed no hemispheric lateralization, which is consistent with previous studies reporting that the audio-visual integration of syllables shows no significant hemispheric laterality [12]. However, for higher-level linguistic units, such as phrases and sentences, stronger activation due to information integration was observed in the left frontal and parietal areas, which is in line with previous findings that syntactic and semantic information is mainly processed in the left prefrontal and the left parietal areas [50,52].
In terms of ERPs, the audio-visual condition showed larger amplitudes of the P200 as compared to the unimodal conditions. The P200 component, in general, is known to reflect the allocation of attention [53,54]. When the brain processes audiovisual stimuli, both visual and auditory attention resources are needed [42].
Therefore, more attention may be allocated to the audio-visual modality relative to unimodal stimuli here, leading to enhanced amplitudes of the P200. The P200 has also been shown to be sensitive to semantic priming [55,56]. Semantic priming may indeed require the conscious linking of related representations. Such a mechanism would be superfluous for cross-modal repetition priming since the visual glyphs and auditory speech presumably have the same semantic representation. It is likely that audio-visual integration in language processing requires more cognitive resources than unimodal processing to facilitate semantic/syntactic understanding [42,57,58] and thus produced larger P200 amplitudes. In agreement with previous evidence for suppressive interaction in the auditory and visual processing of audio-visual stimuli [12], the P200 responses to audio-visual stimuli in the present study were smaller than the sum of ERPs elicited by both types of unimodal stimuli (sumAV > AV). This sub-additive effect may reflect the facilitation of auditory and visual processing due to the audio-visual presentation of the same stimulus. In the topographic AV-sumAV map for the P200, the AV < sumAV areas were predominantly over the left frontal and central regions. This left hemisphere dominance in scalp distribution might be related to linguistic audio-visual integration.
When children learn to read, written and spoken words are often presented together and neural pathways that enable them to memorize and retrieve audio-visual associations are formed. Consequently, the audio-visual suppression effect can be interpreted as optimization of neural networks during learning [12].
In addition to local brain activity reflected by power spectrum responses, we conducted a brain network analysis, which can characterize how information is propagated among the related brain areas. Consistent with the increased activity for audio-visual stimuli, we identified enhanced network patterns in the audio-visual relative to the unimodal conditions; that is, simultaneous integration of visual and auditory information recruited more brain sources responsible for language-related information processing. The enhanced linkages under the audio-visual condition were also lateralized to the left hemisphere, which is consistent with the lateralization observed for the power spectrum. Although network patterns were enhanced for the audio-visual stimuli as compared to both visual and auditory stimuli, the pattern of enhancement differed depending on which of the two single modalities the bimodal condition was compared with. The differences between the audio-visual and visual stimuli were more pronounced than the differences between the audio-visual and auditory stimuli. This suggests that when language information is presented visually, it is processed less efficiently in the brain and evokes a less efficient network, whereas the auditory stimulus evokes a more efficient network for information processing.
Behaviorally, the visual condition was associated with the lowest accuracy, which may indicate that the visual stimulus is less effective for language processing. In the visual modality, the pronunciation of words is first activated from the visual glyphs, and the meaning is then accessed from the pronunciation; the visual modality is therefore less efficient than the auditory modality, which allows the linguistic information to be processed directly from pronunciation to meaning [59]. Moreover, the auditory modality evoked a stronger P200 than the visual modality, as shown in Fig 7, which suggests that the auditory modality may recruit more brain resources and thus promote processing. This superiority of the auditory over the visual modality for language processing may be attributed to the usual course of language acquisition, in which an infant initially receives language through hearing [60], and to the auditory nature of language in communication [61].
Experiment 2 investigated the causal role of the left prefrontal regions in audio-visual integration during sentence processing. Consistent with our hypothesis, the inhibition of the left prefrontal region by cTBS changed the behavioral and electrophysiological responses. When the cortical excitability of these areas was decreased via TMS, a significant increase in RT and a decrease in ACC were observed, indicating a reduction in audio-visual integration. Following the cTBS intervention (over the electrodes AF7 and FC5), the P200 and power spectrum were significantly reduced compared with the sham condition. Given that the P200 has been associated with the allocation of attention resources and semantic priming [53,55], smaller P200 amplitudes following cTBS may suggest that the stimulation reduces the cognitive resources that audio-visual integration in language processing demands and thus undermines semantic judgment. When TMS was applied to AF7, there was a significant drop in the sentence-level power spectrum (1 Hz) in the audio-visual modality compared with the sham condition. Although the word-level (4 Hz) and phrase-level (2 Hz) power spectra decreased, there was no significant TMS effect at these levels under the audio-visual condition. Moreover, for the electrode FC5, there was no statistical difference in the sentence-level (1 Hz) power spectrum under the audio-visual modality compared to the sham condition. These results suggest that the left prefrontal cortex is responsible for audio-visual integration at the sentence level.

Conclusion
The ability to read is a major landmark in human cognitive development. Cognitive scientists and psycholinguists have maintained that learning to read requires skills in the orthographic, phonological, and semantic facets of printed words. Therefore, learning to read involves the integration of multisensory (syllabic and graphemic) information to access meaning. When children start understanding written language, they typically learn to associate the sounds of their spoken language with unfamiliar characters in the logographic language and finally access the meaning of visual glyphs. The results of Experiment 1, based on multiple measurements, consistently identified the superiority of audio-visual integration over a single auditory or visual modality in hierarchical linguistic units. Higher ACC and shorter RT, concurrent with enhanced EEG power spectra and their topological distributions, larger P200 amplitudes, the difference between sumAV and AV for the P200, and more network linkages, suggested that the audio-visual modality could facilitate the semantic/syntactic understanding of Chinese. In Experiment 2, after TMS was used to suppress the activity of the nodes obtained by power spectrum analysis, we observed significant changes in behavioral responses and electrophysiological indices of audio-visual integration. Overall, our results suggest that audio-visual integration, as compared to unimodal processing, produces an advantage in processing hierarchical linguistic units in Chinese via the left prefrontal cortex, and that listening and reading at the same time may be an effective way to learn and understand language, especially logographic languages like Chinese. Given that learning to integrate visual glyphs and pronunciation is crucial to acquiring the ability to read a language, our results may have important implications for the acquisition of reading skills.

Experiment 1 Participants
The Institution Research Ethics Board of the University of Electronic Science and Technology of China (UESTC) approved the experiment. We recruited 23 healthy, right-handed postgraduates (13 males, 10 females; age 24.09 ± 2.48 years) from the student population at the UESTC. They were all native Chinese speakers and had normal hearing and normal or corrected-to-normal visual acuity.
Participants had never used any psychoactive medication, and none had a personal or family history of psychiatric or neurological illnesses. Before the experiment, written informed consent was obtained from all participants after they fully understood the procedure.

Stimuli
The stimuli in the experiment included 50 short sentences composed of four Chinese monosyllabic words, with the first two syllables constituting a noun phrase and the last two syllables constituting a verb phrase [13]. In these sentences, noun and verb phrases were compatible. A 'normal' (standard) trial consisted of ten meaningful sentences. An abnormal trial consisted of eight meaningful sentences and two nonsense sentences derived from the meaningful sentences by reversing their subjects and predicates. For example, based on two compatible sentences, 轮船起航 (cruise ships set sail) and 青草发芽 (green grass grew bud), two nonsense sentences were made: 轮船发芽 (cruise ships grew bud) and 青草起航 (green grass set sail).
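The construction of the nonsense sentences can be sketched as follows. This is a minimal illustration, assuming each sentence is stored as a (noun phrase, verb phrase) pair; the function name `swap_predicates` and the pairing of two particular sentences are hypothetical and not taken from the original stimulus scripts.

```python
def swap_predicates(sentence_a, sentence_b):
    """Exchange the verb phrases of two compatible sentences.

    Each sentence is a (noun_phrase, verb_phrase) tuple of two-character
    strings. Returns the two resulting nonsense sentences.
    """
    (np_a, vp_a), (np_b, vp_b) = sentence_a, sentence_b
    return np_a + vp_b, np_b + vp_a


# Example sentences taken from the text above.
ship = ("轮船", "起航")    # cruise ships set sail
grass = ("青草", "发芽")   # green grass grew bud

nonsense = swap_predicates(ship, grass)
print(nonsense)
```

Each output string still contains four monosyllabic words, so the 4 Hz word rhythm is preserved while the noun and verb phrases no longer match.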
All sentences were provided in three modalities: auditory, visual, and audio-visual.
The duration of each syllable in the auditory sentences was adjusted to 250 ms and the gap between adjacent syllables was removed to avoid a potential contribution of speech rate or any other prosodic cue to linguistic structure building. Thus, the presentation rate was 4 Hz, which is close to the mean syllable rate in natural speech across languages [13,20,27]. The visual sentences consisted of four sequentially presented Chinese characters. The stimuli of spoken syllables and corresponding characters in the audio-visual condition were synchronously delivered. The speech inputs were delivered binaurally via headphones, and their intensity was about 65 dB SPL. The written characters (size: ca. 25.31 mm × 25.31 mm) were delivered in white font on a black computer screen at a distance of about 70 cm in front of the participant.

Experimental procedures
Each participant was presented with sentences in the three modality conditions (i.e., auditory, visual, or audio-visual). In each condition, Chinese monosyllabic words were presented in sequences in random order or in an order forming the two-phrase sentences. The sentences of one modality were presented in one run, resulting in 3 runs. Each run consisted of 25 trials, including 20 standard trials in which ten compatible sentences were presented without acoustic gaps and 5 outlier trials in which two nonsense sentences were presented among the ten sentences. The order of sentences and trials within a run was randomized. Each trial lasted 12 s (Fig 1), starting with the presentation of a fixation cross for 600 ms. Following the fixation, ten two-phrase sentences were presented, each lasting 1000 ms. After the presentation of the sentences, participants were required to judge whether the trial was 'normal' (standard) or not. If the trial was 'abnormal' (outlier), participants were to press key "1" on a keyboard within 1200 ms, whereas if the trial was 'normal', participants did not need to press any button. After a 200 ms blank period, the next trial was initiated. Before the experiment, all participants completed a practice round to ensure that they understood the rules of the experiment. There was a 3-minute rest interval between two consecutive runs, and the order of runs (i.e., auditory, visual, or audio-visual) was randomized for each participant.
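As a quick consistency check of the timing described above, the trial duration can be recomputed from its parts. The sketch below assumes that the fixation, the ten sentences, the response window, and the blank period together make up one trial; the constant names are illustrative only.

```python
# Durations in milliseconds, as stated in the procedure above.
FIXATION_MS = 600
SENTENCE_MS = 1000
SENTENCES_PER_TRIAL = 10
RESPONSE_WINDOW_MS = 1200
BLANK_MS = 200

trial_ms = (FIXATION_MS
            + SENTENCES_PER_TRIAL * SENTENCE_MS
            + RESPONSE_WINDOW_MS
            + BLANK_MS)

STANDARD_TRIALS = 20
OUTLIER_TRIALS = 5
trials_per_run = STANDARD_TRIALS + OUTLIER_TRIALS

# Task time per run, excluding instructions and rest intervals.
run_minutes = trials_per_run * trial_ms / 60_000

print(trial_ms, trials_per_run, run_minutes)
```

Under this assumption the parts add up to the stated 12 s per trial, and one run of 25 trials takes 5 minutes of task time.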

Task performance
Accuracy (ACC) and reaction time (RT) were recorded for each condition (auditory, visual, and audio-visual).
Both button presses to abnormal/outlier trials and omissions of button presses to normal/standard trials were regarded as correct responses.
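The scoring rule can be written down compactly. The following is a minimal sketch, assuming each trial is summarized by two booleans (whether the trial was abnormal and whether a key was pressed); the function names and the example responses are hypothetical.

```python
def is_correct(trial_is_abnormal, key_pressed):
    """A response is correct when a key press occurs exactly on abnormal trials."""
    return trial_is_abnormal == key_pressed


def accuracy(trials):
    """Proportion of correct responses over (trial_is_abnormal, key_pressed) pairs."""
    return sum(is_correct(a, k) for a, k in trials) / len(trials)


# Hypothetical run: 20 normal trials with one false alarm,
# and 5 abnormal trials with one missed detection.
run = ([(False, False)] * 19 + [(False, True)]
       + [(True, True)] * 4 + [(True, False)])

print(accuracy(run))
```

In this made-up run, 23 of the 25 responses count as correct.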

Data acquisition
Participants were seated comfortably in an electrically shielded, dimly lit room.
EEG data were recorded with 64 Ag/AgCl electrodes (ANT Neuro, Berlin, Germany), and all electrodes were positioned according to the extended 10-20 international electrode placement system (ASA-Lab Amplifier, eemagine Medical Imaging Solutions GmbH, Berlin, Germany). The online sampling rate was 500 Hz, and the data were band-pass filtered at 0.01-100 Hz. The electrodes CPz and AFz served as the reference and ground, respectively. To monitor eye movements, an electrooculogram was recorded using an additional electrode positioned above the left eye. During the entire task, the impedances of all electrodes were kept below 5 kΩ.

Data analysis
A series of procedures consisting of pre-processing, power spectrum calculation, ERPs extraction, and network analysis were implemented (Fig 2).

Pre-processing
Reference electrode standardization technique (REST) referencing (http://www.neuro.uestc.edu.cn/rest/) [28,29], offline band-pass filtering, data segmenting, and artifact-trial removal were included to pre-process the recorded EEG. Concerning the power spectrum analysis, [0.1, 10] Hz offline band-pass filtering, [-800, 0] ms baseline correction, and artifact-trial removal with a threshold of ± 75 μV were used. For each trial, the stimulus period was defined by ignoring the first second of the stimulus to avoid the transient response [20]. Thus, the length of the segmented data was 9 s.
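The segmentation and artifact-rejection steps can be illustrated with a short NumPy sketch. It assumes a trial is stored as a (channels, samples) array covering the 10 s stimulus period at 500 Hz, in microvolts; re-referencing, filtering, and baseline correction are omitted, and the function names are hypothetical.

```python
import numpy as np

FS = 500               # sampling rate in Hz (see Data acquisition)
STIM_S = 10            # ten 1-s sentences per trial
SKIP_S = 1             # first second discarded to avoid the transient response
THRESHOLD_UV = 75.0    # artifact threshold in microvolts


def segment_trial(trial):
    """Drop the first second of a (channels, samples) stimulus-period array."""
    return trial[:, SKIP_S * FS: STIM_S * FS]


def is_artifact_free(segment, threshold=THRESHOLD_UV):
    """True when no sample exceeds +/- threshold on any channel."""
    return bool(np.max(np.abs(segment)) <= threshold)


# Synthetic data for illustration only, not real EEG.
rng = np.random.default_rng(0)
trial = rng.normal(0.0, 5.0, size=(64, STIM_S * FS))
segment = segment_trial(trial)

print(segment.shape, segment.shape[1] / FS, is_artifact_free(segment))
```

With these settings each retained segment contains 4500 samples per channel, which corresponds to the 9 s stated above.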

Power spectrum
The discrete Fourier transform was applied to each artifact-free trial to calculate the power spectra per condition. Thereafter, for each modality, the corresponding power spectra were averaged across all trials to acquire the final response power [20].
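A minimal sketch of this computation is given below, assuming 9-s single-channel segments sampled at 500 Hz. The normalization of the power, the synthetic signals, and the function names are assumptions made for illustration and may differ from the original analysis scripts.

```python
import numpy as np

FS = 500          # sampling rate in Hz
DURATION_S = 9    # length of each segmented trial in seconds
TAGS_HZ = {"sentence": 1.0, "phrase": 2.0, "word": 4.0}


def trial_power(segment):
    """Power spectrum of a 1-D segment via the discrete Fourier transform."""
    spectrum = np.fft.rfft(segment)
    return np.abs(spectrum) ** 2 / segment.size


def mean_power(trials):
    """Average the single-trial power spectra across trials."""
    return np.mean([trial_power(t) for t in trials], axis=0)


n = FS * DURATION_S
t = np.arange(n) / FS
freqs = np.fft.rfftfreq(n, d=1 / FS)   # frequency resolution is 1/9 Hz

# Synthetic trials: sinusoids at the three tagged rates plus noise.
rng = np.random.default_rng(0)
trials = [sum(np.sin(2 * np.pi * f * t) for f in TAGS_HZ.values())
          + rng.normal(0, 1, n) for _ in range(20)]
power = mean_power(trials)

for unit, f in TAGS_HZ.items():
    k = int(round(f * DURATION_S))      # bin index of the tagged frequency
    print(unit, freqs[k], power[k] > 10 * np.median(power))
```

Because the segment length is a whole number of seconds, the 1 Hz, 2 Hz, and 4 Hz tags fall exactly on DFT bins (indices 9, 18, and 36), so the tagged responses are not smeared across neighboring frequencies.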

ERPs analysis
Following pre-processing (Fig 2), epochs ranging from -250 to 1000 ms relative to the onset of the final word in a sentence were averaged for each modality. Peak detection was performed automatically, time-locked to the latency of the peak at the electrode of maximal amplitude on the grand-average ERPs. Temporal windows for peak detection were determined based on variations of the global field power measured across the scalp [30]. The P200 was defined as the mean amplitude in the 250-350 ms time window following word onset at six electrodes over the frontocentral and central areas (i.e., FC1, FCz, FC2, Cz, C1, and C2), where the P200 is classically found and displays maximal sensitivity [31,32]. ERPs to separately delivered auditory (A) and visual (V) stimuli were summed, and this sum (sumAV) was compared with the ERPs to audio-visual stimuli (AV) to investigate whether audio-visual integration was reflected in the AV ERPs, i.e., as sub-additivity (AV < sumAV).
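The AV versus sumAV comparison can be sketched as follows. The example assumes each ERP is a 1-D array already averaged over trials and over the six electrodes, sampled at 500 Hz, with the epoch starting at -250 ms; the waveforms and amplitudes are synthetic and the function name is hypothetical.

```python
import numpy as np

FS = 500                     # sampling rate in Hz
EPOCH_START_MS = -250        # epoch onset relative to word onset
P200_WINDOW_MS = (250, 350)  # analysis window stated in the text


def mean_amplitude(erp, window_ms=P200_WINDOW_MS):
    """Mean of a 1-D ERP within window_ms (times relative to word onset)."""
    start = int((window_ms[0] - EPOCH_START_MS) * FS / 1000)
    stop = int((window_ms[1] - EPOCH_START_MS) * FS / 1000)
    return float(erp[start:stop].mean())


# Synthetic ERPs: a Gaussian bump around 300 ms with made-up amplitudes.
times_ms = EPOCH_START_MS + np.arange(625) * 1000 / FS
bump = np.exp(-((times_ms - 300) / 40) ** 2)
erp_a, erp_v, erp_av = 2.0 * bump, 1.5 * bump, 3.0 * bump

sum_av = erp_a + erp_v       # sumAV: sum of the two unimodal ERPs
p200_av = mean_amplitude(erp_av)
p200_sum = mean_amplitude(sum_av)

print(p200_av < p200_sum)    # sub-additive by construction in this toy example
```

In the actual analysis the per-participant values of AV and sumAV would then be compared statistically; here the inequality holds only because the synthetic amplitudes were chosen that way.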

EEG network
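We adopted the phase-locking value (PLV) [33,34], which captures the nonlinear phase synchronization between paired nodes, to construct the language-related network. To reduce the effect of volume conduction [35,36], we selected a sparse set of 21 canonical electrodes as network nodes. To estimate the instantaneous phases, i.e., $\phi_x(t)$ and $\phi_y(t)$, of two given time series, $x(t)$ and $y(t)$, the Hilbert transform (HT) is used to form the analytical signals as

$$S_x(t) = x(t) + i\,\tilde{x}(t), \qquad S_y(t) = y(t) + i\,\tilde{y}(t), \tag{1}$$

where $\tilde{x}(t)$ and $\tilde{y}(t)$ are the HT of the two time series, $x(t)$ and $y(t)$, which are defined as

$$\tilde{x}(t) = \frac{1}{\pi}\,\mathrm{CPV}\!\int_{-\infty}^{\infty}\frac{x(\tau)}{t-\tau}\,d\tau, \qquad \tilde{y}(t) = \frac{1}{\pi}\,\mathrm{CPV}\!\int_{-\infty}^{\infty}\frac{y(\tau)}{t-\tau}\,d\tau,$$

where CPV denotes the Cauchy principal value. Afterward, the corresponding analytical signal phases, $\phi_x(t)$ and $\phi_y(t)$, can be computed as

$$\phi_x(t) = \arctan\frac{\tilde{x}(t)}{x(t)}, \qquad \phi_y(t) = \arctan\frac{\tilde{y}(t)}{y(t)}.$$

Finally, the PLV is formulated as

$$\mathrm{PLV}_{xy} = \frac{1}{N}\left|\sum_{j=0}^{N-1} e^{\,i\left(\phi_x(j\Delta t) - \phi_y(j\Delta t)\right)}\right|,$$

where $\mathrm{PLV}_{xy}$ denotes the PLV between $x(t)$ and $y(t)$, $\Delta t$ denotes the sampling period, and $N$ denotes the sample number.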
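The PLV between two signals can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the authors' code: the analytic signal is obtained with the standard FFT-based method, the test signals are synthetic sinusoids, and the function names are hypothetical.

```python
import numpy as np


def analytic_signal(x):
    """Analytic signal x + i*HT(x), computed with the FFT method."""
    n = x.size
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0                       # keep the DC component
    if n % 2 == 0:
        weights[n // 2] = 1.0              # keep the Nyquist component
        weights[1:n // 2] = 2.0            # double the positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)


def plv(x, y):
    """Phase-locking value between two equally long 1-D signals."""
    phase_diff = np.angle(analytic_signal(x)) - np.angle(analytic_signal(y))
    return float(np.abs(np.mean(np.exp(1j * phase_diff))))


fs, duration_s = 500, 9
t = np.arange(fs * duration_s) / fs
x = np.sin(2 * np.pi * 6 * t)
y = np.sin(2 * np.pi * 6 * t + np.pi / 3)  # same frequency, constant phase lag
z = np.sin(2 * np.pi * 7 * t)              # different frequency

print(plv(x, y), plv(x, z))
```

A constant phase lag gives a PLV close to 1, whereas signals whose phase difference drifts through whole cycles give a PLV close to 0. Applying `plv` to every pair of the 21 selected electrodes would yield a symmetric 21 × 21 connectivity matrix. SciPy's `scipy.signal.hilbert` returns the same kind of analytic signal and could be used in place of the hand-written helper.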

Statistical analysis
Repeated-measure ANOVA with modality (auditory, visual, and audio-visual) as a within-participants variable and post-hoc tests (pairwise comparisons,  [37]. None of the participants had a history of neurological or psychiatric disorders, and none of them was currently using any psychoactive medications. The experimental protocol was approved by the Institution Research Ethics Board of the UESTC. Written informed consent was obtained after the procedure had been fully explained, prior to scanning. All participants were paid ￥ 100 per hour for their participation.

Stimuli
The language-related stimuli in Experiment 2 were the same as those in Experiment 1.

Experimental procedures
We used a 2 × 2 × 3 factorial design with three within-participants factors: TMS site (AF7 vs. FC5), TMS condition (effective vs. sham), and modality condition (auditory vs. visual vs. audio-visual). The experiment consisted of two site sessions (AF7 and FC5) that were performed with an inter-session interval of at least one week to avoid carry-over or learning effects, as depicted in Fig 3. Both sessions consisted of two blocks, separated by a 1 h break. The blocks differed with respect to the TMS condition (effective vs. sham). In each block, the tasks with stimuli in the three modalities (auditory, visual, and audio-visual) were performed in three runs, respectively. The participants were given a 3-minute rest interval between two consecutive runs, and the order of stimuli (i.e., auditory, visual, audio-visual) was randomized across participants. In each run, Chinese monosyllabic words were presented in sequences in random order or in an order forming the two-phrase sentences for the three modalities. Each run consisted of 25 trials, including 20 standard trials and 5 outlier trials. The order of sentences and trials within a run was randomized. Participants were given a test to ensure that the rules of the task were understood before the experiment.  [38]. The sham stimulation was delivered using the cTBS protocol with the coil positioned at a perpendicular angle to AF7 or FC5 in a counterbalanced manner across participants. After cTBS or sham stimulation, participants were asked to perform the task while EEG and behavioral responses were simultaneously recorded.
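The cells of this factorial design can be enumerated directly, which makes the number of runs per participant explicit. The sketch below only restates the design described above; the variable names are illustrative.

```python
from itertools import product

TMS_SITES = ("AF7", "FC5")
TMS_CONDITIONS = ("effective", "sham")
MODALITIES = ("auditory", "visual", "audio-visual")
TRIALS_PER_RUN = 25

# One run per combination of site, stimulation condition, and modality.
cells = list(product(TMS_SITES, TMS_CONDITIONS, MODALITIES))
trials_per_participant = len(cells) * TRIALS_PER_RUN

for site, condition, modality in cells:
    print(site, condition, modality)
print(len(cells), trials_per_participant)
```

Each participant therefore completes 12 runs in total, six per site session and three per block.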

Statistical analysis
RT, ACC, and P200 amplitudes were each subjected to three-way repeated-measures ANOVAs with TMS site (2 levels), TMS condition (2 levels), and modality condition (3 levels) as within-participants variables, and post-hoc tests were used to quantify differences between conditions. The power spectrum was analyzed using four-way repeated-measures ANOVAs with TMS site (2 levels), TMS condition (2 levels), modality condition (3 levels), and power spectrum frequency (3 levels) as within-participants variables, and post-hoc tests were used to quantify differences. The assumption of sphericity was assessed with Mauchly's test, and the Greenhouse-Geisser correction for non-sphericity was used to correct the p-values when required. Bonferroni correction was used for multiple pairwise comparisons. Based on our hypothesis, we focused on interactions between TMS and modality conditions.
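For the pairwise comparisons, the Bonferroni correction amounts to multiplying each uncorrected p-value by the number of comparisons. The sketch below shows this for the three modality pairs; the p-values are hypothetical placeholders and are not results from the study.

```python
from itertools import combinations


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


MODALITIES = ("auditory", "visual", "audio-visual")
pairs = list(combinations(MODALITIES, 2))   # three pairwise comparisons

raw_p = [0.001, 0.03, 0.4]                  # hypothetical uncorrected p-values
adjusted = dict(zip(pairs, bonferroni(raw_p)))

for pair, p in adjusted.items():
    print(pair, p)
```

With three comparisons, an uncorrected p-value must be below roughly 0.0167 to remain significant at the 0.05 level after correction.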

Conflict of interest
The authors declare that they have no conflict of interest.