An open-access EEG dataset for speech decoding: Exploring the role of articulation and coarticulation

Electroencephalography (EEG) holds promise for brain-computer interface (BCI) devices as a non-invasive measure of neural activity. With increased attention to EEG-based BCI systems, publicly available datasets that can represent the complex tasks required for naturalistic speech decoding are necessary to establish a common standard of performance within the BCI community. Effective solutions must overcome various kinds of noise in the EEG signal and remain reliable across sessions and subjects without overfitting to a specific dataset or task. We present two validated datasets (N=8 and N=16) for classification at the phoneme and word level and by the articulatory properties of phonemes. EEG signals were recorded from 64 channels while subjects listened to and repeated six consonants and five vowels. Individual phonemes were combined in different phonetic environments to produce coarticulated variation in forty consonant-vowel pairs, twenty real words, and twenty pseudowords. Phoneme pairs and words were presented during a control condition and during transcranial magnetic stimulation targeted to inhibit or augment the EEG signal associated with specific articulatory processes.


Figure 1. Session organization. (A) The experiment was conducted in three sessions held on separate days. Participant eligibility was confirmed in the first session. MRI scanning and TMS-EEG data collection were conducted independently. (B) Additional participants and trial types were included in 2021.

MRI scanning.
Scanning was conducted in the UCLA Center for Cognitive Neuroscience with a Siemens Prisma-FIT 3T scanner. Participants were provided with ear protectors and headphones for a 45 to 60 dB reduction of the noise associated with scanning, ensuring that participants could hear the stimuli clearly and that the noise level was not uncomfortably loud. Participants were asked to lie with their head motionless during all scanning procedures. High-resolution anatomical images were acquired, followed by a functional scan in which participants were directed either to relax passively while looking at a fixation cross or to perform the button-press phoneme discrimination task. Functional data were acquired in a block design with a BOLD-weighted echoplanar imaging sequence aligned parallel to the bicommissural plane, yielding 36 slices covering the whole brain, each 3 mm thick with a 1 mm gap between slices. Each slice was acquired as a 64 x 64 matrix, yielding an in-plane resolution of 1.5 x 1.5 mm. The total duration of the scanning session was 40 minutes.

The TMS-EEG procedure was conducted in the Neuromodulation Division of the Semel Institute for Neuroscience and Human Behavior. The appropriate stimulation intensity for TMS studies is determined on an individual basis 56. Prior to the experimental session, the resting motor threshold (rMT) of each participant was determined by eliciting motor-evoked potentials (MEPs) in the first dorsal interosseous (FDI) muscle of the dominant hand, at the minimum amount of stimulation needed to evoke an MEP in a hand muscle after a single pulse over M1. Single TMS pulses were delivered to locations in the motor cortex contralateral to the dominant hand. The intensity of the stimulation was gradually lowered until reaching a level of stimulator output at which 5 out of 10 MEPs in the hand muscle had an amplitude of at least 50 microvolts. Potentials evoked during TMS represent the net sum of excitatory and inhibitory stimulation effects 57-59. The literature has found that excitation increases at intensities of 110-120% rMT. In accordance with our reference study, stimulation was administered at 110% of the FDI rMT 54. A physician observed the motor thresholding procedure to ensure that no negative effects were incurred by participants.
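The rMT criterion described above (the lowest stimulator output at which 5 out of 10 MEPs still reach 50 microvolts) can be expressed as a short check. The helper names and the toy amplitude values below are illustrative only, not part of the acquisition software.

```python
# Sketch of the rMT criterion: at each intensity, 10 MEP amplitudes (uV)
# are recorded; the criterion is met when at least 5 of 10 reach 50 uV.
def meets_rmt_criterion(mep_amplitudes_uv, threshold_uv=50.0, required=5):
    """Return True if at least `required` MEPs reach `threshold_uv`."""
    return sum(a >= threshold_uv for a in mep_amplitudes_uv) >= required

def find_rmt(mep_by_intensity):
    """Walk intensities from high to low; the rMT is the lowest stimulator
    output at which the 5/10 criterion is still met."""
    rmt = None
    for intensity in sorted(mep_by_intensity, reverse=True):
        if meets_rmt_criterion(mep_by_intensity[intensity]):
            rmt = intensity
        else:
            break
    return rmt

# Toy amplitudes (uV) recorded at three hypothetical stimulator outputs.
meps = {
    60: [80, 75, 60, 55, 52, 40, 30, 65, 70, 45],  # 8/10 >= 50 uV
    55: [60, 55, 52, 50, 51, 30, 20, 40, 35, 25],  # 5/10 >= 50 uV
    50: [45, 30, 20, 55, 10, 25, 35, 40, 15, 5],   # 1/10 >= 50 uV
}
print(find_rmt(meps))  # 55: lowest intensity still meeting the criterion
```

Stimulation during the experiment was then delivered at 110% of this value.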

design of our reference study 54 to ensure an excitatory neural response that would translate into task facilitation. Each trial delivered paired TMS pulses at one of the stimulation targets, separated by a short interpulse interval (50 ms). Excitation of the cortical region not involved in stimulus production (i.e., TMS at LipM1 during alveolar phoneme presentation) results in neural noise that interferes with the perception task. The audio stimulus followed 50 ms after the second TMS pulse. One target was stimulated per run (counterbalanced across participants). Details of the experimental protocol are illustrated in Figure 4.
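The within-trial timing described above can be summarized programmatically. Placing the first pulse at 0 ms is an assumption made here for illustration; the event names are hypothetical, not the labels used in the data files.

```python
# Per-trial event timing: paired TMS pulses separated by a 50 ms
# interpulse interval, with the audio stimulus 50 ms after pulse 2.
def trial_events(ipi_ms=50, audio_delay_ms=50):
    """Return event times in ms, taking the first pulse as time zero."""
    first_pulse = 0
    second_pulse = first_pulse + ipi_ms
    audio_onset = second_pulse + audio_delay_ms
    return {"tms_pulse_1": first_pulse,
            "tms_pulse_2": second_pulse,
            "audio_onset": audio_onset}

print(trial_events())
# {'tms_pulse_1': 0, 'tms_pulse_2': 50, 'audio_onset': 100}
```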

Participants listened to audio clips immersed in 500 ms of white noise. The white noise created a mild background distraction for participants to ensure that they did not perform the phoneme discrimination task at ceiling. Participants were instructed to respond as quickly as possible with a button press after they had identified the phoneme. In the case of multiple button presses, correct trials were determined from the initial button press. Participants who exhibited a non-random response strategy (i.e., failure to select from the full set of phonemes) were excluded. In 2021, participants were instructed to listen to single phonemes without TMS and to repeat the sound they heard immediately after stimulus presentation (300 ms from trial onset).

Two lists of stimulus items were used, with one list assigned to each block. In 2019, the runs were split into two blocks: the first block presented CV pairs, followed by a block of VC pairs. In 2021, four blocks were administered per run: the first two blocks presented CV pairs, followed by two blocks of CVC stimuli (real and pseudowords). A five-minute break was provided between runs. Participants completed 120 trials in each run: 80 with TMS and 40 random catch trials.

Minimal modifications to this procedure were made during the intake and scanning sessions. For the initial assessment, half of the task was administered. During fMRI data collection, the full-length task was administered. However, stimuli were presented in blocks of the same type (bilabial, alveolar, real words, pseudowords) to aid in their cortical localization.

Classification of acquired data.

The task required sustained attention throughout a lengthy TMS procedure. The mean reaction time and standard deviation were calculated to confirm that participants were attentive to the task throughout the procedure. These metrics are documented in .csv files uploaded to the data repository. In the 2019 dataset, some variation in trial numbers is observed due to missed trials and rotation in the list of stimuli administered to each participant. No subjects performed less than 90% of the total list, with the exception of P04 in the VC condition with LipTMS; here, excluded trials resulted from missed trials. In the 2021 dataset, all trials were uploaded irrespective of a button-press response. Two subjects performed an abbreviated list of phoneme triplets, and one also performed an abbreviated list of single phonemes. The number of tagged trials is shown in Table 1.

Resampling and filtering.
The data were resampled to 256 Hz and two filters were applied, including a notch filter with cutoff frequencies at 59 Hz and 61 Hz. The recordings were then inspected for high-amplitude artifacts. To do so, the power spectral density (PSD) plots generated from the data were analyzed, in addition to the EEG recordings themselves. Using the EEGLAB interface, the unwanted segments were selected and removed from the analysis. After each round, ICA decomposition was applied using the library-adapted infomax ICA algorithm 65. During ICA, 35 components were inspected and those that clearly exhibited an artifact signal were removed.
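As a rough sketch of the resampling and notch-filtering steps, the following uses SciPy rather than the EEGLAB tooling described above, and a synthetic signal in place of the recordings; filter order and design are illustrative assumptions.

```python
import numpy as np
from scipy import signal

fs_in, fs_out = 2000, 256           # acquisition and target rates (Hz)

# Synthetic 2 s "recording": a 10 Hz component plus 60 Hz line noise.
t = np.arange(0, 2, 1 / fs_in)
x = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 60 * t)

# Resample to 256 Hz (resample_poly applies an anti-aliasing filter).
x_rs = signal.resample_poly(x, fs_out, fs_in)

# Band-stop ("notch") filter with cutoff frequencies at 59 and 61 Hz.
sos = signal.butter(4, [59, 61], btype="bandstop", fs=fs_out, output="sos")
x_clean = signal.sosfiltfilt(sos, x_rs)

# The 60 Hz power should drop sharply after the notch.
f, p_before = signal.welch(x_rs, fs=fs_out, nperseg=256)
_, p_after = signal.welch(x_clean, fs=fs_out, nperseg=256)
idx = np.argmin(np.abs(f - 60))
print(p_after[idx] < p_before[idx] * 0.01)  # True
```

Zero-phase filtering (`sosfiltfilt`) is used so that the notch does not shift event-locked latencies, which matters for the ERP analyses described below.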

All of the data files may be accessed at the Open Science Framework (OSF) repository. Files are grouped by the year in which the dataset was collected: 2019 data are labeled Study 1, and 2021 data are labeled Study 2. Each of these primary folders contains a second folder labeled "Data Descriptors", and within each dataset folder are dedicated subfolders for the raw data, processed data, and trial characteristics. The raw and processed data are grouped individually, with one subject per folder, and labeled according to Table 1.

Raw and pre-processed EEG data.

Raw EEG files were stored in the .cnt format. This format contains continuous EEG recordings saved over the EEG-TMS sessions. 66 channels were recorded, with electrode placement according to Figure 4. Pre-processed EEG data have also been made available in .set and .mat files, following the steps described in Figure 5.

Event timestamps and behavioral data.

For each trial, event timestamps are provided in .csv format, with one file for each recording session (Fig. 6). The events include (i) the second (final) TMS pulse of the pair, (ii) the sound stimulus onset, and (iii) the subsequent phoneme onsets. In addition to timestamps, the files provide labels for the presented (true) and identified sound stimuli (phoneme or real/nonce word).
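A minimal sketch of reading such a per-session event file is given below. The column headers and values are hypothetical, chosen only to mirror the fields described above, and are not the exact names used in the repository.

```python
import csv
import io

# Hypothetical two-trial excerpt of a per-session event file.
sample = io.StringIO(
    "trial,tms_pulse2_s,audio_onset_s,true_stimulus,response\n"
    "1,12.401,12.451,/ba/,/ba/\n"
    "2,18.732,18.782,/da/,/ba/\n"
)

events = list(csv.DictReader(sample))

# Behavioral accuracy: trials where the identified stimulus matches
# the presented (true) stimulus.
correct = [row for row in events if row["true_stimulus"] == row["response"]]
print(len(events), len(correct))  # 2 1
```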

Two sets of analyses were performed to support the technical quality of the datasets. First, we extracted the grand mean event-related potentials (ERPs) by means of independent component analysis (ICA) 65 to illustrate evidence of a stimulus-locked response across participants in each condition. We selected this method primarily due to its widespread use in the investigation of human cognitive information processing, and therefore its familiarity among the electrophysiology research community.
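The grand-averaging step can be sketched on synthetic epochs as follows. Only the averaging across trials and participants is illustrated here; the ICA-based component selection described above is not reproduced, and the dimensions and the N100/P200-like response template are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 256                              # sampling rate after resampling (Hz)
n_subj, n_trials, n_samp = 8, 40, fs  # 1 s epochs; sizes are illustrative

t = np.arange(n_samp) / fs
# Toy stimulus-locked response: an N100-like dip followed by a larger
# P200-like peak, buried in trial-to-trial noise.
template = (-np.exp(-((t - 0.1) ** 2) / 0.001)
            + 1.5 * np.exp(-((t - 0.2) ** 2) / 0.002))
epochs = template + rng.normal(0.0, 1.0, size=(n_subj, n_trials, n_samp))

subject_erps = epochs.mean(axis=1)      # average over trials per subject
grand_mean = subject_erps.mean(axis=0)  # average over subjects

# Averaging recovers the P200-like peak near 200 ms.
print(abs(t[np.argmax(grand_mean)] - 0.2) < 0.05)  # True
```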

However, ICA can be subjective in its implementation by individual researchers, and the method may not be ideal for the analysis of specific types of data 66, 67. In particular, substantial attention has been paid to the need to remove the TMS artifact from TMS-EEG data 68-70. Therefore, we performed a second analysis with delay differential analysis (DDA) 71-73, a non-linear signal processing technique that requires minimal pre-processing and is noise insensitive 74, 75. The two analyses provide complementary evidence for the technical quality of the datasets.

Figure 7. Expected waveforms. All plots were modified and reproduced with permission from the publishers. (A) The auditory-evoked potential (AEP) for stop (/d/, /t/) consonant-vowel pairs exhibits a small N100 potential followed by a larger P200 potential. Variation in timing will occur depending on the stimulus type and in the presence of noise 76. (B) Vowels and consonants each produce a unique waveform 77, such that the overall shape is dependent on the contribution of each to the waveform. (C) The shape of the TMS-evoked potential (TEP) will differ according to the cortical region targeted. Whether TMS creates an excitatory or inhibitory response can be observed in the shape of the resulting TEP. In the motor cortex, greater activity between 25 and 125 ms accompanies excitatory paradigms 78. This figure illustrates the characteristic shape of the waveform for each type of neural response in the line and its standard deviation in the darker envelope.

The grand mean and standard deviation for each of the 61 channels, separated by participant, are represented in Figures 8 and 9, respectively. The reference pictures provided in Fig. 7 are meant to provide general guidance in interpreting the waveforms; please refer to the cited papers for their original findings. We observe that the ERPs approximate the expected auditory-evoked potential (AEP) induced by phoneme pairs composed of stop consonants and vowels (see Fig. 7A, B). Deviations from the anticipated AEP may occur due to noise (our stimuli were immersed in white noise) and the exact combination of consonants and vowels in each stimulus item 76, 77. The shape of the TMS-evoked potential (TEP) will depend on the number of pulses delivered, the interpulse interval, and whether stimulation is subthreshold or suprathreshold. A wide variety of TMS paradigms have been tested with conflicting results, such that it may be better to observe the TEP in order to identify whether the paradigm was excitatory or inhibitory, or to consider the effect by means of an additional measure, such as a behavioral task (see Fig. 7C) 78. The TMS paradigm used for collection of the dataset produced a facilitatory effect on performance in a phoneme perception task 51, 54.
Figure 9 provides an overview of the analyzed data in which each recorded channel is represented by its mean and standard deviation.

Delay differential analysis.

Delay differential analysis (DDA) is a signal processing technique that combines differential embeddings with linear and nonlinear nonuniform functional delay embeddings. The integration of nonlinear dynamics allows information to be detected in the data which may not be observable with traditional linear methods, and DDA requires minimal pre-processing. The DDA model used in this analysis is a three-term delay differential model in the delayed variables x_i = x(t − τ_i), with free parameters a_1, a_2, and a_3; the fixed parameters are the same as in Ref 85. We found that one of the free parameters, namely a_3, can be used to describe neural activity in a manner similar to ERPs. However, an ERP and a_3 are not strictly the same. In the current analysis, the delays are τ_1 = 6 δt and τ_2 = 16 δt, with δt = 1/f_s, where the sampling rate is f_s = 2000 Hz.

These delays are double those in Ref 85 because the sampling rate is double. The window length is 30 ms and the window shift is 1 ms. In Fig. 10, we observe waveforms that display the same dynamics as the reference studies in Fig. 7. We observe neural activity between 200 ms and 400 ms after stimulus onset, the same time window where activity is observed in Fig. 7A. We also observe a sharp spike in activity 25-125 ms after the final TMS pulse, which corresponds to the results illustrated in Fig. 7C. This finding suggests excitatory activity.
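The sliding-window estimation can be sketched as follows on a synthetic signal. The concrete three-term model form used here (x' = a_1*x_1 + a_2*x_2 + a_3*x_1*x_2), the least-squares fitting, and the numerical derivative are assumptions for illustration; the actual fixed parameters follow Ref 85.

```python
import numpy as np

fs = 2000                  # sampling rate (Hz), as in the text
tau1, tau2 = 6, 16         # delays in samples (6*dt and 16*dt)
win = int(0.030 * fs)      # 30 ms window -> 60 samples
shift = int(0.001 * fs)    # 1 ms window shift -> 2 samples

def dda_coeffs(x, tau1=tau1, tau2=tau2):
    """Least-squares fit of an assumed three-term DDA model
    x'(t) = a1*x(t-tau1) + a2*x(t-tau2) + a3*x(t-tau1)*x(t-tau2)."""
    dx = np.gradient(x) * fs               # numerical derivative
    m = max(tau1, tau2)
    t = np.arange(m, len(x))               # valid sample indices
    x1, x2 = x[t - tau1], x[t - tau2]      # delayed variables
    A = np.column_stack([x1, x2, x1 * x2])
    a, *_ = np.linalg.lstsq(A, dx[t], rcond=None)
    return a                               # a[2] is the a_3 feature

# Synthetic 1 s signal: a 10 Hz oscillation plus noise.
rng = np.random.default_rng(0)
t_sec = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 10 * t_sec) + 0.1 * rng.normal(size=t_sec.size)

# Track a_3 over sliding 30 ms windows with a 1 ms shift.
a3_trace = [dda_coeffs(x[s:s + win])[2]
            for s in range(0, len(x) - win, shift)]
print(len(a3_trace))  # 970 windows
```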

Limitations and final remarks.

This dataset is intended to provide researchers with a means to systematically test the classification accuracy of speech decoding models against naturalistic speech stimuli of increasing complexity, within and across datasets that manipulate the cortical state of participants. To our knowledge, this is the first EEG dataset for neural speech decoding that (i) augments neural activity by means of neuromodulation and (ii) provides stimulus categories constructed in accordance with principles of phoneme articulation and coarticulation. Nonetheless, several limitations of the dataset can be noted.

First, the experimental task involves aspects of comprehension, production, and motor activity (in the form of a button-press response), which may overlap in the neural signal. In particular, in single phoneme trials, speech may have been produced while potentials relevant to comprehension of the speech sound were still ongoing. However, it is well known that the neural networks underlying motor and language functions are not strictly dissociable. They are frequently coactivated, even in covert speech or comprehension paradigms 45, 46. Therefore, we believe this phenomenon underlies most if not all speech decoding paradigms, to a greater or lesser degree, and likely represents neural processing in a naturalistic context.

Second, inner speech is widely adopted in the speech decoding literature, where it is often considered to be the most intuitive way of controlling a BCI device. However, inner speech decoding paradigms may not accurately mark the onset of individual stimuli or component phonemes in the recorded data. We believe that prior to transitioning to an inner speech paradigm, researchers would benefit from developing models that can target specific features of the speech stream, such as articulatory features and coarticulation. This kind of systematic study of the speech input may lead to more robust models overall and a better understanding of how this process occurs, rather than "black box" models that must be trained on huge amounts of data.