Abstract
Speech brain-computer interfaces (BCIs) have the potential to restore rapid communication to people with paralysis by decoding neural activity evoked by attempted speaking movements into text1,2 or sound3,4. Early demonstrations, while promising, have not yet achieved accuracies high enough for communication of unconstrained sentences from a large vocabulary1–5. Here, we demonstrate the first speech-to-text BCI that records spiking activity from intracortical microelectrode arrays. Enabled by these high-resolution recordings, our study participant, who can no longer speak intelligibly due to amyotrophic lateral sclerosis (ALS), achieved a 9.1% word error rate on a 50 word vocabulary (2.7 times fewer errors than the prior state of the art speech BCI2) and a 23.8% word error rate on a 125,000 word vocabulary (the first successful demonstration of large-vocabulary decoding). Our BCI decoded speech at 62 words per minute, which is 3.4 times faster than the prior record for any kind of BCI6 and begins to approach the speed of natural conversation (160 words per minute7). Finally, we highlight two aspects of the neural code for speech that are encouraging for speech BCIs: spatially intermixed tuning to speech articulators that makes accurate decoding possible from only a small region of cortex, and a detailed articulatory representation of phonemes that persists years after paralysis. These results show a feasible path forward for using intracortical speech BCIs to restore rapid communication to people with paralysis who can no longer speak.
Results
Representation of orofacial movement
It is not yet known how orofacial movements are organized in human motor cortex at single neuron resolution. If speech articulators are spatially intermixed within a small area, then accurate speech decoding should be possible from only a small number of microelectrode arrays. However, prior work using electrocorticographic grids has suggested that there may be a broader somatotopic organization8. To investigate this, we recorded neural activity from four microelectrode arrays, two in area 6v9 (ventral premotor cortex) and two in area 44 (part of Broca’s area), while our BrainGate study participant attempted to make individual orofacial movements in response to cues displayed on a computer monitor (Fig. 1a-b; SFig 1 shows recorded spike waveforms). Our participant (T12) has bulbar-onset ALS and retains some limited orofacial movement and an ability to vocalize unintelligibly when attempting to speak.
We found that tuning to speech articulators in area 6v was intermixed at the single electrode level (Fig 1c-d), and that all speech articulators and phonemes were clearly represented even within a single 8×8 array (3.2 × 3.2 mm). No single category of movement (forehead, eyelids, jaw, larynx, lips, or tongue) appeared to generate substantially more modulation than any other (SFig 2). Neural activity was highly separable between movements: using a simple naive Bayes classifier applied to 1 second of neural population activity for each trial, we could decode from among 34 orofacial movements with 92.7% accuracy (95% CI = [90.7, 94.5]) and 39 phonemes with 60% accuracy (95% CI = [56.1, 64.1]) (SFig 3).
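The classification analysis above can be illustrated with a minimal Gaussian naive Bayes sketch applied to one feature vector per trial (built from 1 second of binned neural population activity). The class means, noise level, and trial counts below are synthetic stand-ins invented for illustration, not values taken from the recordings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: one feature vector per trial,
# built from binned threshold-crossing counts on each electrode.
# Class separations here are illustrative only.
n_classes, n_train, n_test, n_elec = 34, 15, 5, 128
class_means = rng.normal(0.0, 1.0, size=(n_classes, n_elec))

def make_trials(n_per_class):
    X = np.concatenate([m + 0.5 * rng.normal(size=(n_per_class, n_elec))
                        for m in class_means])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y

X_tr, y_tr = make_trials(n_train)
X_te, y_te = make_trials(n_test)

# Gaussian naive Bayes: model each class with independent per-electrode
# Gaussians, then assign each held-out trial to the class with the
# highest log-likelihood.
mu = np.stack([X_tr[y_tr == c].mean(0) for c in range(n_classes)])
var = np.stack([X_tr[y_tr == c].var(0) + 1e-3 for c in range(n_classes)])
ll = -0.5 * (((X_te[:, None, :] - mu) ** 2) / var + np.log(var)).sum(-1)
acc = (ll.argmax(1) == y_te).mean()
print(f"classification accuracy: {acc:.2f}")
```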
Robust tuning to all tested movements and phonemes suggests that the representation of attempted orofacial movement is likely strong enough to support a speech BCI, despite paralysis and narrow coverage of the cortical surface. We also assessed neural tuning in area 44, which has previously been implicated in high-order aspects of speech production10,11, but found little to no representation of movement or attempted speech (SFig 2-3); all further results are based on area 6v recordings only.
Decoding attempted speech
Next, we tested whether we could neurally decode whole sentences in real-time. We trained a recurrent neural network (RNN) decoder to emit, at each 80 ms time step, the probability of each phoneme being spoken at that time. These probabilities were then processed by a language model to infer the most likely underlying sequence of words, given both the phoneme probabilities and the statistics of the English language (Fig 2a).
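A minimal sketch of the first decoding stage is shown below, assuming a CTC-style readout with a blank token: take the most likely phoneme at each 80 ms step, merge repeats, and drop blanks. The phoneme inventory and probability sequence are toy stand-ins; the real system rescores the RNN's phoneme probabilities with a language model rather than taking this greedy path.

```python
import numpy as np

PHONEMES = ["_", "HH", "EH", "L", "OW"]  # "_" = silence/blank token (toy inventory)

def greedy_phoneme_decode(probs: np.ndarray) -> list:
    """Collapse an (n_steps, n_phonemes) probability sequence into a phoneme
    sequence: argmax at each time step, merge consecutive repeats, and drop
    the blank token (a CTC-style readout)."""
    ids = probs.argmax(axis=1)
    out, prev = [], None
    for i in ids:
        if i != prev and PHONEMES[i] != "_":
            out.append(PHONEMES[i])
        prev = i
    return out

# Toy probability sequence whose argmax path spells H-EH-L-OW.
steps = [1, 1, 2, 2, 0, 3, 3, 4, 4]
probs = np.eye(len(PHONEMES))[steps]
print(greedy_phoneme_decode(probs))  # → ['HH', 'EH', 'L', 'OW']
```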
To train the RNN, at the beginning of each day we recorded data where T12 attempted to speak 260-480 sentences at her own pace (sentences were chosen randomly from the Switchboard corpus12 of spoken English). A computer monitor cued T12 when to begin speaking and what sentence to speak. The RNN was trained on these data in combination with all prior days’ data, using custom machine learning methods adapted from modern speech recognition13–15 to achieve high performance on limited amounts of neural data. In particular, we used unique input layers for each day to account for across-day changes in the neural activity, and rolling feature adaptation to account for within-day changes (SFig 4 highlights the effect of these and other architecture choices). By the last day, our training dataset consisted of 10,850 total sentences.
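The rolling feature adaptation step can be illustrated with a simple exponentially weighted z-scoring scheme: track running estimates of each feature's mean and variance and normalize each incoming time step against them, so slow within-day drifts are absorbed. This is a generic sketch; the smoothing constant and the study's exact adaptation rule are assumptions, not reported values.

```python
import numpy as np

def rolling_zscore(features: np.ndarray, alpha: float = 0.01,
                   eps: float = 1e-6) -> np.ndarray:
    """Adaptive normalization: exponentially weighted running mean/variance
    per feature, used to z-score each time step online. alpha and eps are
    illustrative values."""
    mean = np.zeros(features.shape[1])
    var = np.ones(features.shape[1])
    out = np.empty(features.shape, dtype=float)
    for t, x in enumerate(features):
        out[t] = (x - mean) / np.sqrt(var + eps)
        mean = (1 - alpha) * mean + alpha * x
        var = (1 - alpha) * var + alpha * (x - mean) ** 2
    return out

# Demo: features whose baseline sits at 5 are pulled back toward zero mean.
drifted = np.random.default_rng(1).normal(5.0, 1.0, size=(2000, 3))
normed = rolling_zscore(drifted)
```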
After training, the RNN was evaluated in real-time on held-out sentences that were never duplicated in the training set. As T12 attempted to speak, neurally decoded words appeared on the screen in real-time reflecting the language model’s current best guess (SVideo 1). When T12 was finished speaking, she pressed a button to finalize the decoded output. We used two different language models: a large vocabulary model with 125,000 words (suitable for general English) and a small vocabulary model with 50 words (suitable for expressing some simple sentences useful in daily life). Sentences from the Switchboard corpus12 were used to evaluate the RNN with the 125,000 word vocabulary. For the 50 word vocabulary, we used the word set and test sentences from Moses et al. 20212.
Performance was evaluated over 5 days of attempted speaking with vocalization and 3 days of attempted silent speech (“mouthing” the words with no vocalization, which T12 reported she preferred because it was less tiring). Performance was consistently high for both speaking modes (Fig 2b-c, Table 1). T12 achieved a 9.1% word error rate for the 50 word vocabulary across all vocalizing days (11.2% for silent), and a 23.8% word error rate for the 125k word vocabulary across all vocalizing days (24.7% for silent). To our knowledge, this is the first successful demonstration of large-vocabulary decoding, and is also a significant advance in accuracy for small vocabularies (2.7 times fewer errors than prior work2). These accuracies were achieved at high speeds: T12 spoke at an average pace of 62 words per minute, which more than triples the speed of the prior state of the art for any type of BCI (18 words per minute for a handwriting BCI6).
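For reference, word error rate is the standard metric computed from a Levenshtein alignment over words: (substitutions + insertions + deletions) divided by the number of reference words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate via the standard Levenshtein edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / len(ref)

# One deletion ("a") and one insertion ("please") against 6 reference words.
print(word_error_rate("i want a glass of water",
                      "i want glass of water please"))  # → 0.333...
```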
Encouragingly, the RNN often decoded sensible sequences of phonemes before a language model was applied (Fig 2c). Phoneme error rates were 19.7% for vocal speech (20.9% for silent; see Table 1) and phoneme decoding errors followed a pattern related to speech articulation, where phonemes that are articulated similarly were more likely to be confused by the decoder (SFig 5). These results suggest that good decoding performance is not overly reliant on a language model. We further tested the decodability of the neural activity without a language model by using a special dataset of single-word utterances from the 50-word vocabulary. When applying a simple naive Bayes classifier to this dataset, we achieved 95.5% accuracy (95% CI = [94.2, 96.7]), which outperforms prior work based on electrocorticographic recordings (47.1%2). This indicates that the performance advance demonstrated here is not due to the language model, but rather the increased decodability afforded by higher resolution intracortical recordings.
Finally, we explored the ceiling of decoding performance offline by (1) making further improvements to the language model and (2) evaluating the decoder on test sentences that occur closer in time to the training sentences (to mitigate the effects of within-day changes in the neural features across time). We found that an improved language model could decrease word error rates from 23.8% to 17.4%, and that testing on more proximal sentences further decreased word error rates to 11.8% (Table 1). These results indicate that substantial gains in performance are likely still possible with further language model improvements and more robust decoding algorithms that generalize better to nonstationary data.
Preserved representation of speech articulation
Next, we interrogated the representation of phonemes in area 6v during attempted speech. This is a challenging problem since we do not have ground truth knowledge of when each phoneme is being spoken (since T12 cannot speak intelligibly). To estimate how each phoneme was neurally represented, we analyzed our RNN decoders to extract vectors of neural activity (“saliency” vectors) that maximized the RNN probability output for each phoneme. We then asked whether these saliency vectors encode details about how phonemes are articulated.
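The input-optimization idea behind saliency vectors can be sketched generically: ascend the gradient of the decoder's output probability for one phoneme with respect to the input. Here a linear softmax readout stands in for the trained RNN, and the step size, iteration count, and norm constraint are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_phonemes = 64, 39

# Stand-in for the trained decoder: a linear softmax readout.
W = rng.normal(0, 0.1, size=(n_features, n_phonemes))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def saliency_vector(target: int, steps: int = 200, lr: float = 0.5) -> np.ndarray:
    """Gradient ascent on the input to maximize the decoder's probability for
    one phoneme, yielding a 'saliency' vector of neural activity (a sketch of
    the general technique; hyperparameters are illustrative)."""
    x = np.zeros(n_features)
    for _ in range(steps):
        p = softmax(W.T @ x)
        grad = W[:, target] - W @ p   # d log p[target] / dx
        x += lr * grad
        x /= max(1.0, np.linalg.norm(x))  # keep activity in a bounded range
    return x

sal = saliency_vector(target=3)
```

Saliency vectors for different phonemes can then be compared to one another (e.g., via correlation), as in the articulatory analyses that follow.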
First, we compared the neural representation of consonants to their articulatory representation, as measured by electromagnetic articulography (EMA) in able-bodied speakers. We found broadly similar structure, which is especially apparent when ordering consonants by place of articulation (Fig. 3a); the correlation between the EMA and neural data was 0.61, far above chance (p<1e-4, Fig. 3b). More detailed structure can also be seen – for example, nasal consonants are correlated with each other (M, N, NG), and W is correlated with both labial consonants and velar/palatal consonants (since it contains aspects of both). Examining a low-dimensional representation of the geometry of the neural and articulatory representations shows a close match in the top two dimensions (Fig. 3c).
Next, we examined the representation of vowels, which have a two-dimensional articulatory structure: a high vs. low axis (how high the tongue is in the mouth, corresponding to the first formant frequency) and a front vs. back axis (whether the tongue is bunched up towards the front or back of the mouth, corresponding to the second formant frequency). We found that the saliency vectors for vowels mirror this structure, with vowels that are articulated similarly having a similar neural representation (Fig. 3d-e). Additionally, the neural activity contains a plane that reflects the two dimensions of vowels in a direct way (Fig. 3f).
Taken together, these results show that a detailed articulatory code for phonemes is still preserved even years after paralysis.
Design considerations for speech BCIs
Finally, we examined three design considerations for improving the accuracy and usability of speech BCIs: language model vocabulary size, microelectrode count, and training dataset size.
To understand the effect of vocabulary size, we re-analyzed the 50-word-set data by re-processing the RNN output using language models with increasingly larger vocabularies (Fig. 4a). We found that only very small vocabularies (e.g., 50-100 words) yielded a large improvement in accuracy relative to the large-vocabulary model. Word error rates saturated at around 1,000 words, suggesting that using an intermediate vocabulary size may not be a viable strategy for increasing accuracy.
Next, we investigated how accuracy improved as a function of the number of electrodes used for RNN decoding. Accuracy improved monotonically with a log-linear trend (Fig. 4b), suggesting that an increased channel count should lead to higher accuracies in the future (doubling the electrode count appears to cut the error rate approximately in half).
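This log-linear relationship can be made concrete with a small fit. The error rates below are invented to follow the described trend (doubling the electrode count roughly halves the error rate) and are not the measured values from Fig. 4b.

```python
import numpy as np

# Illustrative (not measured) word error rates vs. electrode count.
channels = np.array([16, 32, 64, 128])
wer = np.array([0.96, 0.50, 0.26, 0.12])

# Fit log2(WER) = a * log2(channels) + b; a slope near -1 corresponds to
# "double the electrodes, halve the errors".
a, b = np.polyfit(np.log2(channels), np.log2(wer), 1)
print(f"slope: {a:.2f}")  # close to -1 for these illustrative numbers
```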
Finally, in this demonstration we used a large amount of training data per day (260 – 440 sentences). Retraining the decoder each day helps to adapt to neural changes that occur across days. We examined offline whether this much data per day was necessary, by re-processing the data with RNNs trained in the same way but with fewer sentences. We found that performance was good even without using any training data from the new day (Fig. 4c, word error rate = 30% with no retraining). Furthermore, we found that neural activity changed at a gradual rate over time, suggesting that unsupervised algorithms for adapting decoders to neural changes should be feasible (Fig. 4d).
Discussion
People with neurological disorders such as brainstem stroke or amyotrophic lateral sclerosis (ALS) frequently face severe speech and motor impairment and, in some cases, completely lose the ability to speak (locked-in syndrome17). Recently, BCIs based on hand movement activity have enabled typing speeds of 8-18 words per minute in people with paralysis18,6. Speech BCIs have the potential to restore natural communication at a much faster rate, but have not yet achieved high accuracies on large vocabularies (i.e., unconstrained communication of any sentence the user may want to say)1–5. Here, we demonstrated a speech BCI that can decode unconstrained sentences from a large vocabulary at a speed of 62 words per minute, the first time that a BCI has far exceeded the communication rates that alternative technologies can provide for people with paralysis (e.g., eye tracking19). We were able to decode at high speeds with 2.7 times fewer errors than the prior state of the art for speech BCIs when evaluated on a matching 50-word vocabulary2, made possible by using intracortical microelectrode arrays that record neural activity at single neuron resolution.
Our demonstration is a proof of concept that decoding attempted speaking movements from intracortical recordings is a promising approach, but it is not yet a complete, clinically viable system. Work remains to be done to reduce the time needed to train the decoder and adapt to changes in neural activity that occur across days without requiring the user to pause and recalibrate the BCI (see 20–22 for initial promising approaches). Perhaps most importantly, a 24% word error rate is likely not yet low enough for everyday use (e.g., compare to 4-5% word error rate for state of the art speech-to-text systems15,23). Nevertheless, we believe that our results show promise for decreasing the word error rates further. First, word error rate decreases as more channels are added, suggesting that intracortical technologies that record more channels should enable lower word error rates. Second, room still remains for optimizing the decoding algorithm; with further language model improvements and when mitigating the effect of within-day nonstationarities, we were able to reduce the word error rate to 11.8% in offline analyses. Finally, we showed that ventral premotor cortex (area 6v) contains a rich, intermixed representation of speech articulators even within a small area (3.2 × 3.2 mm), and that the details of how phonemes are articulated are still faithfully represented even years after paralysis in someone who can no longer speak intelligibly. Taken together, this suggests that a higher channel count system that records from only a small area of 6v is a feasible path forward towards a device that can restore communication at conversational speeds to people with paralysis.
Competing Interests
The MGH Translational Research Center has a clinical research support agreement with Neuralink, Paradromics and Synchron, for which L.R.H. provides consultative input. J.M.H. is a consultant for Neuralink, and serves on the Medical Advisory Board of Enspire DBS. K.V.S. consults for Neuralink and CTRL-Labs (part of Facebook Reality Labs) and is on the scientific advisory boards of MIND-X, Inscopix and Heal. All other authors have no competing interests.
Supplemental Video 1. In this video, participant T12 uses the speech BCI in real-time to copy sentences shown on the screen. When the square in the center of the screen is red, T12 reads the sentence above the square and prepares to speak it. When the square turns green, T12 attempts to speak that sentence while the real-time decoder output is shown below the square. Note that T12 produces unintelligible vocalizations when attempting to speak. When T12 is finished speaking, a text-to-speech program reads the final decoded text aloud. These sentences were recorded during a performance evaluation session reported in Figure 2 (post-implant day 136).
Supplemental Video 2. The same as supplemental video 1, except T12 is silently speaking (i.e., mouthing the words) instead of attempting to produce vocalizations. These sentences were recorded on post-implant day 141.
1. Experimental procedures
1.1. Study participant
This study includes data from one participant (identified as T12) who gave informed consent and was enrolled in the BrainGate2 Neural Interface System clinical trial (ClinicalTrials.gov Identifier:
NCT00912041, registered June 3, 2009). This pilot clinical trial was approved under an Investigational Device Exemption (IDE) by the US Food and Drug Administration (Investigational Device Exemption #G090003). Permission was also granted by the Institutional Review Board of Stanford University (protocol #52060). T12 gave consent to publish photographs and videos containing her likeness. All research was performed in accordance with relevant guidelines/regulations.
T12 is a left-handed woman, 67 years old at the time of data collection, with slowly progressive bulbar-onset Amyotrophic Lateral Sclerosis (ALS) diagnosed at age 59 (ALS-FRS score of 26 at the time of study enrollment). On March 30, 2022, four 64-channel, 1.5 mm-length silicon microelectrode arrays coated with sputtered iridium oxide (Blackrock Microsystems, Salt Lake City, UT) were implanted in T12’s left hemisphere, based on preoperative anatomical and functional magnetic resonance imaging (MRI) and cortical parcellation (see sections 1.2 and 1.3 below for details). Two arrays were placed in area 6v (oral-facial motor cortex) of ventral precentral gyrus, and two were placed in area 44 of inferior frontal gyrus (considered part of Broca’s area). Data are reported from post-implant days 27-148. On average, 119.6 ± 5.0 (mean ± SD) out of 128 electrodes recorded spike waveforms at a rate of at least 2 Hz when using a spike-detection threshold of -4.5 RMS, where RMS is the electrode-specific root mean square of the voltage time series recorded on that electrode (see SFig 1 for example waveforms).
T12 is severely dysarthric due to bulbar ALS and has been for nearly 8 years. She retains partial use of her limbs, and communicates primarily through use of a writing board or iPad tablet. She is able to vocalize while attempting to speak, and is able to produce some subjectively differentiable vowel sounds. However, we had difficulty discerning nearly all consonants produced in isolation (with the possible exception of the bilabial nasal consonant “M”), and could not reliably make out any consonants or vowels when T12 attempted to speak whole sentences at a fluent rate (SVideo 1 shows examples of attempted speaking).
1.2. Functional MRI speech lateralization
Prior to surgery, participant T12 underwent anatomic and functional brain imaging on a GE Discovery MR750 3T MRI scanner, using a routine clinical acquisition protocol, in order to determine whether she was right or left hemisphere dominant for language. BOLD fMRI images were acquired using T2*-weighted volumes collected with 4 mm slice thickness and 2×2 mm2 in-plane voxel resolution. BOLD images were acquired during performance of a suite of tasks including visually responsive naming, object naming, auditory responsive naming, repetitive movements of the right hand, left hand, right foot, left foot, and tongue. Tasks were performed in 4 minute blocks consisting of repeated sequences of 10 seconds of task performance followed by 10 seconds of rest. Task instructions were presented by the SensaVue presentation system with verbal instructions given by the MRI technologist. T-score thresholds for processing were chosen based on direct inspection of the preliminary fMRI output for each task produced by the scanner software and inspected by the interpreting radiologist at the scanner.
Following scan completion, the complete data set was sent to and processed by DynaSuite. Fully processed fMRI statistical parametric maps for each task were registered to and overlaid on an anatomic 3D gradient echo T1-weighted (BRAVO) image acquired with 1 mm isotropic voxels. Quality control steps included motion tracking and assessment of anatomic-functional registration fidelity. Language lateralization was assessed by visual inspection of lateralization of activation in Broca’s area, Wernicke’s area, speech supplemental motor area, and the basal temporal language area. Results indicated a clear left hemisphere lateralization of language in T12.
1.3. Array placement targeting
The surgical targets for array placement within areas 6v and 44 were selected based on gross anatomical structure (e.g., gyri and sulci), vasculature, and estimates of the boundaries of areas 44 and 6v obtained using a cortical parcellation method derived from multi-modal Human Connectome Project (HCP) data [1].
We acquired T1-weighted (T1w), T2-weighted (T2w), resting-state functional MRI (rsfMRI), single band fMRI, and spin echo fieldmap images to generate the HCP-based cortical parcellation. The participant was scanned in a 3T Ultra High Performance scanner (GE Healthcare) with a Nova 32-channel coil. Scan parameters were based on HCP Lifespan protocols and modified for the GE system (Table 1).
Data were processed using the HCP pipelines as described on https://github.com/Washington-University/HCPpipelines (see Glasser et al. 2016 for further details). Briefly, T1w and T2w images were initially preprocessed using the FreeSurfer pipeline (version 7.1.1) to perform motion, distortion, and bias field corrections; brain extraction; white matter segmentation; cortical surface reconstruction; and spherical mapping (PreFreeSurferPipelineBatch.sh, FreeSurferPipelineBatch.sh). FreeSurfer outputs were then aligned to the standard surface template using MSMSulc, as well as used to create myelin maps (PostFreeSurferPipelineBatch.sh).
rsfMRI data were corrected for motion, bias field, and susceptibility distortions using the spin echo fieldmaps and single band reference fMRI images and non-linearly registered to MNI space (GenericfMRIVolumeProcessingPipelineBatch.sh), followed by volume to surface mapping (GenericfMRISurfaceProcessingPipelineBatch.sh). Data then underwent spatial MELODIC ICA (IcaFixProcessingBatch.sh), manual classification of the components as signal or noise, and denoising.
The MSMAll pipeline was then run to re-align the participant’s cortical surface to the standard surface template using areal features from the cortical folding map, myelin map, rsfMRI networks, and rsfMRI-based retinotopy. Lastly, the data underwent a dedrift and resample step (DeDriftAndResamplePipelineBatch.sh). The 210P and 210V cortical parcellations (Glasser et al. 2016) were viewed on the participant’s native brain surface with Connectome Workbench to identify estimates of areas 6v and 44 in T12.
1.4. Neural signal processing
Neural signals were recorded from the microelectrode arrays using the Neuroplex-E system (Blackrock Microsystems) and transmitted via a cable attached to a percutaneous connector. Signals were analog filtered (4th order Butterworth with corners at 0.3 Hz and 7.5 kHz), digitized at 30 kHz (250 nV resolution), and fed to custom software written in Simulink (Mathworks) for digital filtering and feature extraction. Digital filtering began with a highpass filter (300 Hz cutoff) that was applied non-causally to each electrode, using a 4 ms delay, in order to improve spike detection [2]. Linear regression referencing (LRR) was then applied to further reduce ambient noise artifacts [3].
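Linear regression referencing can be sketched as follows: each electrode's voltage is regressed on all other electrodes, and the predicted common-mode component is subtracted. This is a minimal version of the technique; in the study, the coefficients were fit on a diagnostic block at the start of each session and then applied for the remainder of the session.

```python
import numpy as np

def fit_lrr(V: np.ndarray) -> np.ndarray:
    """Fit LRR coefficients: for each electrode, regress its voltage on all
    other electrodes. V is (n_samples, n_electrodes); returns an
    (n_electrodes, n_electrodes) coefficient matrix with a zero diagonal."""
    n = V.shape[1]
    W = np.zeros((n, n))
    for i in range(n):
        others = np.delete(np.arange(n), i)
        coef, *_ = np.linalg.lstsq(V[:, others], V[:, i], rcond=None)
        W[i, others] = coef
    return W

def apply_lrr(V: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Subtract each electrode's predicted common-mode signal."""
    return V - V @ W.T

# Demo: 8 synthetic electrodes sharing a large common-mode noise source.
rng = np.random.default_rng(0)
shared = rng.normal(size=(5000, 1))
V = shared + 0.1 * rng.normal(size=(5000, 8))
cleaned = apply_lrr(V, fit_lrr(V))
```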
After filtering, binned threshold crossing counts (20 ms bins) were computed by counting the number of times the filtered voltage time series crossed an amplitude threshold set at -4.5 times the standard deviation of the voltage signal. Electrode-specific thresholds and LRR filter coefficients were set using data recorded from an initial “diagnostic” block at the beginning of each session (see section 1.7 for more details). Binned spike band power (20 ms bins) was computed by taking the sum of squared voltages observed during each time bin. Threshold crossing rates and spike band power are commonly used measurements of local spiking activity that have been shown to be comparable to sorted single unit activity in terms of decoding performance and neural population structure [4, 5, 6]. For decoding, threshold crossing counts and spike band power from the 128 electrodes in area 6v were concatenated to yield a 256 × 1 feature vector per time step. For neural tuning analyses (e.g. Figure 1), only threshold-crossing counts were used.
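A minimal sketch of the binned feature extraction follows. The bin size and threshold multiplier follow the text; the voltage data below are synthetic noise, and this sketch omits details of the study's real-time implementation.

```python
import numpy as np

FS = 30_000                       # sample rate (Hz)
BIN_SAMP = FS * 20 // 1000        # 600 samples per 20 ms bin

def extract_features(v: np.ndarray, mult: float = -4.5) -> np.ndarray:
    """From filtered voltage (n_samples, n_electrodes), compute per-bin
    threshold crossing counts (downward crossings of -4.5 x RMS) and spike
    band power (sum of squared voltages), concatenated per bin."""
    n_bins = v.shape[0] // BIN_SAMP
    v = v[:n_bins * BIN_SAMP].reshape(n_bins, BIN_SAMP, -1)
    thresh = mult * np.sqrt((v ** 2).mean(axis=(0, 1)))  # per-electrode RMS
    # A crossing is a sample below threshold whose predecessor was above it.
    below = v < thresh
    crossings = (below[:, 1:, :] & ~below[:, :-1, :]).sum(axis=1)
    power = (v ** 2).sum(axis=1)
    return np.concatenate([crossings, power], axis=1)

# Demo: 200 ms of synthetic noise on 2 electrodes -> 10 bins x 4 features.
v = np.random.default_rng(0).normal(size=(6000, 2))
feats = extract_features(v)
print(feats.shape)  # → (10, 4)
```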
1.5. Data collection rig
Digital signal processing and feature extraction was performed on a dedicated computer using Simulink Real-Time. Extracted features were then sent to a separate computer running Ubuntu for neural decoding and recording. Decoding and recording software was written in Python using TensorFlow 2 and Redis. The Ubuntu computer also ran the experimental task software that displayed cues to T12 on a computer monitor. The task software was implemented using MATLAB and the Psychophysics Toolbox [7]. Finally, a third computer running Windows was used to interface with the Neuroplex-E system and control the starting and stopping of experimental tasks.
1.6. Overview of data collection sessions
Neural data were recorded in 2-4 hour “sessions” on scheduled days, which typically occurred 2 times per week. During the sessions, T12 sat in either a wheelchair or power lift chair in an upright position, with a pillow placed to support her head and neck, and her hands resting on her lap. A computer monitor placed in front of T12 indicated which sentence to speak (or which movement to make) and when. Data were collected in a series of 5-10 minute “blocks” consisting of an uninterrupted series of trials. In between these blocks, T12 was encouraged to rest as needed. Table 2 below lists all 26 data collection sessions reported in this work.
1.7. Instructed delay tasks
All tasks employed an instructed delay paradigm, with each trial consisting of an instructed delay phase followed by a go phase. For sentence speaking blocks, during the delay period the text of the sentence was displayed on the screen above a red square, providing T12 time to read it and prepare to speak. After the delay period, the red square cue then turned green, and the sentence remained on the screen while T12 attempted to speak it (either aloud or by silently mouthing it, depending on the session). When T12 finished speaking the sentence, she pushed a button held in her lap, which triggered the system to move to the next sentence trial.
For the single phoneme task, each phoneme was cued during the delay period with both text and an audio sample of that phoneme being spoken. Vowels were spoken in isolation and cued with the text of a word containing that vowel, with the vowel capitalized (e.g., strUt for 2). Consonants were all paired with the vowel “ah” following the consonant (denoted “AA” in ARPAbet notation and “A” in IPA). Consonants were paired with a text cue evoking the sound of that consonant (e.g. “kah” or “wah”). For the single word task, where T12 spoke individual words from the 50-word Moses et al. word set [8], each word was cued with text only. For the orofacial movement sweep task, each movement was cued with a text description of that movement (e.g., “Tongue Up”).
We ran a single “diagnostic” block at the beginning of each speech decoding session. Data from the diagnostic block was used to set electrode thresholds and linear regression reference (LRR) coefficients, and was also used to examine the rate of change of neural tuning across days. In this block, T12 spoke individual words from a diagnostic set of 7 words designed to span the space of articulation (with 8 repetitions per word). Words were cued with text only. The word set consisted of the following words: ‘bah’, ‘choice’, ‘day’, ‘kite’, ‘though’, ‘veto’, ‘were’.
Delay period durations were pseudorandomly drawn from an exponential distribution with a task-specific mean (see Table 3); values that fell outside of a specified task-specific range were re-drawn. Go period durations were fixed to a task-specific value for non-sentence tasks (for the sentence production task, T12 advanced the trial by pressing a button). Finally, in the orofacial movement sweep task, the go period was followed by a short “return” phase where T12 relaxed back to a neutral posture before starting the next trial. In all other tasks, the next trial started immediately after the go period ended.
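The delay-duration sampling can be sketched directly; the mean and allowed range below are illustrative values, not the task-specific settings from Table 3.

```python
import numpy as np

def draw_delay(mean_s: float, lo_s: float, hi_s: float,
               rng: np.random.Generator) -> float:
    """Draw an instructed-delay duration (seconds) from an exponential
    distribution with a task-specific mean, re-drawing any value that falls
    outside the allowed range."""
    while True:
        d = rng.exponential(mean_s)
        if lo_s <= d <= hi_s:
            return d

rng = np.random.default_rng(0)
delays = [draw_delay(2.0, 1.0, 4.0, rng) for _ in range(1000)]
```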
1.8. Voiced vs. silent speaking behavior
For most speaking sessions, T12 was instructed to attempt to produce voiced speech in a “typical” manner (i.e., by trying to move all of her articulators and modulate her larynx to produce sound, as one would when speaking normally). The acoustic output was largely unintelligible. During the course of the study, T12 reported that due to her reduced breath control abilities, attempting to produce voiced speech was fatiguing. We experimented with different speaking behavior paradigms and found that “mouthing” or silent speaking yielded similar decoding performance to voiced speech while being less fatiguing for T12. For silent speech sessions, we instructed T12 to pretend that she was mouthing the sentence to someone across the room. During this silent speaking behavior, T12 produced no audible sound, but visibly moved her lips, tongue and jaw. Sessions 14, 24, 25 and 26 were all performed with this silent speaking behavior.
1.9. Decoder evaluation sessions
Real-time speech decoding was evaluated in sessions 18, 19, 21, 22 and 23 for voiced speaking behavior and sessions 24, 25 and 26 for silent speaking behavior. These were the evaluation sessions reported in Fig. 2 and were conducted with the final version of the real-time decoder and parameters. Previous real-time decoding sessions were pilot sessions used to explore different online approaches and parameters. Each evaluation session began with a “diagnostic” block as described previously (section 1.7). This block was used to calculate the threshold values and filters for online LRR that would be used for the rest of the session. Then, we collected “open-loop” blocks of sentences (∼6 blocks with 40 sentences per block) during which no decoder was active. We trained the decoder using these blocks of data (combined with data from all past sessions) to obtain a “stage 1” RNN decoder. Next, we collected additional training blocks with real-time feedback, where the decoder output from the stage 1 RNN was displayed in real-time as T12 attempted to speak each sentence. Upon completion of each sentence, T12 would push a button to indicate she was finished and the decoded text was read aloud using Google Cloud’s Text-To-Speech functionality. After the response was voiced by the computer, T12 pushed the button again to continue to the next sentence. Stage 1 real-time decoding was done for 3-6 blocks, with 40 sentences per block. Finally, the RNN was retrained a second time using all open-loop and stage 1 data (combined with data from all past sessions) to yield a “stage 2” RNN. The stage 2 RNN was then evaluated on the 50 sentences from Moses et al 2021 [8] as well as 80 randomly-selected sentences from the Switchboard corpus, for which the final error rate and word per minute values were reported.
1.10. Sentence selection
For the first four initial training data collection sessions, sentences were randomly drawn from the OpenWebText2 corpus [9]. For all subsequent sessions, training sentences were drawn from the Switchboard corpus of telephone conversations between speakers of American English [10]. Sentences were selected by first generating lists of potential sentence segments by automatically splitting the transcription using provided punctuation marks. Sentence segments were then filtered to include only those that expressed a complete meaning, and superfluous starter words (e.g. “and”) were deleted. We also did not include sentences with confusing or distracting meaning, such as violent or offensive topics. Finally, we upsampled sentences with rare phonemes to ensure there was sufficient training data for the RNN to learn these rare phonemes. This resulted in a diverse sample of sentences from spoken conversational contexts.
For each evaluation day, after the final “stage 2” RNN was trained, three evaluation blocks were run. These included one block of the 50 sentences used for evaluation in Moses et al 2021 [8] (these sentences were the same every session), and two blocks of 40 sentences each from Switchboard, selected in the same manner as the training data. The Switchboard sentences were different for each evaluation session. The RNN decoder was never evaluated on a sentence that it had been trained on, and every sentence was unique (except for the “direct comparison” blocks that always used the same 50 sentences from Moses et al 2021). When we retrained the decoder each day before performance evaluation, we retrained it using all previously collected data (from all prior days) except for these direct comparison blocks, in order to prevent the RNN from overfitting to these repeated sentences.
2. Neural representation of orofacial movements and speech in orofacial cortex
2.1. Tuning heat maps
To generate the neural tuning heat maps shown in Figure 1, we started with binned threshold crossing spike counts using a −4.5 × RMS threshold (20 ms bins). To account for drifts in mean firing rates across the session, the binned threshold crossing rates were mean-subtracted within each block (i.e., for each electrode, its mean firing rate within each block was subtracted from each time step’s binned spike count).
Next, for each trial and electrode, threshold crossing counts were averaged in an 800 ms window (200 to 1000 ms after the go cue). Significance of tuning was then assessed via 1-way ANOVAs applied per electrode, where each ANOVA group corresponded to a different movement condition and each observation was a scalar average firing rate for a single trial. The movement conditions that were used to assess tuning to each movement type are shown in SFig 2. P-values from each ANOVA were used to define tuning significance (p<1e-5).
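This per-electrode significance test can be sketched as follows (an illustrative re-implementation, not our analysis code; the function name and data layout are hypothetical):

```python
import numpy as np
from scipy.stats import f_oneway

def tuned_electrodes(rates, labels, alpha=1e-5):
    """Identify significantly tuned electrodes with per-electrode 1-way ANOVAs.
    rates: (n_trials, n_electrodes) window-averaged firing rates;
    labels: (n_trials,) movement condition id for each trial.
    Returns a boolean mask of electrodes with p < alpha."""
    conds = np.unique(labels)
    sig = np.zeros(rates.shape[1], dtype=bool)
    for e in range(rates.shape[1]):
        # one ANOVA group per movement condition
        groups = [rates[labels == c, e] for c in conds]
        _, p = f_oneway(*groups)
        sig[e] = p < alpha
    return sig
```

Each electrode is tested independently; the stringent p < 1e-5 threshold guards against false positives across the 128 electrodes.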
The fraction of variance accounted for (FVAF) by movement tuning on a single electrode was defined as:

FVAF = 1 − SSERR / SSTOT
SSTOT is the total sum of squares over all trials. For computing SSTOT, the grand mean across all trials was first subtracted from each trial’s average firing rate (before squaring), so that the overall mean firing rate did not contribute to the variance.
SSERR is the sum of squared prediction errors across all trials. Prediction error was assessed with a cross-validated (5-fold) model which predicts the firing rate of each trial based only on the mean of the condition it belongs to. Condition-specific means were estimated on the training set by taking the sample means across training trials, and then applied to the held-out test set.
If there are large differences in mean firing rate between movement conditions (i.e., strong movement tuning), then SSERR will be small relative to SSTOT. Cross-validation prevents overestimation of tuning due to spurious differences in mean firing rate between conditions that are not stable across folds.
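A minimal sketch of this cross-validated FVAF computation (function name and fold assignment are illustrative; the analysis above used the exact windows and thresholds described there):

```python
import numpy as np

def cv_fvaf(rates, labels, n_folds=5, seed=0):
    """Cross-validated fraction of variance accounted for by condition
    tuning on one electrode. rates: (n_trials,) window-averaged firing
    rates; labels: (n_trials,) condition ids."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(rates))
    folds = np.array_split(order, n_folds)
    # SSTOT: total variance after removing the grand mean
    ss_tot = np.sum((rates - rates.mean()) ** 2)
    ss_err = 0.0
    for fold in folds:
        train = np.setdiff1d(order, fold)
        for c in np.unique(labels):
            tr = train[labels[train] == c]
            te = fold[labels[fold] == c]
            # predict each held-out trial from the training-set condition mean
            pred = rates[tr].mean() if len(tr) else rates[train].mean()
            ss_err += np.sum((rates[te] - pred) ** 2)
    return 1.0 - ss_err / ss_tot
```

Because condition means are estimated on training folds only, spurious between-condition differences do not inflate the estimate; untuned electrodes yield FVAF near (or slightly below) zero.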
2.2. Neural population tuning bar plots
The amount of neural population tuning to each movement condition in SFig 2 was computed using a cross-validated estimate of neural distance, following the methods described in [11]. Specifically, we estimated the Euclidean distance between the average firing rate vector for a given movement condition and a baseline “do nothing” condition in which T12 was instructed to remain at rest and make no movement. Threshold crossing firing rates were computed over an 800 ms time window (between 200 and 1000 ms after the go cue) using a −4.5 × RMS threshold. Distances were normalized by dividing by the square root of the number of electrodes per array (64), since neural population distances should grow, on average, with the square root of the number of channels. 95% confidence intervals were computed using the jackknife method [11].
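The core idea of a cross-validated distance estimate can be sketched with a simple split-half estimator in the spirit of [11] (an illustrative simplification; the published estimator and its jackknife confidence intervals are implemented in the cvVectorStats repository):

```python
import numpy as np

def cv_distance(cond, rest, seed=0):
    """Cross-validated estimate of the Euclidean distance between two
    conditions' mean firing-rate vectors, normalized by the square root
    of the channel count. Taking the inner product of two independent
    estimates of the mean difference removes the upward bias of the
    naive plug-in distance. cond, rest: (n_trials, n_channels)."""
    rng = np.random.default_rng(seed)
    def halves(x):
        idx = rng.permutation(len(x))
        h = len(x) // 2
        return x[idx[:h]].mean(0), x[idx[h:]].mean(0)
    c1, c2 = halves(cond)
    r1, r2 = halves(rest)
    d2 = np.dot(c1 - r1, c2 - r2)     # unbiased squared distance
    return np.sign(d2) * np.sqrt(abs(d2) / cond.shape[1])
```

When the two conditions have identical means, the two half-estimates of the difference are independent noise and the estimate concentrates near zero rather than at a positive value.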
2.3. Naive Bayes classification
Offline classification results (reported in SFig 3 and in the main text as classification accuracies) were generated using a cross-validated (leave-one-out) Gaussian naive Bayes classifier, following the methods described in [11], using threshold crossing rates computed in a window from 0 to 1000 ms after the go cue (−4.5 × RMS thresholds were used). 95% confidence intervals for classification accuracies were computed with bootstrap resampling (10,000 resamples). We chose a Gaussian naive Bayes classifier because it is a simple method that performed well enough to demonstrate the existence of strong neural tuning; it is likely that more advanced methods could improve classification accuracy further.
2.4. Preserved articulatory representation of phonemes
2.4.1. Electromagnetic articulography (EMA) representations
Electromagnetic articulography (EMA) and corresponding audio data were taken from the publicly available dataset “USC-Timit” [12], which contains phoneme labels (the beginning and end of each phoneme in each sentence). In USC-Timit, EMA data were collected with 6 markers (lower lip, upper lip, jaw, tongue tip, tongue blade, and tongue dorsum). We used only the X and Y positions of each marker (sagittal plane position), yielding a 12-dimensional signal.
To compute an EMA representation of each phoneme, we averaged over all EMA marker position data recorded at time points that were labeled as belonging to that phoneme, yielding a single average articulator position vector for each phoneme. Finally, a binary voicing indicator variable was added to the EMA representation of each phoneme. This variable was set to 1 for voiced phonemes (e.g., ‘z’) and 0 for unvoiced phonemes (e.g., ‘s’). Since EMA does not measure voicing, this is a simple way to include some voicing information that would otherwise be omitted from the EMA representation.
EMA data from USC-Timit subject ‘M1’ was used for all Figure 3 analyses.
2.4.2. Saliency vectors
Saliency vectors, which quantify the neural activity patterns that maximally excite each of the decoder’s phoneme outputs, were generated by computing RNN logit gradients with respect to the input features. We used an RNN trained on all voiced speech days and then used the first 30 trials from each day’s test set to compute the gradients.
For each day’s data, we ran the RNN on each sample for five time steps (first initializing the hidden state to zeros); this allows the network some time to integrate information about the specific feature vector, and yields a phoneme probability output at each time step.

For each time step t, we then calculated the Jacobian matrix J, whose entries are the first-order partial derivatives of each logit output with respect to each input channel, i.e.

Jij = ∂ logiti / ∂ xj

This gradient records how small changes in each channel’s activity influence the probability of class i. We calculated the Jacobian for all phonemes and time steps, resulting in a time × channels × phonemes matrix M, where Mt contains the Jacobian at time step t. We then averaged across the time dimension to obtain an integrated estimate of how each channel’s activity influences the different phoneme class probabilities.
To compute the gradients, we used SmoothGrad [13], a method for denoising saliency maps by computing gradients over multiple noise-perturbed copies of the input, i.e.

M̂(x) = (1/n) Σk M(x + εk),  εk ∼ N(0, σ²)

The resulting saliency map estimates are then averaged together. We used n = 20 perturbations and a noise level of σ = 10 (relative to the overall range of firing rates after capping outliers). This extra step contributed a small but consistent improvement in similarity matrix correlations with the EMA data.
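The SmoothGrad averaging step can be sketched as follows, with `grad_fn` standing in for the RNN Jacobian computation (a hypothetical placeholder; the perturbation count mirrors the value above):

```python
import numpy as np

def smoothgrad(grad_fn, x, n=20, sigma=1.0, seed=0):
    """Average gradients over n noise-perturbed copies of input x.
    grad_fn: maps an input feature vector to a (channels, phonemes)
    Jacobian; sigma: noise s.d. (in the analysis above, set relative
    to the overall firing-rate range after capping outliers)."""
    rng = np.random.default_rng(seed)
    grads = [grad_fn(x + rng.normal(0.0, sigma, size=x.shape))
             for _ in range(n)]
    return np.mean(grads, axis=0)
```

For a nonlinear network, the averaging smooths out sharp local fluctuations in the gradient field; for a linear model the averaged gradient equals the true (constant) Jacobian.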
2.4.3. Similarity matrices
Similarity matrices in Figure 3a and 3d were computed using cosine similarity. That is, for each pair (x, y) of RNN saliency or EMA vectors, similarity was defined as

sim(x, y) = (x · y) / (‖x‖ ‖y‖)

This equation computes the cosine of the angle between x and y. Before computing cosine similarity, the vectors were first centered by subtracting the mean across all consonant vectors (Figure 3a) or all vowel vectors (Figure 3d).
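A minimal sketch of this centered cosine similarity computation (the function name and data layout are illustrative):

```python
import numpy as np

def centered_cosine_matrix(vectors):
    """Pairwise cosine similarity after subtracting the mean vector.
    vectors: (n_items, dim), e.g. one saliency or EMA vector per phoneme.
    Returns an (n_items, n_items) similarity matrix."""
    v = vectors - vectors.mean(axis=0)               # center across items
    u = v / np.linalg.norm(v, axis=1, keepdims=True) # unit-normalize rows
    return u @ u.T
```

Centering before normalizing means similarity reflects how each vector deviates from the group average, rather than being dominated by a shared mean component.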
2.4.4. EMA-Neural correlation
Figure 3b and 3e show the correlation between the neural and EMA phoneme representations, which was computed across all phonemes and dimensions after a cross-validated Procrustes alignment of the saliency vectors to the EMA vectors.
First, saliency vectors were concatenated into a 256 × 39 matrix X (256 neural features × 39 phonemes) and EMA vectors were concatenated into a 13 × 39 matrix Y (13 EMA dimensions × 39 phonemes). X and Y were then reduced to 8 dimensions using PCA applied across the rows, yielding an 8 × 39 matrix X̃ and an 8 × 39 matrix Ỹ.
X̃ was then aligned to Ỹ using a cross-validated orthogonal rotation (Procrustes analysis) with leave-one-out cross-validation. Specifically, for each vector xi in X̃, Procrustes analysis was applied to align all other vectors {x1, …, xi−1, xi+1, …, xn} to the matching vectors {y1, …, yi−1, yi+1, …, yn}, yielding an orthogonal rotation R. R was then applied to xi to yield x̃i. Orthogonality enforces that the rotation be rigid, so that the underlying structure in the data is preserved.
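This leave-one-out Procrustes alignment can be sketched with SciPy's `orthogonal_procrustes` (an illustrative re-implementation; rows here are phoneme vectors after PCA):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def loo_procrustes_align(X, Y):
    """Leave-one-out Procrustes alignment of X to Y.
    X, Y: (n_items, dim) arrays, one row per phoneme vector.
    Each row of X is rotated by an orthogonal matrix fit only on the
    other n-1 rows, so the alignment never sees the held-out item."""
    aligned = np.empty_like(X)
    n = len(X)
    for i in range(n):
        keep = np.arange(n) != i
        R, _ = orthogonal_procrustes(X[keep], Y[keep])  # rigid rotation
        aligned[i] = X[i] @ R
    return aligned
```

Because the rotation fit on n−1 items is applied to the held-out item, any correlation between the aligned vectors and the targets cannot arise from overfitting the rotation to noise.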
Finally, all x̃i vectors were concatenated into an (8×39) × 1 vector x̂ and all yi vectors were concatenated into an (8×39) × 1 vector ŷ. The Pearson correlation coefficient was then calculated between x̂ and ŷ, using consonant entries only (Figure 3b) or vowel entries only (Figure 3e).
As a control, this same procedure was repeated 10,000 times but with the columns of X shuffled into a random order, which allows estimation of what the correlation could be expected to be ‘by chance’ if each phoneme’s vector was random but drawn from the same distribution. Note that the cross-validation procedure causes the chance distribution to be centered at 0 (otherwise it would be biased upwards as Procrustes would overfit and align noise). The true correlation is far greater than any of the 10,000 shuffle results, indicating statistical significance (p<1e-4).
2.4.5. Low-dimensional visualization of phoneme geometry
To make the plots in Figure 3c and 3f, the neural saliency vectors were first aligned to the EMA vectors using cross-validated Procrustes analysis, as described in the section above. After alignment, the top two dimensions were plotted; for vowels, these two dimensions were rotated and flipped within the plane in order to highlight the classic (front vs. back) and (high vs. low) structure.
2.5. Neural correlation across days
To compute how correlated the neural representation of speech was between pairs of days (Fig. 4d), we used data from a “diagnostic block” collected at the beginning of each day. During this block, T12 completed an instructed delay task in which she attempted to speak individual words from a set of 7 words designed to span the space of articulation (8 repetitions per word). The word set consisted of the following words: ‘bah’, ‘choice’, ‘day’, ‘kite’, ‘though’, ‘veto’, and ‘were’. We also included a condition where T12 was instructed to rest silently (‘do nothing’).
First, threshold crossing rates for each trial were averaged within a 100 to 600 ms window after the go cue to yield a single firing rate vector for each trial (of length 128). Then, “pseudo-trial” vectors were created by concatenating together a single firing rate vector from each condition, resulting in pseudo-trial vectors of length 128 × 8 = 1024. The result of this step is a set of eight vectors {v1, v2, …, v8}, one for each of the eight repetitions of all conditions. When assessing the similarity between any two days, we then have two sets of vectors to consider: {v1, v2, …, v8} and {u1, u2, …, u8}. Consider each of these vectors as a random draw from a day-specific distribution (let us denote the two distributions as V and U). To quantify similarity, we estimated the correlation between the means of V and U (note that the means themselves are also vectors). The quantity of interest here is the mean because it represents the average firing rates observed for each condition (i.e., the neural representation of each word). To estimate the correlation between the means of V and U, we used a cross-validated measure of correlation that reduces the impact of noise. See our prior work [11] and the accompanying code repository https://github.com/fwillett/cvVectorStats for more details about this method. Importantly, this cross-validated method is different from simply correlating the sample means v̄ and ū, which would underestimate the true correlation due to noise that causes the estimated means to appear more dissimilar than they really are. For example, even if V and U have identical means, noise in the vi and ui would always cause the estimated correlation to be less than 1 when correlating v̄ and ū directly.
To make the plot in Fig. 4d, we included all pairings of the following 12 days on which a diagnostic block of attempted vocal speaking was collected: 2022.06.16, 2022.06.21, 2022.06.28, 2022.07.05, 2022.07.07, 2022.07.14, 2022.07.21, 2022.07.27, 2022.07.29, 2022.08.02, 2022.08.11, 2022.08.13.
3. Decoder performance metrics
3.1. Phoneme transcription and labelling
Each sentence prompt was transcribed into a sequence of phonemes using the CMU Pronouncing Dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict) and the g2p software package [14]. The CMU Dictionary uses 39 phonemes, each denoted using the ARPAbet symbol set developed for speech recognition (see Table 4 for the correspondence between IPA notation and ARPAbet notation, and https://en.wikipedia.org/wiki/ARPABET). We added a “silent” phoneme at the end of each word in order to denote the separation between words. Note that we did not incorporate the stress labeling given by the CMU dictionary for vowels (i.e., we labeled each vowel in the same way regardless of how it is stressed in the word).
3.2. Error rates and words per minute
We evaluate both phoneme error rate and word error rate. Phoneme error rate was defined as the edit distance between the decoded sequence of phonemes and the prompt sentence phoneme transcription (i.e., the number of insertions, deletions or substitutions required to make the sequence of phonemes match exactly). Similarly, word error rate was the edit distance defined over sequences of words.
Note that the reported error rates are the combined result of many independent sentences. To combine data across multiple sentences, we summed the number of errors across all sentences and divided this by the total number of phonemes/words across all sentences (as opposed to computing an error rate percentage for each sentence first and then averaging the percentages). This helps prevent very short sentences from overly influencing the result.
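The pooled error-rate computation can be sketched as follows (an illustrative implementation of edit distance and pooled word error rate; names are hypothetical):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences: the number of
    insertions, deletions, or substitutions needed to turn hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,            # deletion
                         cur[j - 1] + 1,         # insertion
                         prev[j - 1] + (r != h)) # substitution / match
        prev = cur
    return prev[-1]

def aggregate_wer(refs, hyps):
    """Pool errors across all sentences before dividing, so short
    sentences don't dominate the estimate."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(refs, hyps))
    total_words = sum(len(r.split()) for r in refs)
    return errors / total_words
```

The same `edit_distance` applies to phoneme sequences; only the tokenization changes.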
Words per minute was defined as the number of words spoken divided by the total amount of time taken. The time for each trial was measured from when the cue turned green to when the participant pushed the button to signal she had finished saying the prompt sentence, and was summed across trials.
Confidence intervals for word error rates and words per minute were computed via bootstrap resampling over individual trials and then re-computing error rates and speeds over the resampled distribution (10,000 resamples).
4. RNN architecture
We used a 5 layer, stacked gated recurrent unit RNN [15] to convert T12’s neural activity into a time series of phoneme probabilities. The RNN ran at a 4-bin frequency (20 ms bins), outputting a phoneme probability vector every 80 ms. A 14-bin window of neural activity was stacked together and fed as input to the RNN at each 80 ms cycle (in other words: kernel size = 14, stride = 4). See SFig4 for parameter sweeps that justify these and other architecture choices.
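The input stacking (kernel size 14, stride 4) can be sketched as below; this is a simplified offline view, whereas the real-time pipeline streams bins as they arrive:

```python
import numpy as np

def stack_inputs(features, kernel=14, stride=4):
    """Stack a sliding window of binned features to form the RNN input.
    features: (T, n_features) array of 20 ms bins. Returns an array of
    shape (n_steps, kernel * n_features): one stacked vector per 80 ms
    decoding cycle (stride of 4 bins x 20 ms)."""
    T, _ = features.shape
    steps = [features[start:start + kernel].reshape(-1)
             for start in range(0, T - kernel + 1, stride)]
    return np.asarray(steps)
```

Each 80 ms output thus sees 280 ms (14 × 20 ms) of neural context, with consecutive windows overlapping by 10 bins.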
4.1. Feature pre-processing and day-specific input layers
Threshold crossing rates and spike band power features were pre-processed by binning into 20 ms time steps, “z-scoring” (mean-subtracting and dividing by the standard deviation), causally smoothing by convolving with a Gaussian kernel (sd = 40 ms) delayed by 160 ms, concatenating into a 256 × 1 vector, and then transforming with a day-specific input layer. Z-scoring was performed using block-specific means and standard deviations (to account for non-stationarities in the features that accrue over time across blocks). Using day-specific input layers outperformed the alternative of a shared input layer across all days (SFig 4B).
The day-specific input layers consisted of an affine transformation applied to the feature vector, followed by a softsign activation function:

x̃t = softsign(Wi xt + bi)

Here, x̃t is the day-transformed input vector at time step t, Wi is a 256 × 256 matrix and bi is a 256 × 1 bias vector for day i, and the softsign function is applied element-wise to the resultant vector (where softsign(x) = x / (1 + |x|)). Wi and bi were optimized simultaneously along with all other RNN parameters. During training, dropout was applied both prior to and after the softsign.
4.2. Rolling z-scoring
During online evaluation, we used a rolling estimate of the mean and standard deviation of each feature to perform z-scoring. This helps account for neural non-stationarities that accrue across time, and substantially outperforms the alternative of using the prior block’s means and standard deviations (SFig 4A).
For the first ten sentences of a new block, we used a weighted average of the prior block’s mean estimate and the mean of the sentences collected so far in the current block:

ui = λi · uprev + (1 − λi) · ucurr

Here, ui is the mean used to z-score sentence i, uprev is the prior block’s mean estimate, ucurr is the mean across all sentences collected so far in the current block, and the weight λi on the prior block’s estimate decreases toward zero as sentences accumulate. After ten sentences had been collected, we stopped incorporating the prior block’s mean and simply took the mean across the most recent min(20, N) sentences, where N is the number of sentences collected so far in the current block. The standard deviation was updated in the same way as the mean.
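A sketch of this rolling z-scoring is below. The blend between prior-block and current-block statistics is shown as a linear ramp over the first ten sentences, which is an assumption about the exact weighting (the text above specifies only a weighted average); the class name and interface are hypothetical:

```python
import numpy as np

class RollingZScore:
    """Rolling per-feature z-scoring across sentences within a block."""
    def __init__(self, prev_mean, prev_std, max_history=20):
        self.prev_mean, self.prev_std = prev_mean, prev_std
        self.history = []          # per-sentence (time, features) matrices
        self.max_history = max_history

    def normalize(self, sentence_feats):
        i = len(self.history)      # sentences completed so far this block
        if self.history:
            recent = np.vstack(self.history[-self.max_history:])
            cur_mean, cur_std = recent.mean(0), recent.std(0)
        else:
            cur_mean, cur_std = self.prev_mean, self.prev_std
        w = max(0.0, 1.0 - i / 10.0)   # assumed linear prior-block weight
        mean = w * self.prev_mean + (1 - w) * cur_mean
        std = w * self.prev_std + (1 - w) * cur_std
        z = (sentence_feats - mean) / np.maximum(std, 1e-6)
        self.history.append(sentence_feats)
        return z
```

After ten sentences the prior-block weight reaches zero and the statistics come entirely from the most recent min(20, N) sentences, as described above.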
5. RNN training overview
5.1. Connectionist temporal classification (CTC) loss
Due to T12’s inability to produce intelligible speech, we had no ground truth labels of what phonemes were being spoken at each time step. The lack of ground truth labels makes it difficult to apply simple supervised training techniques to train the RNN. To get around this problem, we used the Connectionist Temporal Classification (CTC) loss function, which can train neural networks to output a sequence of symbols (in this case, phonemes) given unlabeled time series input [16]. Using the CTC loss function results in an RNN that is trained to output a time series of phoneme probabilities (with an extra “blank” token probability). A language model can then be used to infer a sequence of underlying words from these probabilities, or phonemes can be decoded from these probabilities simply by emitting the phoneme of maximum probability at each time step (while taking care to omit repeats and time steps where “blank” is the maximum probability).
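Decoding phonemes directly from the CTC output probabilities (without a language model) can be sketched as the standard greedy collapse (the blank index and data layout here are illustrative):

```python
import numpy as np

BLANK = 0   # illustrative index of the CTC "blank" token

def greedy_ctc_decode(probs):
    """Collapse a (T, n_tokens) matrix of per-step phoneme probabilities
    into a phoneme sequence: take the argmax at each step, merge
    consecutive repeats, then drop blanks."""
    best = np.argmax(probs, axis=1)
    out, prev = [], None
    for t in best:
        if t != prev and t != BLANK:
            out.append(int(t))
        prev = t
    return out
```

The repeat-merge rule is why CTC needs the blank token: a genuine doubled phoneme must be separated by a blank to survive the collapse.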
5.2. Artificial noise
We added two types of artificial noise to the neural features to regularize the RNN. First, we added white noise directly to the input feature vectors at each time step. Adding white noise to the inputs encourages the RNN to map clouds of similar inputs to the same output, improving generalization. Second, we added artificial constant offsets to the means of the neural features, to make the RNN more robust to non-stationarities in the neural data. Drifts in the baseline firing rates that accrue over time have been an important problem for intracortical BCIs [17, 18, 19]. The constant offset values were randomly chosen on each minibatch and were constant across all time steps in the minibatch, but unique to each feature.
The two above-mentioned types of noise (white noise and constant offset noise) were combined to transform the input vector in the following way:

x̂t = xt + εt + ϕ

Here, x̂t are the neural features with noise added, xt are the original neural features, εt is a white noise vector unique to each time step, and ϕ is a constant offset vector.
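A sketch of this noise augmentation (the noise standard deviations shown are placeholders, not the tuned hyperparameters from Table 5):

```python
import numpy as np

def add_training_noise(batch, white_sd=0.2, offset_sd=0.3, seed=None):
    """Add white noise (unique per time step) and a constant mean offset
    (unique per feature, shared across all time steps in the minibatch).
    batch: (n_trials, T, n_features) array of neural features."""
    rng = np.random.default_rng(seed)
    white = rng.normal(0.0, white_sd, size=batch.shape)
    # one offset per feature, constant across trials and time steps
    offset = rng.normal(0.0, offset_sd, size=(1, 1, batch.shape[2]))
    return batch + white + offset
```

The constant offset simulates the baseline firing-rate drifts described above, teaching the RNN to ignore per-feature mean shifts.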
5.3. Supervised training
The RNN was implemented with TensorFlow 2 and trained using stochastic gradient descent (Adam; β1 = 0.9, β2 = 0.999, ε = 0.1) for 10,000 minibatches (batch size = 64). The learning rate was decayed linearly from 0.02 to 0.0 across the 10,000 minibatches. We applied dropout and L2 weight regularization during training to improve generalization. See Table 5 for a list of RNN hyperparameters.
6. Offline performance sweeps
6.1. Overview
To determine the effect of different design choices made for the RNN architecture, and to understand the impact of data quantity and channel count, we performed several performance sweeps offline (results from this are shown in SFig 4 and Fig 4). Unless otherwise specified, we trained 10 seeds of an RNN model for each variation of parameters and performed inference on a standardized set of held-out test data. To define the training/held-out set, we took 40 sentences at random from each day as the held-out set and used the remaining sentences as offline training data, drawing from all open-loop/stage 1 sentences on each day (but excluding stage 2 evaluation data). We used the original language model that was run online (as opposed to the improved version reported in Table 1).
We ran parameter sweeps for the GRU-RNN architecture choices, including number of units, number of layers, kernel size and stride (SFig 4e-h). A comparison between a shared input network and a unique input network per session was also run (SFig 4b). Furthermore, performance when using different kinds of features was compared in the same manner (SFig 4c-d), including four different threshold crossing thresholds (−3.5, −4.5, −5.5, −6.5 × RMS), spike band power, area 44 vs. area 6v features, and the mel-frequency cepstral coefficients (MFCCs) of the participant’s recorded audio during attempted speaking sessions. MFCCs were computed using a 40 ms window (with MATLAB 2020a’s “mfcc” function).
6.2. Effect of channel count on performance
To determine the effect of channel count on decoding performance (Fig 4b), we trained 100 RNN seeds at each channel count (multiples of 10, up to the full 128 channels). For each of the 100 seeds at each channel count, channels were randomly selected without replacement. To predict performance for channel counts beyond 128, a least squares linear regression was fit to the log-log relationship between the number of channels and the error rate.
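The power-law extrapolation can be sketched as a linear fit in log-log space (an illustrative implementation; names are hypothetical):

```python
import numpy as np

def fit_loglog_extrapolation(channel_counts, error_rates):
    """Fit log(error) = a * log(channels) + b and return a predictor for
    unseen channel counts (i.e., a power-law extrapolation)."""
    a, b = np.polyfit(np.log(channel_counts), np.log(error_rates), 1)
    return lambda n: np.exp(a * np.log(n) + b)
```

A linear relationship in log-log coordinates corresponds to error ∝ channels^a, so the fitted slope a directly summarizes how quickly errors fall as electrodes are added.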
6.3. Amount of training data
To plot the number of days of training data versus performance (SFig 4i), RNNs were trained for each of the 5 vocal speaking evaluation days separately, with the number of training days varied by adding prior days consecutively in reverse chronological order until all previous days were included. Performance was assessed only on the given evaluation day. Word error rates were then averaged over all evaluation days to produce a single (# of days) vs. (word error rate) curve.
We also tested whether or not it was necessary to retrain the RNN decoder on each new performance evaluation day using hundreds of new sentences collected on that day, or whether fewer (or no) new sentences might have also yielded good performance, which would be a more realistic use case (Fig 4c). For this analysis, models were trained on the five attempted speech evaluation sessions (sessions 18,19,21,22,23) using reduced subsets of sentences from the given evaluation day (while still using all historical data). The input layer for each given evaluation day was also tied to be the same as the most recent historical day, in order to prevent overfitting when using a small number of training sentences. Once trained, RNNs were evaluated on the same set of “stage 2” online evaluation sentences used to report performance in Figure 2.
6.4. Language model vocabulary size sweep
To test how the number of words in the language model (LM) affects decoding accuracy, we built 3-gram LMs with various vocabulary sizes. These LMs were built following the same procedure as in Section 7, but with vocabulary sizes varying from 50 to 140,000. We started with the 50 words from [8] and gradually added words until the vocabulary size reached 140,000. The added words were chosen from the LM training corpus, in order of their frequency in the corpus. When the vocabulary size became greater than 4,500, we pruned the LM with threshold 1e-9. To measure the WER, we ran the LM decoders on the CTC probabilities output by the RNN from the 8 real-time speech decoding sessions (18, 19, 21, 22, 23, 24, 25, and 26).
7. Language model
7.1. Overview
We used an n-gram language model (LM) to decode word sequences from RNN outputs for real-time decoding and offline analyses. Here, we give an overview of the major steps involved. The n-gram LM was created with Kaldi [20] using the OpenWebText2 corpus [9]. We first preprocessed the text corpus to include only English letters and limited punctuation marks. Then we used Kaldi to construct an n-gram LM, using either the CMU Pronouncing Dictionary (125k words) or the 50 words from [8]. The LM was represented in the form of a weighted finite-state transducer [21], which can be used to translate a sequence of CTC labels into candidate sentences.
7.2. OpenWebText2 preprocessing
Our n-gram LM was created using samples from OpenWebText2 [9]. OpenWebText2 is a text corpus covering all Reddit submissions from 2005 up until April 2020. We downloaded the entire corpus and randomly sampled 95% as a training corpus. We preprocessed the training corpus to include only English letters and 4 punctuation marks (period, comma, apostrophe, and question mark). The preprocessed corpus was then split into sentences and converted to upper case (yielding a total of 634M sentences with 99B words).
7.3. Constructing the n-gram language model
We used publicly available scripts as a starting point for constructing our n-gram LM. The script first uses SRILM [22] to count the frequencies of n-grams (unigrams, bigrams, trigrams, etc.) in the training corpus. We used the Good-Turing discounting method [23] to improve probability estimation of unseen or rare word combinations. Words that are not in the pronunciation dictionary are mapped to a special token <UNK>. When using the CMU Pronouncing Dictionary, the resulting LM is too large to fit into the main memory of the Ubuntu computer used for real-time inference. We therefore pruned the n-gram LM using SRILM, which removes n-grams that cause the perplexity of the LM to increase by less than a threshold. The LM built with the 50 words from [8] was not pruned. For online real-time decoding, we used a 3-gram LM pruned with threshold 1e-9. For offline analyses, we used a 5-gram LM pruned with threshold 4e-11.
The n-gram LM was then converted to a weighted finite-state transducer (WFST) [21]. A WFST is a finite-state acceptor in which each transition has an input symbol, an output symbol and a weight. A path through the WFST takes a sequence of input symbols and emits a sequence of output symbols. We followed the recipe in [24] to construct our WFST search graph:

S = T ∘ L ∘ G

Here, ∘ denotes composition. G is the grammar WFST that encodes legal sequences of words and their probabilities based on the n-gram LM. L is the lexicon WFST that encodes what phonemes are contained in each legal word. A silence state is added to the beginning of the sentence and the end of each word to model the non-speaking state. We did an offline sweep of the silence state probability and found 0.9 to be optimal. Finally, T is the token WFST that maps a sequence of RNN output labels to a single phoneme. In our case, T contains all the individual phonemes plus the CTC blank symbol. For more details about how the three WFSTs were composed, refer to [24].
7.4. Inference with the n-gram language model
We used the LM decoder implementation in WeNet [25] for efficient real-time inference. WeNet is a wrapper around Kaldi that simplifies the implementation of a real-time LM decoder. The LM decoder runs an approximate Viterbi search (beam search) algorithm on the WFST search graph to find the most likely sequences of words. The WFST search graph encodes the mapping from a sequence of CTC labels emitted by the RNN to a sequence of words. During inference, the beam search combines information from the WFST (state transition probabilities) with information from the RNN decoder about which CTC labels are likely occurring at each moment in time. We do not normalize the CTC label probabilities as in [24].
The decoding parameters for beam search are defined in Table 6. The beam search runs every 80 ms, after the RNN emits CTC label probabilities. On average, each beam search step took less than 1 ms to complete.
7.5. Offline language model optimization
After data collection was completed, we further optimized the LM and found that the online decoding WER could have been improved by 6.4% with an improved LM architecture (Table 1 in main text). To improve the LM, we used a 5-gram LM instead of a 3-gram LM and employed a 2-pass decoding strategy. The first pass of the 2-pass decoder is the same as the 1-pass decoder described above, but instead of outputting a decoded sentence, it outputs a word lattice [26, 27]. A word lattice is a directed graph where each node is a word and an edge between nodes encodes the transition probability between words. It is an efficient representation of the possible word sequences. The second pass of the 2-pass decoder uses an unpruned n-gram LM to rescore the word lattice. Rescoring replaces the original LM score with a more accurate score from the unpruned LM. After rescoring, we pick the best path through the word lattice as the decoding output.
Finally, we found that using a transformer LM [28] to rescore the candidate sentences in a third pass could further improve decoding accuracy. Transformer LMs have been the state of the art in many natural language tasks in recent years [29, 30]. Compared to an n-gram LM, which models a limited context (e.g., 3 words for a trigram model), a transformer LM can model much longer contexts (e.g., 1024 words). Training a transformer LM requires significant computational resources, so we used the publicly available pre-trained OPT LM [31]. We used the largest OPT LM (6.7B parameters) that can fit on one NVIDIA A100 40GB GPU. The OPT LM was used to rescore the n-best outputs from the 2-pass decoder.
The 2-pass decoder first outputs at most n sentences with the highest decoding scores. The decoding score of a sentence s was defined as follows:

Q(s) = α · log PRNN(s) + log Pngram(s)

Here, PRNN(s) is the probability of sentence s’s corresponding CTC label sequence output by the RNN, Pngram(s) is the sentence’s probability estimated by the n-gram LM, and α is the acoustic scale defined in Table 6.
We then used OPT to evaluate the probability of each sentence in the n-best list and linearly interpolated it with the n-gram LM’s probability. The new score function was defined as follows:

Q′(s) = α · log PRNN(s) + β · log POPT(s) + (1 − β) · log Pngram(s)

Here, POPT(s) is sentence s’s probability estimated by the OPT LM, and β is the LM weight defined in Table 6. The top-scored sentence is the final decoding output.
Finally, we found that decoding accuracy was improved by dividing the CTC blank label probability by a constant value [32], which adds a cost for not outputting any labels.
All LM decoding parameters were optimized via grid search on a validation data set (sessions 7-16).
Acknowledgments
We thank participant T12 and her caregivers for their generously volunteered time and effort as part of the BrainGate2 pilot clinical trial, Beverly Davis, Kathy Tsou, and Sandrin Kosasih for administrative support, and Yaguang Hu for providing suggestions about the language model. Support provided by the Office of Research and Development, Rehabilitation R&D Service, Department of Veterans Affairs (N2864C, A2295R); Wu Tsai Neurosciences Institute; Howard Hughes Medical Institute; Larry and Pamela Garlick; Simons Foundation Collaboration on the Global Brain 543045; NIDCD R01-DC014034; NIDCD U01-DC017844; NINDS U01-NS123101.