Artificial Neural Networks Accurately Predict Language Processing in the Brain

Abstract

The ability to share ideas through language is our species' signature cognitive skill, but how this feat is achieved by the brain remains unknown. Inspired by the success of artificial neural networks (ANNs) in explaining neural responses in perceptual tasks (Kell et al., 2018; Khaligh-Razavi & Kriegeskorte, 2014; Schrimpf et al., 2018; Yamins et al., 2014; Zhuang et al., 2017), we here […] only poorly predict neural responses (<10% predictivity). Models' predictivities are consistent across neural datasets, and also correlate with their success on a next-word-prediction task (but not other language tasks) and with their ability to explain human comprehension difficulty in an independent behavioral dataset. Intriguingly, model architecture alone drives a large portion of […]

Introduction

As you read these words, your brain is extracting, from simple marks on the page, a complex and structured representation of the meaning of this sentence. This comprehension process causally depends on a left-lateralized fronto-temporal brain network […], and comprehensively comparing the models' match to human data provides a way to begin testing those hypotheses.

Here, we tested the relationship between 43 diverse state-of-the-art ANN language models (henceforth 'models') and three neural language comprehension datasets (two fMRI, one ECoG) to address four fundamental questions. First, we asked which models best capture human neural responses. Second, we tested whether the models that are most brain-like in their representations perform best on a next-word-prediction task, as might be expected given the critical role of predictive processing in human perception and cognition […]. Third, we asked whether the models with the highest neural predictivity also best capture behavioral signatures of human language processing in a reading task. Finally, we explored the relative contributions to brain predictivity of two aspects of model design, network architecture and training experience, to begin testing hypotheses about the computations performed by the brain's language network and about how it might have arisen through some combination of evolutionary and learning-based optimization.

Figure 1: Comparing Artificial Neural Network models of language processing to activity in the brain's language network. We tested how well different models predict human measurements of neural activity during language comprehension. The candidate models ranged from simple embedding models to more complex recurrent and transformer networks. Stimuli ranged from sentences to passages to stories and were 1) fed into the models, and 2) presented to human participants (visually or auditorily). The resulting internal representations were then 1) captured from the models, and 2) recorded from humans with fMRI or ECoG. To compare model and human representations, we used linear regression to predict the human measurements from the model representations for 80% of the stimuli, and then compared the predictions for the held-out 20% of stimuli against the held-out human data with Pearson correlation (cross-validated 5-fold), yielding a similarity score for each model-data pair.
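To make the scoring procedure in Figure 1 concrete, here is a minimal sketch of a cross-validated encoding analysis. The function name, the ridge penalty, and the array shapes are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def brain_score(model_acts, neural_data, n_splits=5, seed=0):
    """Cross-validated similarity between model activations and neural data.

    model_acts:  (n_stimuli, n_features) activations from one model layer
    neural_data: (n_stimuli, n_voxels) fMRI/ECoG responses to the same stimuli
    """
    fold_scores = []
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(model_acts):
        reg = Ridge(alpha=1.0).fit(model_acts[train], neural_data[train])  # fit on ~80%
        pred = reg.predict(model_acts[test])                               # predict held-out ~20%
        # Pearson correlation per voxel/electrode, then averaged across voxels
        rs = [pearsonr(pred[:, v], neural_data[test][:, v])[0]
              for v in range(neural_data.shape[1])]
        fold_scores.append(np.nanmean(rs))
    return float(np.mean(fold_scores))
```

In the full analysis, this raw correlation is additionally divided by the dataset's noise ceiling, as described in the Results below.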

Results
We evaluated a broad range of state-of-the-art ANN models on the match of their internal representations to three human neural datasets. The models spanned all major classes of existing language models and included embedding models (e.g., GloVe) […] quantitatively exhibits the highest internal reliability (Fig. S1), and then generalized the findings to the other datasets. Because our research questions concern language processing, we extracted neural responses from language-selective voxels or electrodes that were functionally identified by an extensively validated independent "localizer" task that contrasts reading sentences versus nonword sequences (Fedorenko et al., 2010). This localizer robustly identifies the fronto-temporal language-selective network (Methods_1-3, Fig. 2b, S3).

To compare a given model to a given dataset, we presented the same stimuli to the model that were presented to humans in the neural recording experiments and 'recorded' the model's internal activations (Methods_5-6, Fig. 1). […] We then compared the model's predictions to the held-out neural measurements by computing Pearson's correlation coefficient. We further normalized these correlations by the extrapolated reliability of the particular dataset, which places an upper bound ("ceiling") on the correlation between the neural measurements and any external predictions (Methods_7, Fig. S1). The final measure of a model's performance ('predictivity' or 'score') on a dataset is thus Pearson's correlation between model predictions and neural recordings, divided by the estimated ceiling and averaged across voxels/electrodes/regions and participants. We report the score for the best-performing layer of each model (Methods_6, Fig. S10).

[…] Indeed, next-word-prediction task performance robustly predicts neural scores (Fig. 3).

Figure 3: Model performance on a next-word prediction task predicts neural scores. Next-word-prediction task performance was evaluated as the surprisal between the predicted and true next word in the WikiText-2 dataset of 720 Wikipedia articles, i.e., perplexity (x-axis, lower is better). Next-word-prediction task scores strongly predict neural scores for Pereira2018 and Blank2014; the correlation for Fedorenko2016 is also positive but not significant.

[…] and suggest that further improving models' neural predictivity will simultaneously improve their behavioral predictivity.
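For reference, the next-word-prediction measure in Figure 3 corresponds to the standard perplexity computation for a causal language model. Below is a minimal sketch using the Hugging Face transformers library on a single short text; the model choice and the absence of WikiText-2 windowing/tokenization details are simplifying assumptions, not the paper's exact evaluation protocol:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """exp(mean cross-entropy of the true next token); lower is better."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids, the model shifts targets by one position internally,
        # so the loss is the average surprisal of each next token.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

print(perplexity("The ability to share ideas through language is our species' signature skill."))
```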

An intriguing outlier in this analysis is the skip-thoughts model, which predicts neural activity only moderately, but predicts reading times at ceiling.

Next-word-prediction task performance correlates with behavioral predictivity. Next-word-prediction task performance is predictive of reading times (Fig. 4c; r=.37, p<.05). Note that this relationship, like the brain-to-behavior one, is not as strong as the one between next-word-prediction task performance and neural predictivity. This difference could point to additional mechanisms, on top of predictive language processing, that are recruited for the reading task.

Model architecture alone yields predictive representations. All models come with intrinsic architectural properties, such as size, the presence of recurrence, and the directionality and length of the context used to perform the target task (Methods_5, Table S9). These differences strongly affect model performance on normative tasks like next-word prediction after training, and they define the representational space that a model can learn (Arora et al., 2018; Fukushima, 1988). To test whether model architecture alone, without training, already yields representational spaces similar to those implemented by the language network in the brain, we evaluated models with their initial (random) weights. Strikingly, even with no training, several model architectures reliably predicted brain activity (Fig. 5). For example, on Pereira2018, untrained GPT2-xl achieves a predictivity of ~80%, only ~20% lower than the trained network. (Importantly, a random context-independent embedding with equal dimensionality but no architectural priors predicts only 27% of this dataset (Fig. S7), suggesting that a large feature space alone, without architectural priors, is not sufficient.) A similar trend is observed across models: training generally improves neural predictivity, on average by .3 (an 85% relative improvement). Across models, the untrained scores are strongly predictive of the trained scores (r=.71, p<<.00001), indicating that models that predict human neural data poorly with random weights also perform poorly after training, whereas models that already predict neural data well with random weights improve further with training.

Although untrained scores are also predictive of trained scores for Fedorenko2016 and Blank2014, some training objectives seem to make model representations less brain-like. ALBERT models, for instance, drop from an untrained score of ≈87% on Fedorenko2016 to a trained score of ≈57%. The best model in our analyses, GPT2-xl, also predicts Blank2014 better without training: its performance decreases from 64% with random weights to 32% after training.

Figure 5: Model architecture alone already yields predictive representations, and untrained performance predicts trained performance. We evaluate untrained models by keeping weights at their initial random values. The resulting representations are driven by architecture alone and are tested on the three neural datasets (a, Fig. 2) and the behavioral dataset (b, Fig. 4). Across all datasets, architecture alone yields representations that predict human brain activity considerably well. For Pereira2018, training typically improves scores by 85%, whereas for Fedorenko2016, Blank2014, and Futrell2018, training does not always change, and for some models even decreases, the similarity with human measurements.
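As a concrete illustration of the untrained-model evaluation, a transformer architecture can be instantiated with freshly initialized random weights instead of the pretrained checkpoint. This is a sketch with the Hugging Face transformers library; the choice of layer and the mean-pooling over tokens are illustrative assumptions:

```python
import torch
from transformers import GPT2Config, GPT2Model, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

trained = GPT2Model.from_pretrained("gpt2").eval()  # learned weights
untrained = GPT2Model(GPT2Config()).eval()          # same architecture, random init

def layer_activations(model, sentence, layer=6):
    """Mean-pooled hidden state of one layer for one sentence (pooling is illustrative)."""
    ids = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1).squeeze(0).numpy()

# Both representation sets can then be scored against neural data,
# e.g., with the brain_score() sketch above.
sent = "As you read these words, your brain is extracting a structured representation."
acts_trained, acts_untrained = (layer_activations(m, sent) for m in (trained, untrained))
```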

Discussion
Our results, summarized in Fig. 6, show that specific ANN language models can predict human neural and behavioral responses to linguistic input with high accuracy: the best models achieve, on some datasets, perfect predictivity relative to the noise ceiling. Neural predictivity correlates across datasets spanning measurement modalities (fMRI, ECoG, reading times) and diverse materials presented visually and auditorily, establishing the robustness and generality of these findings. Both neural and behavioral predictivity correlate with model performance on the normative next-word-prediction task, but not with performance on other language tasks (SI-5). Finally, model architectures alone, with random weights, produce representations that capture neural and behavioral linguistic responses and closely 'track' the representations obtained with learned weights across datasets.

Underlying this and related work is the idea that large-scale neural networks can serve as possible mechanistic hypotheses of brain processing. The first goal, we would argue, is to identify models that are highly predictive of brain activity. The most predictive models can then be dissected to uncover the features critical for achieving high brain scores, and can serve as the basis for the next generation of mechanistic models of human language comprehension. We found a number of models with perfect or near-perfect predictivity of neural and behavioral language responses, and took first steps toward interpreting these models. First, we found that the models' neural predictivity strongly relates to their performance on a normative 'language modeling' task (next-word prediction) but not other tasks (SI-5; see also Gauthier & Levy, 2019). Language modeling is the task of choice in the natural language processing (NLP) community: it is simple, unsupervised, scalable, and appears to produce the most generally useful and successful language representations. This is likely because language modeling encourages a neural network to build a joint probability model of the linguistic signal, which implicitly requires sensitivity to diverse kinds of regularities in the signal. At a deeper conceptual level, predictive processing has advanced to the forefront of theorizing in cognitive science (Clark, 2013; Tenenbaum et al., 2011) and neuroscience (Keller & Mrsic-Flogel, 2018). An intriguing possibility is therefore that both the human language system and the ANN models of language are optimized to predict upcoming words in the service of efficient meaning extraction.

We also demonstrated that architecture alone, with random weights, can yield representations that match human brain data well. If we construe model training as analogous to learning in human development, then human cortex might already provide a sufficiently rich structure that allows for the rapid acquisition of language […] structural changes are tested and the best-performing ones are incorporated into the next generation of models. Importantly, this process implicitly still optimizes for certain tasks (like language modeling), only on a different timescale.

These discoveries pave the way for many exciting future directions. The most brain-like language models can now be investigated in richer detail, ideally leading to intuitive theories of their inner workings. Such research is much easier to perform on models than on biological systems, since all of a model's structure and weights are easily accessible and manipulable […] based on their performance on the next-word-prediction task vs. other tasks (see also Gauthier & Levy, 2019).

Figure 6: Summary of the key results. Normalized neural and behavioral predictivities are shown in circles. For the neural datasets (averaged and individual, top row) and for the behavioral dataset (bottom right), we report i) the value for the model achieving the highest predictivity (GPT2-xl is the model with the highest average score across the four human datasets), and ii) the correlation between the untrained and trained scores. As discussed in Results, architecture alone yields reasonable scores, which are also predictive of model scores after training. The next-word-prediction task (bottom left) predicts neural and behavioral scores; and neural scores predict behavioral scores.
Re-training many models on many tasks from scratch might determine which features are most important for brain predictivity, but this is currently prohibitively expensive given the vast space of hyper-parameters. Further, the fact that language modeling is inherently built into the evolution of language models by the NLP community, as noted above, may make it impossible to fully eliminate its influence on the architecture even for models trained from scratch on other tasks.

How can we develop models that are even more brain-like? Despite impressive performance on the datasets and metrics used here, ANN language models remain far from human-level performance on the hardest problems of language understanding. An important open direction is to integrate language models like those used here with models and data resources that attempt to capture aspects of meaning important for commonsense world knowledge […] on Blank2014, which uses story materials and may therefore require long-range contexts.

One key missing piece in the mechanistic modeling of human language processing is a more detailed mapping from model components onto brain anatomy. In particular, aside from the general targeting of the fronto-temporal language network, it is unclear which parts of a model map onto which components of the brain's language processing mechanisms. In models of […] The network that supports higher-level linguistic interpretation, which we focus on here, is extensive and plausibly contains meaningful functional dissociations, but how the network is precisely subdivided and what respective roles its different components play remain debated. Uncovering the internal structure of the human language network, for which intracranial recording approaches with high spatial and temporal resolution may prove critical (Mukamel & Fried, 2012; Parvizi & Kastner, 2018), would allow us to guide and constrain models of tissue-mapped mechanistic language processing. More precise brain-to-model mappings would also allow us to test the effects of perturbations on models and compare them against perturbation effects in humans, as assessed with lesion studies or reversible stimulation. More broadly, anatomically and functionally precise models are a required software component of any form of brain-machine interface.

Taken together, our findings lay a critical foundation for a promising research program synergizing high-performing mechanistic models of natural language processing with large-scale neural and behavioral measurements of human language processing in a virtuous cycle: testing models' ability to predict neural and behavioral measurements, dissecting the best-performing models to understand which components are critical for high brain predictivity, developing better models that leverage this knowledge, and collecting new data to challenge and constrain future generations of neurally plausible models of language processing.

Methods

[…] four sentences each), and stimuli for Experiment 3 consisted of 243 sentences (72 text passages, 3 or 4 sentences each). The two sets of materials were constructed independently, and each spanned a broad range of content areas. Sentences were 7-18 words long in Experiment 2 and 5-20 words long in Experiment 3.
The sentences were presented on the screen one at a time for 4 s (followed by 4 s of fixation, with an additional 4 s of fixation at the end of each passage), and each participant read each sentence three times, across independent scanning sessions (see Pereira et al., 2018 for details of the experimental procedure and data acquisition).

Preprocessing and response estimation: Data preprocessing was carried out with SPM5 (using default parameters, unless specified otherwise) and supporting custom MATLAB scripts. (Note that SPM was only used for preprocessing and basic modeling, aspects that have not changed much in later versions; for several datasets, we have directly compared the outputs of data preprocessed and modeled in SPM5 vs. SPM12, and the outputs were nearly identical.) Preprocessing included motion correction (realignment to the mean image of the first functional run using 2nd-degree b-spline interpolation), normalization (estimated for the mean image using trilinear interpolation), resampling into 2 mm isotropic voxels, smoothing with a 4 mm FWHM Gaussian filter, and high-pass filtering at 200 s. A standard mass-univariate analysis was performed in SPM5 whereby a general linear model (GLM) estimated the response to each sentence in each run. These effects were modeled with a boxcar function convolved with the canonical hemodynamic response function (HRF). The model also included first-order temporal derivatives of these effects (which were not used in the analyses), as well as nuisance regressors representing entire experimental runs and offline-estimated motion parameters.

[…] responsive voxels (voxels with the highest t-value for the localizer contrast), following the standard approach in prior work. This approach allows pooling data from the same functional regions across participants even when these regions do not align well spatially. Functional localization has been shown to be more sensitive and to have higher functional resolution (Nieto-Castanon & Fedorenko, 2012) than the traditional group-averaging approach (Holmes & Friston, 1998), which assumes voxel-wise correspondence across participants. This is to be expected given the well-established inter-individual differences in the mapping of function to anatomy, which are especially pronounced in the association cortex (e.g., Frost & Goebel, […]).

We constructed a stimulus-response matrix for each of the two experiments by i) averaging the BOLD responses to each sentence in each experiment across the three repetitions, resulting in one data point per sentence per language-responsive voxel of each participant, selected as described above (13,553 voxels total across the 10 participants; 1,355 on average, ±6 std. dev.), and ii) concatenating all sentences (384 in Experiment 2 and 243 in Experiment 3), yielding a 384x12,195 matrix for […]
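To make the matrix construction concrete, here is a minimal sketch of the repetition-averaging and localizer-based voxel selection just described. The function name, input array names, and shapes are our assumptions for illustration; the actual pipeline used MATLAB:

```python
import numpy as np

def stimulus_response_matrix(betas, t_values, n_top=None):
    """Build a (sentences x voxels) matrix from per-repetition GLM estimates.

    betas:    (n_sentences, n_repetitions, n_voxels) response estimates
    t_values: (n_voxels,) t-statistics for the sentences > nonwords localizer contrast
    n_top:    keep only this many of the most localizer-responsive voxels
    """
    mean_resp = betas.mean(axis=1)                 # average across the 3 repetitions
    if n_top is not None:
        top = np.argsort(t_values)[::-1][:n_top]   # voxels with the highest t-values
        mean_resp = mean_resp[:, top]
    return mean_resp

# Per-participant matrices are then concatenated along the voxel axis
# (e.g., 384 sentences x all language-responsive voxels pooled over participants).
```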