The neural architecture of language: Integrative modeling converges on predictive processing

The neuroscience of perception has recently been revolutionized by an integrative modeling approach in which computation, brain function, and behavior are linked across many datasets and many computational models. By revealing trends across models, this approach yields novel insights into cognitive and neural mechanisms in the target domain. Here we present a first systematic study taking this approach to higher-level cognition: human language processing, our species' signature cognitive skill. We find that the most powerful 'transformer' models predict nearly 100% of explainable variance in neural responses to sentences and generalize across different datasets and imaging modalities (fMRI, ECoG). Models' neural fits ('brain score') and fits to behavioral responses are both strongly correlated with model accuracy on the next-word prediction task (but not other language tasks). Model architecture also appears to contribute substantially to neural fit. These results provide computationally explicit evidence that predictive processing fundamentally shapes the language comprehension mechanisms in the human brain.

Significance

Language is a quintessentially human ability. Research has long probed the functional architecture of language processing in the mind and brain using diverse brain imaging, behavioral, and computational modeling approaches. However, adequate neurally mechanistic accounts of how meaning might be extracted from language are sorely lacking. Here, we report an important first step toward addressing this gap by connecting recent artificial neural networks from machine learning to human recordings during language processing. We find that the most powerful models predict neural and behavioral responses across different datasets up to noise levels.
Models that perform better at predicting the next word in a sequence also better predict brain measurements – providing computationally explicit evidence that predictive processing fundamentally shapes the language comprehension mechanisms in the human brain.

A core goal of neuroscience is to decipher from patterns of neural activity the algorithms underlying our abilities to perceive, think, and act. Recently, a new "reverse engineering" approach to computational modeling in systems neuroscience has transformed our algorithmic understanding of the primate ventral visual stream and holds great promise for other aspects of brain function. This approach has been enabled by a breakthrough in artificial intelligence (AI): the engineering of artificial neural network (ANN) systems that perform core perceptual tasks with unprecedented accuracy, approaching human levels, and that do so using computational machinery that is abstractly similar to biological neurons. In the ventral visual stream, the key AI developments come from deep convolutional neural networks (DCNNs) that perform visual object recognition from natural images. Although they abstract away aspects of biology, DCNNs provide the basis for a first complete hypothesis of how the brain extracts object percepts from visual input.

Inspired by this success story, analogous ANN models have now been applied to other domains of perception (Kell et al., 2018; Zhuang et al., 2017). Could these models also let us reverse-engineer the brain mechanisms of higher-level human cognition? Here we show for the first time how the modeling approach pioneered in the ventral stream can be applied to a higher-level cognitive domain that plays an essential role in human life: language comprehension, or the extraction of meaning from spoken, written, or signed words and sentences. Cognitive scientists have long treated neural network models of language processing with skepticism (Marcus, 2018; Pinker & Prince, 1988), given that these systems lack (and often deliberately attempt to do without) explicit symbolic representations – traditionally seen as a core feature of linguistic meaning.
Recent ANN models of language, however, have proven capable of at least approximating some aspects of symbolic computation, and have achieved remarkable success on a wide range of applied natural language processing (NLP) tasks. The results presented here, based on this new generation of ANNs, suggest that a computationally adequate model of language processing in the brain may be closer than previously thought.

Because we build on the same logic in our analysis of language in the brain, it is helpful to review why the neural network-based integrative modeling approach has proven so powerful in the study of object recognition in the ventral stream. Crucially, our ability to robustly link computation, brain function, and behavior is supported not by testing a single model on a single dataset or a single kind of data, but by large-scale integrative benchmarking (Schrimpf et al., 2020) that establishes consistent patterns of performance across many different ANNs applied to multiple neural and behavioral datasets, together with their performance on the proposed core computational function of the brain system under study. Given the complexities of the brain's structure and the functions it performs, any one of these models is surely oversimplified and ultimately wrong – at best, an approximation of some aspects of what the brain does. But some models are less wrong than others, and consistent trends in performance across models can reveal not just which model best fits the brain, but which properties of a model underlie its fit to the brain, thus yielding critical insights that transcend what any single model can tell us.
In the ventral stream specifically, our understanding that computations underlying object recognition are analogous to the structure and function of DCNNs is supported by findings that, across hundreds of model variants, DCNNs that perform better on object recognition tasks also better capture human recognition behavior and neural responses in IT cortex of both human and non-human primates (Rajalingham et al., 2018; Schrimpf et al., 2018, 2020; Yamins et al., 2014). This integrative benchmarking reveals a rich pattern of correlations among three classes of performance measures – (i) neural variance explained, in IT neurophysiology or fMRI responses (brain scores), (ii) accuracy in predicting hits and misses in human object recognition behavior, or human object similarity judgments (behavioral scores), and (iii) accuracy on the core object recognition task (computational task score) – such that for any individual DCNN model we can predict how well it would score on each of these measures from the other measures. This pattern of results was not assembled in a single paper but in multiple papers across several labs and several years. Taken together, they provide strong evidence that the ventral stream supports primate object recognition through something like a deep convolutional feature hierarchy, the exact details of which are being modeled with ever-increasing precision.

Here we describe an analogous pattern of results for ANN models of human language, establishing a link between language models, including transformer-based ANN architectures that have revolutionized natural language processing in AI systems over the last three years, and fundamental computations of human language processing as reflected in both neural and behavioral measures.
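The integrative-benchmarking logic above can be made concrete with a toy simulation: if a single latent "model quality" factor drives task performance, neural fit, and behavioral fit, then across enough models each measure predicts the others. All numbers below are synthetic illustrations, not values from any study; a minimal sketch in Python:

```python
import numpy as np

rng = np.random.default_rng(0)
n_models = 20

# Hypothetical per-model scores: a shared latent "model quality" plus noise
# induces the pattern of pairwise correlations described in the text.
quality = rng.uniform(0.0, 1.0, n_models)
task_score = quality + 0.10 * rng.normal(size=n_models)      # e.g., recognition accuracy
brain_score = quality + 0.15 * rng.normal(size=n_models)     # e.g., neural predictivity
behavior_score = quality + 0.15 * rng.normal(size=n_models)  # e.g., behavioral fit

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]

# With enough models, any one measure predicts the others.
print(pearson(task_score, brain_score))
print(pearson(task_score, behavior_score))
print(pearson(brain_score, behavior_score))
```

The point of the sketch is simply that such cross-metric correlations are the signature one looks for across large model collections, rather than the fit of any single model.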
Language processing is known to depend causally on a left-lateralized fronto-temporal brain network.

Figure 1: Comparing Artificial Neural Network models of language processing to human language processing. We tested how well different models predict measurements of human neural activity (fMRI and ECoG) and behavior (reading times) during language comprehension. The candidate models ranged from simple embedding models to more complex recurrent and transformer networks. Stimuli ranged from sentences to passages to stories and were 1) fed into the models, and 2) presented to human participants (visually or auditorily). Models' internal representations were evaluated on three major dimensions: their ability to predict human neural representations (brain score, extracted from within the fronto-temporal language network (e.g., Fedorenko et al., 2010; the network topography is schematically illustrated in red on the template brain above)); their ability to predict human behavior in the form of reading times (behavioral score); and their ability to perform computational tasks such as next-word prediction (computational task score). Consistent relationships between these measures across many different models reveal insights beyond what a single model can tell us.

However, no prior studies have attempted the kind of large-scale integrative benchmarking that has proven so valuable in understanding key brain-behavior-computation relationships in the ventral stream; instead, they have typically tested one or a small number of models against a single dataset, and the same models have not been evaluated on all three metrics of neural, behavioral, and objective task performance. Previously tested models have also left much of the variance in human neural/behavioral data unexplained.
Finally, until the rise of recent ANNs (e.g., transformer architectures), language models did not have sufficient capacity to solve the full linguistic problem that the brain solves – to form a representation of sentence meaning capable of performing a broad range of real-world language tasks on diverse natural linguistic input. We are thus left with a collection of suggestive results but no clear sense of how close ANN models are to fully explaining language processing in the brain, or what model features are key in enabling models to explain neural and behavioral data.

Our goal here is to present a first systematic integrative modeling study of language in the brain, at the scale necessary to discover robust relationships between neural and behavioral measurements from humans and the performance of models on language tasks. We seek to determine not just which model fits the empirical data best, but which dimensions of variation across models are correlated with fit to human data. This approach has not been applied in the study of language or any other higher cognitive system, and even in perception it has not been attempted within a single integrated study. Thus, we view our work more generally as a template for how to apply the integrative benchmarking approach to any perceptual or cognitive system.

Specifically, we examined the relationships between 43 diverse state-of-the-art ANN language models (henceforth 'models') across three neural language comprehension datasets (two fMRI, one electrocorticography (ECoG)), as well as behavioral signatures of human language processing in the form of self-paced reading times, and a range of linguistic functions assessed via standard engineering tasks from NLP.
The models spanned all major classes of existing ANN language approaches and included simple embedding models (e.g., GloVe (Pennington et al., 2014)), more complex recurrent neural networks (e.g., LM1B (Jozefowicz et al., 2016)), and many variants of transformers or attention-based architectures – including both 'unidirectional-attention' models (trained to predict the next word given the previous words; e.g., GPT (Radford et al., 2019)) and 'bidirectional-attention' models (trained to predict a missing word given the surrounding context; e.g., BERT (Devlin et al., 2018)).

Our integrative approach yielded four major findings. (1) Models' relative fit to neural data (neural predictivity or 'brain score') – estimated on held-out test data – generalizes across different datasets and imaging modalities (fMRI, ECoG), and certain architectural features consistently lead to more brain-like models: transformer-based models perform better than recurrent networks or word-level embedding models, and larger-capacity models perform better than smaller models.
(2) The best models explain nearly 100% of the explainable variance (up to the noise ceiling) in neural responses to sentences. This result stands in stark contrast to earlier generations of models, which have typically accounted for at most 30-50% of the predictable neural signal.
(3) Across models, significant correlations hold among all three metrics of model performance: brain scores (fit to fMRI and ECoG data), behavioral scores (fit to reading times), and model accuracy on the next-word prediction task. Importantly, no other linguistic task was predictive of models' fit to neural or behavioral data. These findings provide strong evidence for a classic hypothesis about the computations underlying human language understanding: that the brain's language system is optimized for predictive processing in the service of meaning extraction. (4) Intriguingly, the scores of models initialized with random weights (prior to training, but with a trained linear readout) are well above chance and correlate with trained model scores, which suggests that network architecture is an important contributor to a model's brain score. In particular, one architecture introduced just in 2019, the generative pre-trained transformer (GPT-2), consistently outperforms all other models and explains almost all variance in both fMRI and ECoG data from sentence processing tasks. GPT-2 is also arguably the most cognitively plausible of the transformer models (because it uses unidirectional, forward attention), and performs best overall as an AI system when considering both natural language understanding and natural language generation tasks. Thus, even though the goal of contemporary AI is to improve model performance and not necessarily to build models of brain processing, this endeavor appears to be rapidly converging on architectures that might capture key aspects of language processing in the human mind and brain.

Results
We evaluated a broad range of state-of-the-art ANN language models on the match of their internal representations to three human neural datasets. The models spanned all major classes of existing language models (Methods_5). The primary dataset was recorded extensively within each participant and quantitatively exhibits the highest internal reliability (Fig. S1). Because our research questions concern language processing, we extracted neural responses from language-selective voxels or electrodes that were functionally identified by an extensively validated independent 'localizer' task that contrasts reading sentences versus nonword sequences (Fedorenko et al., 2010). This localizer robustly identifies the fronto-temporal language-selective network (Methods_1-3).

To compare a given model to a given dataset, we presented the same stimuli to the model that were presented to humans in the neural recording experiments and 'recorded' the model's internal activations (Methods_5-6, Fig. 1). We then tested how well the model recordings could predict the neural recordings for the same stimuli, using a method originally developed for studying visual object recognition. The brain score of a model on a given dataset is thus Pearson's correlation between model predictions and neural recordings, divided by the estimated ceiling and averaged across voxels/electrodes/regions and participants. We report the score for the best-performing layer of each model (Methods_6, Fig. S12) but controlled for the generality of the layer choice in a train/test split (Fig. S2b, c).

Specific models accurately predict human brain activity. We found (Fig. 2a-b) that specific models predict the Pereira2018 and Fedorenko2016 datasets with up to 100% predictivity relative to the noise ceiling (Methods_7, Fig. S1). These scores generalize to another metric, "RDM", based on representational similarity without any fitting (Fig. S2a). The Blank2014 dataset is also reliably predicted, but with lower predictivity.
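The predictivity ('brain score') computation just described can be sketched as cross-validated linear regression from model activations to neural responses, with held-out Pearson correlations normalized by a noise ceiling. The data below are synthetic, and the regression details (plain least squares, five folds, an assumed ceiling of 0.8) are simplifying assumptions, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(1)
n_stim, n_feat, n_vox = 200, 50, 30

X = rng.normal(size=(n_stim, n_feat))               # model activations per stimulus
W = rng.normal(size=(n_feat, n_vox))                # hidden stimulus-to-voxel mapping
Y = X @ W + 8.0 * rng.normal(size=(n_stim, n_vox))  # synthetic noisy voxel responses

def brain_score(X, Y, ceiling, n_folds=5):
    """Cross-validated Pearson r per voxel, averaged, then divided by the ceiling."""
    order = rng.permutation(len(X))
    rs = []
    for test in np.array_split(order, n_folds):
        train = np.setdiff1d(np.arange(len(X)), test)
        B, *_ = np.linalg.lstsq(X[train], Y[train], rcond=None)  # linear mapping
        pred = X[test] @ B
        rs += [np.corrcoef(pred[:, v], Y[test][:, v])[0, 1] for v in range(Y.shape[1])]
    return float(np.mean(rs)) / ceiling

score = brain_score(X, Y, ceiling=0.8)  # 0.8 is an assumed noise ceiling
print(score)
```

In the actual analyses, the ceiling is estimated from inter-participant reliability rather than assumed, and scores are additionally averaged across participants.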
Models vary substantially in their ability to predict neural data. Generally, embedding models such as GloVe do not perform well on any dataset. In contrast, recurrent networks such as skip-thoughts, as well as transformers such as BERT, predict large portions of the data. The model that predicts the human data best across datasets is GPT2-xl, a unidirectional-attention transformer model, which predicts Pereira2018 and Fedorenko2016 at close to 100% of the noise ceiling and is among the highest-performing models on Blank2014, with 32% normalized predictivity. These scores are higher in the language network than in other parts of the brain (SI-4). Intermediate-layer representations in the models are most predictive, significantly outperforming representations at the first and output layers (Figs. 2c, S13).

Model scores are consistent across experiments/datasets. To test the generality of the model representations, we examined the consistency of model brain scores across datasets. Indeed, if a model achieves a high brain score on one dataset, it tends to also do well on other datasets (Fig. 2d), ruling out the possibility that we are picking up on spurious, dataset-idiosyncratic predictivity, and suggesting that the models' internal representations are general enough to capture brain responses to diverse linguistic materials, presented visually or auditorily, and across three independent sets of participants. Specifically, model brain scores across the two experiments in Pereira2018 (overlapping sets of participants) correlate at r=.94 (Pearson here and elsewhere; p<<.00001), scores from Pereira2018 and Fedorenko2016 correlate at r=.50 (p<.001), and scores from Pereira2018 and Blank2014 correlate at r=.63 (p<.0001).

Next-word-prediction task performance selectively predicts brain scores.
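The fitting-free "RDM" comparison mentioned above can be sketched as correlating the upper triangles of stimulus-by-stimulus dissimilarity matrices computed separately from model and brain responses. The synthetic data and the dissimilarity measure (1 - Pearson r) below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n_stim = 60

latent = rng.normal(size=(n_stim, 10))            # shared stimulus structure
model_feats = latent @ rng.normal(size=(10, 40))  # "model" representation
brain_resps = (latent @ rng.normal(size=(10, 25))
               + 0.5 * rng.normal(size=(n_stim, 25)))  # noisy "brain" responses

def rdm(acts):
    """Representational dissimilarity matrix: 1 - Pearson r between stimulus patterns."""
    return 1.0 - np.corrcoef(acts)

def rdm_similarity(a, b):
    """Correlate the upper triangles of the two RDMs; no regression is fit."""
    iu = np.triu_indices(len(a), k=1)
    return np.corrcoef(rdm(a)[iu], rdm(b)[iu])[0, 1]

sim = rdm_similarity(model_feats, brain_resps)
print(sim)
```

Because no mapping is fitted, agreement on this metric cannot be attributed to a flexible regression, which is why it serves as a useful control for the predictivity scores.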
In the critical test of which computations might underlie human language understanding, we examined the relationship between the models' ability to predict an upcoming word and their brain scores. Words from the WikiText-2 dataset (Merity et al., 2016) were sequentially fed into the candidate models. We then fit a linear classifier (over the words in the vocabulary; n=50k) from the last layer's feature representation (frozen, i.e., no finetuning) on the training set to predict the next word, and evaluated performance on the held-out test set (Methods_8). Indeed, next-word-prediction task performance robustly predicts brain scores (Fig. 3a; r=.44, p<.01, averaged across datasets). The best language model, GPT2-xl, also achieves the highest brain score (see previous section). This relationship holds for model variants within each model class – embedding models, recurrent networks, and transformers – ruling out the possibility that this correlation is due to between-class differences in next-word-prediction performance.

To test whether next-word prediction is special in this respect, we asked whether model performance on any language task correlates with brain scores. As with next-word prediction, we kept the model weights fixed and only trained a linear readout. We found that performance on tasks from the GLUE benchmark collection (Wang et al., 2018) – including grammaticality judgments, sentence similarity judgments, and entailment – does not predict brain scores (Fig. 3b-c). The difference in the strength of correlation between brain scores and next-word-prediction task performance vs. the GLUE tasks performance is highly reliable (p<<0.00001, t-test over 1,000 bootstraps of scores and corresponding correlations; Methods_9).

Figure 2: We compared 43 computational models of language processing (ranging from embedding to recurrent and bi- and uni-directional transformer models) in their ability to predict human brain data. The neural datasets include: fMRI voxel responses to visually presented (sentence-by-sentence) passages (Pereira2018), ECoG electrode responses to visually presented (word-by-word) sentences (Fedorenko2016), and fMRI region-of-interest (ROI) responses to auditorily presented ~5-min-long stories (Blank2014). For each model, we plot the normalized predictivity ('brain score'), i.e., the fraction of the ceiling (gray line; Methods_7, Fig. S1) that the model can predict. Ceiling levels are .32 (Pereira2018), .17 (Fedorenko2016), and .20 (Blank2014). Model classes are grouped by color (Methods_5, Table S10). Error bars (here and elsewhere) represent median absolute deviation over subject scores. (b) Normalized predictivity of GloVe (a low-performing embedding model) and GPT2-xl (a high-performing transformer model) in the language-responsive voxels in the left hemisphere of two representative participants from Pereira2018 (also Fig. S3). (c) Brain score per layer in GPT2-xl. Middle-to-late layers generally yield the highest scores for Pereira2018 and Blank2014, whereas earlier layers better predict Fedorenko2016. This difference might be due to predicting individual word representations (within a sentence) in Fedorenko2016, as opposed to whole-sentence representations in Pereira2018. (d) To test how well model brain scores generalize across datasets, we correlated i) two experiments with different stimuli (and some participant overlap) in Pereira2018 (obtaining a very strong correlation), and ii) Pereira2018 brain scores with the scores for each of Fedorenko2016 and Blank2014 (obtaining lower but still highly significant correlations). Brain scores thus tend to generalize across datasets, although differences between datasets exist that warrant the full suite of datasets.
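The frozen-features evaluation described above (fixed model representations, only a linear readout trained to predict the next word, scored by held-out perplexity) can be sketched on a toy corpus. The "model" here is just a random, frozen embedding of the previous token, and all sizes and hyperparameters are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
vocab, dim = 20, 32

# Toy token stream with local structure: each token is one of three equally
# likely successors of the previous token (optimal perplexity is therefore ~3).
corpus = np.zeros(5000, dtype=int)
for i in range(1, len(corpus)):
    corpus[i] = (corpus[i - 1] + rng.integers(0, 3)) % vocab

emb = rng.normal(size=(vocab, dim))  # frozen "model" features (no finetuning)
X, y = emb[corpus[:-1]], corpus[1:]  # previous-token features -> next token
Xtr, ytr, Xte, yte = X[:4000], y[:4000], X[4000:], y[4000:]

# Train ONLY a linear softmax readout on top of the frozen features.
W = np.zeros((dim, vocab))
for _ in range(800):
    logits = Xtr @ W
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(ytr)), ytr] -= 1.0  # gradient of mean cross-entropy
    W -= 0.3 * Xtr.T @ p / len(ytr)

# Held-out perplexity: exp(mean negative log-likelihood of the true next token).
logits = Xte @ W
logits -= logits.max(axis=1, keepdims=True)
logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
perplexity = float(np.exp(-logp[np.arange(len(yte)), yte].mean()))
print(perplexity)  # well below the uniform baseline of 20
```

Training only the readout is what keeps the comparison fair across models: differences in perplexity then reflect the quality of the frozen representations, not task-specific finetuning.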
This result suggests that optimizing for predictive representations may be a critical shared objective of biological and artificial neural networks for language, and perhaps more generally.

Brain scores and next-word-prediction task performance correlate with behavioral scores. Beyond internal neural representations, we tested the models' ability to predict external behavioral outputs because, ultimately, in integrative benchmarking, we strive for a computationally precise account of language processing that can explain both neural response patterns and observable linguistic behaviors. We chose a large corpus (n=180 participants) of self-paced reading times.

Specific models accurately predict reading times. We regressed each model's last layer's feature representation (i.e., the one closest to the output) against reading times and evaluated predictivity on held-out words. As with the neural datasets, we observed a spread of model ability to capture human behavioral data, with models such as GPT2-xl and ALBERT-xxlarge predicting these data close to the noise ceiling (Fig. 4a; see also Merkx & Frank, 2020; Wilcox et al., 2020).

Brain scores correlate with behavioral scores. To test whether models with the highest brain scores also predict reading times best, we compared models' neural predictivity (across datasets) with those same models' behavioral predictivity. Indeed, we observed a strong correlation (Fig. 4b; r=.65, p<<.0001), which also holds for the individual neural datasets (inset and Fig. S6). These results suggest that further improving models' neural predictivity will simultaneously improve their behavioral predictivity.

Next-word-prediction task performance correlates with behavioral scores. Next-word-prediction task performance is predictive of reading times (Fig. 4c), thus connecting all three measures of performance: brain scores, behavioral scores, and task performance on next-word prediction.

Figure 3: Model performance on a next-word-prediction task selectively predicts brain scores. (a) Next-word-prediction task performance was evaluated as the surprisal between the predicted and true next word in the WikiText-2 dataset of 720 Wikipedia articles, or perplexity (x-axis, lower is better; training only a linear readout leads to worse perplexity values than canonical fine-tuning, see Methods_8). Next-word-prediction task scores strongly predict brain scores across datasets (inset: this correlation is significant for two individual datasets, Pereira2018 and Blank2014; the correlation for Fedorenko2016 is positive but not significant). (b) Performance on diverse language tasks from the GLUE benchmark collection does not correlate with overall or individual-dataset brain scores (inset; SI-5; training only a linear readout). (c) Correlations of individual tasks with brain scores. Only improvements on next-word prediction lead to improved neural predictivity. Performance on other language tasks (from the GLUE benchmark collection) does not correlate with behavioral scores (Fig. S7).

Model architecture contributes to the model-to-brain relationship. The brain's language network plausibly arises through a combination of evolutionary and learning-based optimization. In a first attempt to test the relative importance of the models' intrinsic architectural properties vs. training-related features, we performed two analyses. First, we found that architectural features (e.g., number of layers), but none of the features related to training (e.g., dataset and vocabulary size), significantly predicted improvements in model performance on the neural data (S10, Table S11).
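The behavioral-score computation described above (regressing last-layer features against self-paced reading times and evaluating on held-out words) can be sketched with synthetic data; the linear-regression details are assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n_words, n_feat = 500, 40

feats = rng.normal(size=(n_words, n_feat))  # last-layer features per word
beta = rng.normal(size=n_feat)              # hidden feature-to-RT mapping
reading_ms = 300 + feats @ beta + 5.0 * rng.normal(size=n_words)  # synthetic RTs

train = np.arange(n_words) < 400            # simple held-out split
design = np.c_[np.ones(n_words), feats]     # intercept + features
B, *_ = np.linalg.lstsq(design[train], reading_ms[train], rcond=None)
pred = design[~train] @ B

behavioral_score = float(np.corrcoef(pred, reading_ms[~train])[0, 1])
print(behavioral_score)  # held-out predictivity of the reading times
```

As with the neural metric, evaluation on held-out words is what prevents the regression's flexibility from inflating the score.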
These results align with prior studies reporting that architectural differences affect model performance on normative tasks like next-word prediction after training, and define the representational space that the model can learn. Second, even with random weights (and a trained linear readout), the GPT2-xl architecture achieves a neural predictivity of ~51%, only ~20% lower than the trained network. A similar trend is observed across models: training generally improves brain scores, on average by 53%. Across models, the untrained scores are strongly predictive of the trained scores (r=.74, p<<.00001), indicating that models that already perform well with random weights improve further with training.

To ensure the robustness and generalizability of the results for untrained models, and to gain further insights into these results, we performed four additional analyses (Fig. S9). First, we tested a random context-independent embedding with equal dimensionality to the GPT2-xl model but no architectural priors and found that it predicts only a small fraction of the neural data, on average below 15%, suggesting that a large feature space alone is not sufficient (Fig. S9a).

Figure 5: Model architecture contributes to the model-brain relationship. We evaluate untrained models by keeping weights at their initial random values. The remaining representations are driven by architecture alone and are tested on the neural datasets (Fig. 2). Across the three datasets, architecture alone yields representations that predict human brain activity considerably well. On average, training improves model scores by 53%. For Pereira2018, training improves predictivity the most, whereas for Fedorenko2016 and Blank2014, training does not always change – and for some models even decreases – neural scores (Fig. S8). The untrained model performance is consistently predictive of its performance after training, across and within (inset) datasets.

Second, to ensure that the overlap between the linguistic materials (words, bigrams, etc.)
used in the train and test splits is not driving the results, we quantified the overlap and found it to be low, especially for bi- and tri-grams (Fig. S9b). Third, to ensure that the linear regression used in the predictivity metric did not artificially inflate the scores of untrained models, we used an alternative metric – "RDM" – that does not involve any fitting. Scores of untrained models on the predictivity metric generalized to scores on the RDM metric (Fig. S9d). Finally, we examined the performance of untrained models with a trained linear readout on the next-word prediction task and found performance trends similar to those we observed for the neural scores (Fig. S9c).

Discussion

Our results, summarized in Fig. 6, show that specific ANN language models can predict human neural and behavioral responses to linguistic input with high accuracy: the best models achieve, on some datasets, perfect predictivity relative to the noise ceiling. Model scores correlate across neural and behavioral datasets spanning recording modalities (fMRI, ECoG, reading times) and diverse materials, presented visually and auditorily, across three sets of participants, establishing the robustness and generality of these findings. Critically, both neural and behavioral scores correlate with model performance on the normative next-word prediction task – but not other language tasks. Finally, untrained models with random weights (and a trained linear readout) produce representations that begin to approximate those in the brain's language network.

Predictive language processing. Underlying the integrative modeling framework, implemented here in the cognitive domain of language, is the idea that large-scale neural networks can serve as hypotheses of the actual computations conducted in the brain.
We here identified some models – unidirectional-attention transformer architectures – that accurately capture brain activity during language processing. We then began dissecting variations across the range of model candidates to explain why they achieve high brain scores. Two core findings emerged, both supporting the idea that the human language system is optimized for predictive processing. First, we found that the models' performance on the next-word prediction task, but not on other language tasks, is correlated with neural predictivity (see Gauthier & Levy, 2019, for related evidence that fine-tuning one model on tasks other than next-word prediction leads to worse model-to-brain fit). Next-word prediction is simple, unsupervised, scalable, and appears to produce the most generally useful, successful language representations. This is likely because language modeling encourages a neural network to build a joint probability model of the linguistic signal, which implicitly requires sensitivity to diverse kinds of regularities in the signal.

Figure 6 (Overview of results): Connecting neural mechanisms, behavior, and computational task (next-word prediction). Specific ANN language models are beginning to approximate the brain's mechanisms for processing language (middle gray box). For the neural datasets (fMRI and ECoG recordings; top, red) and for the behavioral dataset (self-paced reading times; bottom right, orange), we report i) the value for the model achieving the highest predictivity, and ii) the average improvement in brain scores across models after training. Model performance on the next-word-prediction task (WikiText-2 language modeling perplexity; bottom left, blue) predicts brain and behavioral scores; and brain scores predict behavioral scores (circled numbers).
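The "joint probability model" point can be made concrete: by the chain rule, a language model's next-word predictions multiply together into a probability for the whole word sequence, so better next-word prediction means a better model of the entire signal. A minimal bigram sketch with add-one smoothing (toy corpus, not from the paper):

```python
import math
from collections import Counter

corpus = "the dog chased the cat and the cat chased the dog".split()
vocab = sorted(set(corpus))

# Bigram counts -> conditional next-word distributions p(w_i | w_{i-1}),
# with add-one smoothing so unseen pairs keep nonzero probability.
bigrams = Counter(zip(corpus[:-1], corpus[1:]))
contexts = Counter(corpus[:-1])

def p_next(prev, word):
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

def log_prob(sentence):
    """Chain rule: log p(w_1..w_n) = sum_i log p(w_i | w_{i-1})."""
    words = sentence.split()
    return sum(math.log(p_next(a, b)) for a, b in zip(words, words[1:]))

# A sequence matching the corpus statistics scores higher than a shuffled one.
print(log_prob("the dog chased the cat"))
print(log_prob("cat the the chased dog"))
```

A model trained only to predict the next word thus implicitly assigns probabilities to entire sequences, which is one way to see why the objective forces sensitivity to many kinds of linguistic regularities at once.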

Second, we found that the models that best match human language processing are precisely those that are trained to predict the next word. Predictive processing has advanced to the forefront of theorizing in cognitive science. One possibility is therefore that both the human language system and successful ANN models of language are optimized to predict upcoming words in the service of efficient meaning extraction.

Going beyond the broad idea of prediction in language, the work presented here validates, refines, and computationally implements an explicit account of predictive processing: for the first time in the neuroscience of language, we were able to accurately predict (relative to the noise ceiling) activity across voxels as well as neuronal populations in human cortex during the processing of sentences. We quantitatively test the predictive processing hypothesis at the level of voxel/electrode/fROI responses and, through the use of end-to-end models, relate neural mechanisms to the performance of models on computational tasks. Moreover, we were able to reject multiple alternative hypotheses about the objective of the language system: performance on other language tasks, including judgments about syntactic and semantic properties of sentences, was not predictive of brain or behavioral scores. The best-performing computational models identified in this work serve as computational explanations for the entire language processing pipeline, from word inputs to neural mechanisms to behavioral outputs. These best-performing models can now be further dissected, as well as tested on new, diverse linguistic inputs in future experiments, as discussed below.

Importance of architecture.
We also found that architecture is an important contributor to the models' match to human brain data: untrained models with a trained linear readout performed well above chance in predicting neural activity, and this finding held under a series of controls to alleviate concerns that it could be an artifact of our training or testing methodologies (Fig. S9). This result is consistent with findings in models of early perceptual processing. One explanation is that the development of language-model architectures by the NLP community resembles an evolutionary process, or, more accurately, selective breeding with genetic modification: structural changes are tested and the best-performing ones are incorporated into the next generation of models. Importantly, this process still optimizes for language modeling, only implicitly and on a different timescale from the biological and cultural evolutionary mechanisms conventionally studied in brain and language research.

More explicitly, but speculatively, it is possible that transformer networks can work as brain models of language even without extensive training because the hierarchies of local spatial filtering and pooling found in convolutional as well as attention-based networks are a generally applicable, brain-like mechanism for extracting abstract features from natural signals, perhaps providing broad coverage of representations even before training. Relatedly, an idea during early work on perceptrons was to randomly project input data into high-dimensional spaces and to then train only thin readouts on top of these projections. This was motivated by Cover's theorem, which states that non-linearly separable data can likely be linearly separated after projection into a high-dimensional space (Cover, 1965). These ideas have been successfully applied to kernel machines (Rahimi & Recht, 2009) and are more recently being explored again with deep neural networks (Frankle et al., 2019); in short, it is possible that even random features with the right multiscale structure in time and space could be more powerful for representing human language than is currently understood.
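As a toy illustration of this Cover's-theorem intuition (entirely hypothetical data, unrelated to our experiments), a linear readout that fails on non-linearly separable 2-D inputs succeeds once those inputs are passed through an untrained, random nonlinear projection:

```python
# Sketch: a random (untrained) nonlinear projection makes two
# concentric rings -- inseparable by any linear readout in 2-D --
# linearly separable in a higher-dimensional feature space.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: inner ring (class 0, radius 1) vs. outer ring (class 1, radius 3).
n = 200
theta = rng.uniform(0, 2 * np.pi, n)
radius = np.where(np.arange(n) < n // 2, 1.0, 3.0)
X = np.c_[radius * np.cos(theta), radius * np.sin(theta)]
y = (np.arange(n) >= n // 2).astype(float)

def readout_accuracy(features, labels, ridge=1e-3):
    """Fit a least-squares linear readout (with bias) and return accuracy."""
    A = np.c_[features, np.ones(len(features))]
    w = np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]), A.T @ labels)
    pred = (A @ w) > 0.5
    return (pred == labels.astype(bool)).mean()

# Linear readout on the raw 2-D inputs: near chance for concentric rings.
acc_raw = readout_accuracy(X, y)

# Random projection into 100 dimensions + ReLU nonlinearity, no training.
W = rng.normal(size=(2, 100))
b = rng.normal(size=100)
H = np.maximum(X @ W + b, 0.0)
acc_random = readout_accuracy(H, y)

print(f"raw 2-D features:       {acc_raw:.2f}")
print(f"random 100-d features:  {acc_random:.2f}")
```

The only trained component is the thin linear readout; the projection itself is frozen at its random initialization, mirroring the untrained-network analyses reported above.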
Finally, it is worth noting that the initial weights of the networks we study stem from weight-initializer distributions that were chosen to provide solid starting points for contemporary architectures, yielding reasonable initial representations that model training further refines. These initial representations could thus already include some important aspects of language structure. A concrete test of these ideas would be the following: construct model variants that average over word embeddings at different scales, and compare these models' representations with those of different layers of untrained transformer architectures as well as with the neural datasets. More detailed analyses, including minimal-pair model-variant comparisons, will be needed to fully separate the representational contributions of architecture and training.

Limitations and future directions.
These discoveries pave the way for many exciting future directions. The most brain-like language models can now be investigated in richer detail, ideally leading to intuitive theories of their inner workings. Such research is much easier to perform on models than on biological systems, given that all of a model's structure and weights are easily accessible and manipulable. Here, we worked with off-the-shelf models and compared their match to neural data based on their performance on the next-word-prediction task vs. other tasks. Re-training many models on many tasks from scratch might determine which features are most important for brain predictivity, but this is currently prohibitively expensive due to the vast space of hyper-parameters. Further, the fact that language modeling is inherently built into the evolution of language models by the NLP community, as noted above, may make it impossible to fully eliminate its influence on the architecture, even for models trained from scratch on other tasks.
Similarly, here, we leveraged existing neural datasets. This work can be expanded in many new directions, including a) assembling a wider range of publicly available language datasets for model testing (e.g., naturalistic dialogs/conversations); b) modeling the fine-grained temporal trajectories of neural responses to language in data with high temporal resolution (which requires computational accounts that make predictions about representational dynamics); and c) querying models on the sentence stimuli that elicit the strongest responses in the language network to generate hypotheses about the critical response-driving features/feature spaces, and perhaps to discover new organizing principles of the language system (cf. Blank2014, which uses story materials and may therefore require long-range contexts).

Another key missing piece in the mechanistic modeling of human language processing is a more detailed mapping from model components onto brain anatomy. In particular, aside from the general targeting of the fronto-temporal language network, it is unclear which parts of a model map onto which components of the brain's language processing mechanisms. The best-performing models of human language processing identified in this work might also serve to uncover such anatomical distinctions for the brain's language network: perhaps, akin to vision, groups of layers relate to different cortical regions, and uncovering increased similarity to neural activity of one group over others could help establish a cortical hierarchy. The brain network that supports higher-level linguistic interpretation, which we focus on here, is extensive and plausibly contains meaningful functional dissociations, but how the network is precisely subdivided, and what respective roles its different components play, remains debated.
Uncovering the internal structure of the human language network, for which intracranial recording approaches with high spatial and temporal resolution may prove critical (Mukamel & Fried, 2012; Parvizi & Kastner, 2018), would allow us to guide and constrain models of tissue-mapped mechanistic language processing. More precise brain-to-model mappings would also allow us to test the effects of perturbations on models and compare them against perturbation effects in humans, as assessed with lesion studies or reversible stimulation. More broadly, anatomically and functionally precise models are a required software component of any form of brain-machine interface.

Conclusions.
Taken together, our findings suggest that predictive artificial neural networks serve as viable hypotheses for how predictive language processing is implemented in human neural tissue. They lay a critical foundation for a promising research program synergizing high-performing mechanistic models of natural language processing with large-scale neural and behavioral measurements of human language comprehension in a virtuous cycle of integrative modeling: testing model ability to predict neural and behavioral measurements, dissecting the best-performing models to understand which components are critical for high brain predictivity, developing better models leveraging this knowledge, and collecting new data to challenge and constrain future generations of neurally plausible models of language processing.

2. Neural dataset: ECoG (Fedorenko2016). Preprocessing steps included: iii) spatially distributed noise common to all electrodes was removed using a common average reference spatial filter between electrodes with line noise smaller than a predefined threshold (electrodes connected to the same amplifier); and iv) a set of notch filters was used to remove the 60 Hz line noise and its harmonics.

Functional localization: Mirroring the fMRI approach, where we focused on language-responsive voxels, data analyses were performed on signals extracted from language-responsive electrodes. These electrodes were defined in each participant using the same localizer contrast as in the fMRI datasets. In particular, we examined electrodes in which the envelope of the high-gamma signal was significantly higher (at p<.01) for trials of the sentence condition than of the nonword-list condition (for details, see Fedorenko et al., 2016).
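A minimal sketch of this localizer-style electrode selection on simulated envelopes follows; the trial counts, effect sizes, and the normal approximation to the t-test are assumptions made for illustration, not the recorded data or exact statistical procedure:

```python
# Sketch: keep electrodes whose mean high-gamma envelope is significantly
# higher for sentence trials than for nonword-list trials (one-tailed,
# p < .01), on simulated per-trial envelope data.
import math
import numpy as np

rng = np.random.default_rng(0)

n_trials, n_electrodes = 52, 120
# Simulated per-trial mean envelopes; the first 40 electrodes are given a
# built-in "language-responsive" sentence > nonword effect.
sentences = rng.normal(0.0, 1.0, size=(n_trials, n_electrodes))
nonwords = rng.normal(0.0, 1.0, size=(n_trials, n_electrodes))
sentences[:, :40] += 1.5

def one_tailed_p(a, b):
    """Welch t-statistic for mean(a) > mean(b), normal-approximation p-value."""
    t = (a.mean() - b.mean()) / math.sqrt(a.var(ddof=1) / len(a)
                                          + b.var(ddof=1) / len(b))
    return 0.5 * math.erfc(t / math.sqrt(2.0))  # 1 - Phi(t)

language_responsive = [
    e for e in range(n_electrodes)
    if one_tailed_p(sentences[:, e], nonwords[:, e]) < 0.01
]
print(len(language_responsive), "language-responsive electrodes selected")
```

With 52 trials per condition, the built-in effect is detected for essentially all responsive electrodes, while the p < .01 threshold admits very few false positives among the remaining ones.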

We constructed a stimulus-response matrix by i) averaging the z-scored high-gamma signal over the full presentation window of each word in each sentence, resulting in 8 data points per sentence per language-responsive electrode (97 electrodes total across the 5 participants; 47, 8, 9, 15, and 18 for participants S1 through S5, respectively), and ii) concatenating all words in all sentences (416 words across the 52 sentences), yielding a 416x97 matrix.

To examine differences in neural predictivity between language-responsive and other electrodes, we additionally extracted high-gamma signals from a set of 'stimulus-responsive' electrodes. Stimulus-responsive electrodes were defined as electrodes in which the envelope of the high-gamma signal for the sentence condition was significantly different (at p<0.05 by a paired-samples t-test) from the activity during the inter-trial fixation interval preceding the trial. This selection procedure resulted in 67, 35, 20, 29, and 26 electrodes for participants S1 through S5, respectively. As expected, this set of electrodes included many of the language-responsive electrodes; for the analysis in SI-4, we excluded the language-responsive electrodes, leaving 105 stimulus- (but not language-) responsive electrodes.

3. Neural dataset: fMRI (Blank2014). The materials (stories) were designed to be "deceptively naturalistic": they contained an over-representation of rare words and syntactic constructions embedded in an otherwise natural linguistic context. The stories were presented auditorily (each was ~5 min in duration), and following each story, participants answered 6 comprehension questions (see Blank et al., 2014 for details of the experimental procedure, data acquisition, and preprocessing).

Functional localization: As in the Pereira2018 dataset, data analyses were performed on fMRI BOLD signals extracted from the language network. From each language-responsive voxel of each participant, the BOLD time-series for each story was extracted.
Across the eight stories, the BOLD time-series included 1,317 time-points (TRs; TR = repetition time = 2 s, the time it takes to acquire the full set of slices through the brain). To align the neuroimaging data with the story text, we first split the text into consecutive 2-second intervals (corresponding to the fMRI TRs) based on the auditory recording; if a word straddled the boundary between intervals, it was assigned to the 2 s interval in which that spoken word ended. Each of the resulting intervals thus included a story "fragment", which could be a full short sentence, part of a longer sentence, or a transition between the end of one sentence and the beginning of another. Due to the temporal resolution of the HRF, whose peak latency is 4-6 seconds, we assumed that each time-point in the BOLD signal represented activity elicited by the text fragment that occurred 4 s (i.e., 2 TRs) earlier.

We constructed a stimulus-response matrix by i) averaging the BOLD signals corresponding to each TR in each story across the voxels within each ROI of each participant (averaging across voxels within ROIs was done to increase the signal-to-noise ratio), resulting in 1 data point per TR per language-responsive ROI of each participant (60 ROIs total across the 5 participants), and ii) concatenating all story fragments (1,317 'stimuli'), yielding a 1,317x60 matrix.

4. Behavioral dataset: Self-paced reading (Futrell2018). We used the data from Futrell et al. (2018) (n=179). (The set of participants excludes one participant for whom the data exclusions described below left only 6 data points or fewer.) Stimuli consisted of ten stories from the Natural Stories Corpus (the same materials as those used in Blank2014, plus two additional stories), and any given participant read between 5 and all 10 stories.
The stories were presented online (on Amazon's Mechanical Turk platform) visually, in a dashed moving-window display, a standard approach in behavioral psycholinguistic research (Just et al., 1982). In this approach, participants press a button to reveal each consecutive word of the sentence or story; when they press the button again, the word they just saw is converted back to dashes, and the next word is uncovered. The time between button presses provides an estimate of overall language comprehension difficulty and has been shown to be a robust measure of incremental processing load. Participants had to answer a sufficient number of comprehension questions correctly to be included, and we excluded reading times (RTs) that were shorter than 100 ms or longer than 3000 ms.

We constructed a stimulus-response matrix by i) obtaining the RTs for each word in each story for each participant (848,762 RTs total across the 179 participants; mean RT 338 ms, ±173 ms std. dev.), and ii) concatenating all words in all sentences (10,256 words across 485 sentences), yielding a 10,256x179 matrix.

Supplement
S1: Ceiling estimates for neural and behavioral datasets
S2: Scores generalize across metrics and layers
S3: Brain surface visualization of model predictivity scores
S4: Language specificity
S5: Model performance on diverse language tasks vs. model-to-brain fit
S6: Models' neural predictivity for each dataset is correlated with behavioral predictivity
S7: Performance on next-word prediction selectively predicts model-to-behavior fit
S8: Model architecture contributes to brain predictivity and untrained performance predicts trained performance
S9: Controls for untrained models
S10: Effects of model architecture and training on neural and behavioral scores
S11: Overview of model designs
S12: Distribution of layer preference (best-performing layer) per voxel for GPT2-xl on Pereira2018

Figure S1: Ceiling estimates for neural and behavioral datasets.
Due to intrinsic noise in biological measurements, we estimated a ceiling value to reflect how well the best possible model of an average human could perform, based on sub-samples of the total set of participants (see Methods-7). For each sub-sample, n − 1 participants are used to predict a held-out participant (except in Futrell2018, where this is done on split-halves, as described in the text). Each dot represents the correlation between the average scores of the n − 1 participants and the left-out participant for a random sub-sample of the number of participants indicated on the x-axis. We then bootstrapped 100 random combinations of those dots to extrapolate (gray lines) the highest possible ceiling if we had an infinite number of participants at our disposal. The parameters of these bootstraps are aggregated by taking the median to compute an overall estimated ceiling (dashed gray line, with 95% CI in error bars). We use this estimated ceiling to normalize model scores and here also report the number of participants at which the estimated ceiling would be met (which shows that, for Pereira2018 and Futrell2018, the number of participants we have is at or close to the asymptote value, respectively). Ceiling levels are .32 (Pereira2018), .17 (Fedorenko2016), .20 (Blank2014), and .76 (Futrell2018).

Figure S2: Scores generalize across metrics and layers. a) Model scores on each dataset generalize across different choices of a similarity metric; here we plot the predictivity metric used in the manuscript on the x-axis against a model-to-brain