The entire brain, more or less, is at work: ‘Language regions’ are artefacts of averaging

Models of the neurobiology of language suggest that a small number of anatomically fixed brain regions are responsible for language functioning. This view derives from centuries of examining brain injury causing aphasia and is supported by decades of neuroimaging studies. The latter rely on thresholded measures of central tendency applied to activity patterns resulting from heterogeneous stimuli. We hypothesised that these methods obscure the whole-brain distribution of regions supporting language. Specifically, cortical 'language regions' and the corresponding 'language network' consist of input regions and connectivity hubs. The latter primarily coordinate peripheral regions whose activity is variable, making them likely to be averaged out following thresholding. We tested these hypotheses in two studies using neuroimaging meta-analyses and functional magnetic resonance imaging during film watching. Both converged to suggest that activity averaged over heterogeneous words is localised to regions historically associated with language, but is distributed throughout most of the brain when the sensorimotor properties of those words are not averaged over. The localised word regions are composed of highly central hubs. The film data show that these hubs are not fixed. Rather, they are spatiotemporally dynamic, making connections with 44% of peripheral sensorimotor regions at any moment, and only appear in the aggregate over time. These results suggest that 'language regions' are an artefact of indiscriminately averaging across heterogeneous language representations and linguistic processes. Rather, these regions are mostly dynamic connectivity hubs coordinating whole-brain distributions of networks for processing the complexities of real-world language use, explaining why damage to them results in aphasia.


Introduction
Historical views of 'the organisation of language and the brain' derive from mid-19th century post-mortem lesion analyses. For the next 100 years, debate continued as to whether language is localised to specific brain regions or holistic, involving the whole brain. The holistic view lost ground in the 1960s when Norman Geschwind helped swing the pendulum back to a localizationist model 3. Roughly, this model posits that the left posterior superior temporal lobe (or 'Wernicke's area'; but see 4) and the left inferior frontal gyrus (or 'Broca's area') are the anatomical loci for language comprehension and speech production, respectively 5. These localisations and 'localizationism' have subsequently remained the standard view, likely for several reasons.

Most prominently, beginning in the 1970s and continuing until now, tens of thousands of in vivo positron emission tomography (PET) and functional magnetic resonance imaging (fMRI) studies of language appear to support these anatomical observations. Advances from neuroimaging have arguably been only marginally incremental over Geschwind's now 'classical' model. Specifically, contemporary models of the systems neurobiology of language recognise that 'language' is a complex process that can be decomposed into subprocesses that are localised to a few additional regions. These are all in close proximity to those in the classical model or their right hemisphere 'homologues' but are not typically ascribed to that older model. They include more of the superior temporal lobe (as anterior as the temporal pole), the middle temporal lobe, and premotor cortices [6][7][8][9]. This neo-localizationist view is communicated in the scientific literature using phrases like 'language regions' to describe portions of the brain and 'the language network' to describe commonly co-active regions of the brain. Indeed, these conceptualizations seem to be validated by the anatomical regions emerging from neuroimaging meta-analyses conducted across a wide variety of language stimuli and tasks (see Figure 1).

This apparent consistency raises the question of why those specific anatomical locations support speech perception and language comprehension and why damage to them results in aphasia. One obvious answer is that they mostly involve regions adjacent to those in which acoustic information from the cochlea first arrives in the neocortex. It stands to reason that nearby regions would act upon that auditory information and transform it into the putative representations and processes that distinguish mere sound from language (like phonemes and phonology). It would be inefficient for distant regions to contribute because of the time delays and metabolic costs. Indeed, studies using 'language localisers' suggest that 'the language network' processes only language information, whereas more broadly distributed regions perform 'extraneous' and 'autonomous' functions 'not necessary for linguistic processing' [10][11][12][13].

Averaging
An alternative model of the neurobiology of language explains the apparent anatomical consistency reflected by classical and contemporary models differently 14. It starts with the behavioural observation that 'language' is not only complex but also ambiguous at all levels, from speech sounds to semantics, syntax, and discourse. To resolve this ambiguity, a large body of empirical evidence suggests that the brain makes use of contextual information stored internally (e.g., in the form of knowledge and expectations) and available externally (e.g., in the form of observable speech-associated mouth movements and co-speech gestures) [15][16][17][18] (for further examples, see 14). The memory and perceptual processes associated with internal and external context are distributed across the whole brain [15][16][17][18][19][20][21]. These distributed networks appear to predict language input, e.g., in primary auditory cortex 22. Because context varies dynamically with each linguistic experience, language processing might never be the same twice 14,23. As such, the distributed regions involved in predicting language input would themselves be highly variable and dynamic.
Such a model is in the unenviable position of not conforming to Occam's razor. That is, in addition to having to explain the consistency of regions appearing across studies (see Figure 1), it must explain why we do not typically observe a whole-brain distribution of other regions engaged in language processing. One explanation is that they are concealed by existing methodological paradigms. For example, all neuroimaging studies rely on measures of central tendency, like averaging. In context-aware neurobiological models of language like that outlined in the prior paragraph, every word has distributed and variable activity patterns. Thus, averaging over different words, whether individually or in n-grams, sentences, or discourse, would reduce the activity in those 'peripheral' regions with variable activity patterns to a low value.

This low value is even less likely to survive given the specific statistical analyses used. To illustrate, participants might listen to 100 different words during fMRI. The resulting data is a four-dimensional matrix of tens of thousands of 'voxels' collected at multiple time steps. Regressions are conducted in each of those voxels at the individual participant level, using a regressor that is a convolution of word onset times with a 'canonical' hemodynamic response function. Though this is not technically an average, it effectively acts as one by collapsing over different word representations. Resulting regression coefficients are then used to conduct group-level statistical analyses, again at each voxel, resulting in average coefficients across participants. This is yet another level of collapsing over the different representations associated with each word, as these may differ by participant. Finally, if any voxels exceed a statistical 'threshold', it is said that they are 'activated'. However, thresholding requires prohibitive corrections for multiple comparisons because of the number of statistical tests that are done. Thus, connected peripheral voxels with legitimate but variable activity for any individual word are statistically unlikely to survive collapsing and averaging over word representations at the individual and group levels, particularly after thresholding. This would give the false impression that those voxels are not active and are 'extraneous'. (Note that the word 'averaging' as used throughout is a convenient simplification for this particular set of analysis steps and other similar neuroimaging analyses; arguably, all neuroimaging studies involve measures of central tendency or averaging in this sense.)
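This dilution effect can be made concrete with a toy simulation (our illustration in Python, not any study's pipeline): a single word regressor built by convolving all word onsets with a canonical-style hemodynamic response function rewards a 'hub' voxel that responds to every word, while a 'peripheral' voxel that responds strongly, but only to the subset of words matching its sensorimotor content, has its coefficient diluted toward threshold-vulnerable values. All names, the HRF shape, and the simulation parameters are assumptions.

```python
# Toy illustration: why one regressor that collapses over heterogeneous word
# responses favours consistently responding 'hub' voxels over variable
# 'peripheral' voxels.
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(0)
tr, n_scans, n_words = 1.0, 600, 100
t = np.arange(0, 30, tr)
hrf = gamma.pdf(t, 6) - 0.35 * gamma.pdf(t, 12)   # simple double-gamma HRF
hrf /= hrf.max()

onsets = rng.choice(np.arange(20, n_scans - 40), size=n_words, replace=False)

def convolved(onset_subset, amplitudes):
    stick = np.zeros(n_scans)
    stick[onset_subset] = amplitudes
    return np.convolve(stick, hrf)[:n_scans]

# Hub voxel: responds to every word. Peripheral voxel: responds 3x more
# strongly, but only to the ~10% of words matching its sensorimotor content.
hub_signal = convolved(onsets, np.ones(n_words))
mask = rng.random(n_words) < 0.1
peri_signal = convolved(onsets[mask], 3.0 * np.ones(mask.sum()))

X = np.column_stack([convolved(onsets, np.ones(n_words)), np.ones(n_scans)])
for name, signal in [("hub", hub_signal), ("peripheral", peri_signal)]:
    y = signal + rng.normal(0, 1, n_scans)            # measurement noise
    beta = np.linalg.lstsq(X, y, rcond=None)[0][0]    # word-regressor beta
    print(name, round(beta, 2))
# The hub beta is near 1; the peripheral beta is ~0.3 despite the stronger
# response to its preferred words, leaving it far more threshold-vulnerable.
```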

Distributed
Consistent with this picture, there is a large body of evidence that a distributed, context-aware model of the neurobiology of language better fits the data when analyses do not indiscriminately average over heterogeneous word representations. To give an example, word processing can be considered to be instantiated in a distributed neural ensemble that incorporates experiences associated with word learning 14,24. Within this framework, verbs activate brain regions more associated with motion perception and limb movement whereas nouns activate regions more associated with object processing 25. This is also true of finer representations, whereby lexical items that invoke auditory, visual, somatosensory, movement, and emotion-related meanings activate brain regions partially associated with those processes, e.g., the transverse temporal gyrus, calcarine sulcus, postcentral gyrus, central sulcus, and insula [26][27][28]. One cannot argue that these patterns simply represent post-linguistic 'conceptual' processing, as they occur 50-150 ms after word onset, before such processing is possible [29][30][31][32][33].
Results like these suggest that language processing is distributed across virtually the whole brain and that most of the evidence for this from imaging studies is obscured by averaging over words from different semantic categories. A similar conclusion can be drawn from other linguistic representations. For example, overlearned 'formulaic' language like 'you know' seems not to involve classical 'language regions' at all, explaining why such expressions are often preserved after severe damage, as in global aphasia 34,35. A similar argument applies not only to linguistic representations but also to linguistic processes. For example, syntactic processing is arguably distributed 36, with different brain regions participating in disparate syntactic functions when these are not collapsed over 37. They include regions outside of the classical and contemporary 'language regions', e.g., the basal ganglia, pre-supplementary motor area, and insula 38,39.
Compounding the issue, averaging over all of these linguistic representations and processes typically conceals the distributed patterns associated with individual differences. For example, left-handed participants activate the right more than the left motor cortices for verbs like 'throw', and the converse is true for right-handers 40. Similarly, professional hockey players activate motor cortices more than fans and novices do when processing hockey-related sentences 41,42. More generally, we have known for more than 20 years that individual differences in patterns of language-associated activity do not correspond well to group averages 43. Group analysis by clustering shows extensive variability during language comprehension, with no one group of participants capturing the aggregate and with individual participants varying on a spectrum in the relative contribution of multiple neural structures, e.g., visual and sensorimotor regions 44.

Hubs
We return to the question of why 'language regions' or 'the language network' remain after averaging, given that standard methodological practices obscure most of the participating brain regions. One suggestion derives from the fact that the network organisation of the brain is neither random nor uniform. Rather, it has a small-world topology, characterised by the presence of hubs 45. Hubs are highly central regions (i.e., they have a high degree), with a large number of connections to other regions 46. This leads to the hypothesis that 'language regions' are a combination of auditory regions, due to persistent auditory input (at least for spoken language), and hubs necessary for coordinating information processing in more dynamic, variable, and distributed peripheral regions.
Indeed, empirical evidence hints at the possibility that the regions identified in classical and contemporary models of the neurobiology of language are connectivity hubs. Structural MRI [47][48][49] and 'resting-state' fMRI studies [50][51][52][53][54][55][56][57][58][59][60] suggest that portions of superior and middle temporal and inferior frontal gyri are hubs. Though there are few studies measuring hubs using auditory or language stimuli, those that exist again suggest that these regions are functional hubs, encompassing even 'early' auditory cortices 61,62 . However, across all structural, resting-state, and task-based studies, the location of hubs in this set of language-associated regions is quite variable. For example, hubs variously encompass portions of the anterior, middle, or posterior superior temporal gyrus and sulcus, suggesting they might be dynamic 63 .

Hypotheses and Studies
To summarise, there is a conflict between localizationist claims about 'language regions' and 'the language network' and data demonstrating that language processing occurs in a widely distributed set of brain regions. An alternative to localizationism is that these distributed networks are not more obvious because averaging (and thresholding) over the more variable and distributed regions leaves only auditory input regions and the connectivity hubs that coordinate those distributed peripheral regions. As we are not aware of any work empirically demonstrating this, we conducted two neuroimaging-based studies to test the following four hypotheses:
- H1-Averaging: Averages over disparate linguistic representations and processes are localised to 'language regions'.
- H2-Distributed: When not averaged over, specific linguistic representations and processes are distributed throughout the whole brain.
- H3-Hubs: The 'language regions' that survive averaging are auditory input regions and connectivity hubs.
- H4-Dynamics: Language-associated connectivity hubs are dynamic, not fixed, and appear only in the aggregate.

Meta-Analyses
We test the first hypothesis by conducting a meta-analysis of neuroimaging meta-analyses, or a 'meta-meta-analysis', of language 64. In doing so, we attempt to demonstrate that studies that average over a variety of linguistic representations and processes consistently show activity in the same 'language regions' (H1-Averaging). However, this alone would not adjudicate between the localizationist view and the view that averaging conceals distributed regions. For this, we perform a second set of meta-analyses to determine whether language processing in the brain is distributed when comparing even gross linguistic representations like verbs and nouns (H2-Distributed). Next, we conduct a meta-analytic centrality analysis to test the third hypothesis, that the repeatedly activated 'language regions' expected in the meta-meta-analysis are hubs (H3-Hubs). As we discuss next, these meta-analyses can serve as only an approximate validation of hypotheses 1-3, and the fourth hypothesis (i.e., H4-Dynamics) cannot easily be tested using meta-analytic methods, requiring an independent fMRI study.

NNDb
That is, even if the meta-meta-analysis reveals only 'language regions' and the verb and noun meta-analyses reveal a distributed pattern, it cannot be shown that the former derives from averaging over the latter, because the meta-analyses draw on different studies and participants.
Demonstrating this, and determining the relationship between 'language regions' and connectivity hubs, must be done within participants. Furthermore, meta-analyses are static and cannot say anything about the dynamics of activity (required for H4-Dynamics). Thus, we used another approach, i.e., analysing film-fMRI data from our 'Naturalistic Neuroimaging Database' (NNDb) 65. First, we tested the hypotheses that language processing appears circumscribed to 'language regions' when using a measure of central tendency over words in the films participants watched (H1-Averaging) but widely distributed when examining finer sensorimotor representations of those words (H2-Distributed). We tested the third hypothesis by determining the extent to which these averaged word regions are hubs, calculated using multiple measures of voxel-wise network centrality (H3-Hubs). Finally, the fourth hypothesis, that language-associated connectivity hubs are dynamic and not fixed, was tested by making use of the temporally extended and context-rich nature of the film-fMRI data in the NNDb (H4-Dynamics).
Specifically, the fourth hypothesis derives from the context-aware model discussed above (see 'Averaging'). This model proposes that the forms of context used during naturalistic language processing vary dynamically and are therefore coordinated by spatially and temporally varying hubs.
For example, at one moment speech-associated mouth movements might be used to predict forthcoming speech sounds, involving more posterior superior temporal hubs whereas at another moment iconic co-speech gestures might be used to predict upcoming words, involving more anterior superior temporal hubs [15][16][17][18]22 . As such, we hypothesised that coordinating hubs are not a fixed set that always appear together when language is being processed. Rather, what have come to be called 'language regions' or 'the language network' are a virtual set that appears only collectively, when aggregating over an extended period of processing time (H4-Dynamics). To test this hypothesis, we derived our measures of voxel-wise network centrality using dynamic functional connectivity at overlapping running windows so that we could perform analysis both with and without aggregating over the film-fMRI data.

Averaging
We first conducted a neuroimaging 'meta-meta-analysis' to localise 'language regions' and to determine whether putatively different language representations and processes activate similar 'language regions' across different meta-analyses. Such a result would be expected both under the hypothesis that 'language regions' are a self-contained localised system or network and under the hypothesis that they are primarily a product of averaging. Specifically, 85 language-related meta-analyses were conducted (Table 1) and thresholded, correcting for multiple comparisons. Results were combined into a single brain image by count, followed by a one-sample 'group-level' statistical analysis, again correcting for multiple comparisons.

Table 1. Terms and searches used to conduct the meta-meta-analysis (Figure 1).

The results of this analysis revealed a circumscribed distribution of voxels activated in one or more language-related meta-analyses (Figure 1; Table 2). On average, each voxel was activated in 6.36 of the meta-analyses, with a maximum single-voxel activation in 45 meta-analyses (in the left dorsal superior temporal sulcus). Significant temporal lobe regions were activated across meta-analyses in both hemispheres, including the transverse temporal gyri, plana polare and temporale, superior temporal gyri and sulci, and posterior middle temporal gyri (Figure 1, white outline; Table 2).

Significant activity in the left hemisphere included the posterior inferior frontal gyrus, ventral and dorsal precentral sulcus and gyrus, and medial superior frontal gyrus ( Figure 1, white outline; Table 2).

Figure 1. Neuroimaging meta-meta-analysis of language. Eighty-five meta-analyses of language representations (e.g., phonemes, words, sentences) and associated processes (e.g., speech, semantics, syntax) were conducted using the Neurosynth (N = 57) and BrainMap (N = 28) databases (Table 2). The main effect of words from the NNDb study is shown with a black outline for comparison (see Figure 4).

Second, we performed neuroimaging meta-analyses of 'verbs' and 'nouns' as an example to determine the spatial distribution of language representations when more specific linguistic representations are not averaged over, as is typically the case. This would provide evidence as to whether or not the 'language regions' in the meta-meta-analysis are likely to be the result of averaging. Indeed, verbs and nouns produce a whole-brain distribution of activity (Figure 2, colours) that is not encompassed by the significant 'language regions' from the meta-meta-analysis (Figure 2, white outline).

Figure 2. Neuroimaging meta-analyses of 'verbs' and 'nouns'. Significant voxels from the language meta-meta-analysis in Figure 1 are shown as a white outline. The main effect of words from the NNDb fMRI study is shown with a black outline for comparison (see Figure 4).

Hubs
Finally, we developed a new meta-analytic approach to quantifying degree centrality in order to evaluate whether or not the 'language regions' in the meta-meta-analysis are network connectivity hubs 66,67 . If they are hubs, this would suggest that 'language regions' in the meta-meta-analysis remain after averaging over heterogeneous linguistic representations (like verbs and nouns) because they are hubs coordinating more distributed linguistic representations and processes. Specifically, we conducted 165,953 voxel-wise co-activation meta-analyses across 14,371 studies. As with functional connectivity, the regular co-activation of two or more regions suggests that those regions form functional connections or a network. Each meta-analysis was thresholded and combined by count (i.e., degree) as a measure of centrality and converted to a z-score for later thresholding.
The results indicate that a circumscribed set of regions in the task-driven brain have high centrality (Figure 3). These include many regions in the superior and middle temporal lobes and the posterior inferior frontal and precentral regions from the meta-meta-analysis (Figure 3, white outline). Indeed, the spatial correlation between the unthresholded centrality map and the language meta-meta-analysis was r = 0.52.

Figure 3.
Neuroimaging meta-analytic connectivity hubs. Degree centrality was calculated using 165,953 voxel-wise co-activation meta-analyses across 14,371 studies, thresholded at q ≤ 0.01 false discovery rate corrected for multiple comparisons and combined by count. The mean and standard deviation across grey matter were used to convert counts to z-scores to make interpretation easier. Dark red regions are those that have z-scores ≥ 3.89 or p ≤ 0.0001. Significant voxels from the language meta-meta-analysis in Figure 1 are shown as a white outline. The main effect of words from the NNDb fMRI study is shown with a black outline for comparison (see Figure 4).

NNDb
The results from the neuroimaging meta-analyses suggest that 'language regions' and 'the language network' might be the product of averaging over different word representations (Figures 1-2), leaving only connectivity hubs (Figure 3). However, these results are only suggestive, because neither the averaging nor the connectivity hubs were generated within any one study or set of participants, meaning that the observed findings might be due to some factor unrelated to our hypotheses.

Averaging
To overcome this issue, and for other reasons discussed in the Introduction (see Hypotheses and Studies), we used film-fMRI data from 86 participants to average over the heterogeneous properties of words in the brain, with the hypothesis that the results would be similar to the language meta-meta-analysis (Figure 1). Specifically, we used duration- and amplitude-modulated regression at the individual participant level, modelling words and allowing them to be modulated by 11 sensorimotor experiential dimensions associated with the meanings of those words (i.e., auditory, foot-leg, gustatory, hand-arm, haptic, head, interoceptive, mouth, olfactory, torso, and visual dimensions) 68. We also included sound energy, contrast luminance, and word frequency as nuisance modulators, and regressors for words without sensorimotor ratings and for nonword film segments. For the group-level analysis, the beta maps corresponding to the 'main effect' of words (henceforth 'word' or 'words') and the 11 sensorimotor modulators were entered into a linear mixed-effects model with beta, age, gender, and film as fixed effects and participant as a random effect. The results were corrected for multiple comparisons using a cluster-size correction with multiple voxel-wise thresholds to achieve α = 0.01, with a minimum cluster size of 20 voxels (540 microliters).
Consistent with our hypothesis, word processing in the brain was spatially limited (Figure 4). Regions of activity for words included the same bilateral superior and middle temporal lobe regions as in the meta-meta-analysis (Figure 4, black outline). The predominant difference between these maps was the general lack of activity in the left inferior frontal and precentral gyri and sulci (compare the black and white outlines in Figure 4). We included word frequency as a nuisance regressor to ensure that sensorimotor activity was not driven by this property (or by 'low-level' auditory and visual features).
However, it is well known that inferior frontal gyrus activity increases with decreasing word frequency [69][70][71] , suggesting a role in selection and/or retrieval demands 18,[72][73][74][75][76][77] . To determine if the observed lack of activity was accounted for by word frequency, we conducted another amplitude-modulated regression and group linear mixed-effects model, this time without the 11 sensorimotor modulators. Directly contrasting sound energy with word frequency demonstrates that the latter nearly completely covers the left inferior frontal and precentral gyri and sulci, among other regions (and conversely provides evidence for the efficacy of the sound energy nuisance regressor; Supplementary Figure S1).
For comparison, we next examined the similarity of activity patterns for word processing in this study with the results of the meta-meta-analyses. The percentage of thresholded word voxels that were also active in the language meta-meta-analysis was 52.64% (38,178 out of 72,522 microliters; compare the black and white outlines in Figure 4). If the voxels associated with word frequency are included (i.e., from the inferior frontal and precentral regions, see Supplementary Figure S1), that percentage increases to 72.60% (52,650 of 72,522 microliters). In contrast, the percentage of voxels from the 11 thresholded sensorimotor modulators of words (e.g., hand/arm) also in the language meta-meta-analysis was 7.42% (M = 5,380.36 of 72,522 microliters, SD = 10,279.11; also see the discussion in the next section). The spatial pattern of activity for the unthresholded word map was correlated with the unthresholded language meta-meta-analysis at r = 0.51. In contrast, the 11 sensorimotor modulators of words did not correlate with the language meta-meta-analysis on average (with the mean r = -0.01, SD = 0.13).

Distributed
In parallel with the verb/noun meta-analyses (Figure 2), we analysed the fine-scale sensorimotor properties of words within participants, evaluating their regional distribution in the brain. In this analysis, sensorimotor processing was distributed throughout much of the brain (Figure 4, yellows and reds). Figure 4 shows the main effects of the 11 sensorimotor modulators (e.g., foot/leg; Figure 4, reds) and pairwise contrasts between modulators (e.g., foot/leg vs. hand/arm; Figure 4, yellows), averaged and projected onto the brain, along with voxels that were not significant in any of these comparisons (Figure 4, blues). The language meta-meta-analysis from Figure 1 is included as a white shaded outline for comparison to words.

Hubs
In parallel with the overlap between the language meta-meta-analysis and the meta-analytic measure of centrality (Figure 3), we expected regions associated with words to be connectivity hubs. To test this hypothesis, we constructed individual voxel-wise networks in grey matter only, using a sliding-window approach and averaging over four measures of network centrality (degree, eigenvector, betweenness, and closeness). We clustered the centrality values using Ward's minimum variance 78, which divided voxels in each time window into low- (given a value of one) and high-centrality (given a value of two) clusters.
To examine hubs aggregated across time, we first averaged across all windows and performed a linear mixed-effects model for the group-level analysis. The fixed effects were centrality (high and low), age, gender, and film, with participant as a random effect. A correction for multiple comparisons using a cluster-size correction with multiple voxel-wise thresholds was applied (again, using α = 0.01). Results show that high-centrality voxels are significantly more active than low-centrality voxels in most of the superior and middle temporal plane (Figure 5, colours) and that the main effect of words overlaps with these regions (Figure 5, black outline). Indeed, 78.26% of the word voxels were high > low centrality (thresholded at α = 0.01 with a minimum individual voxel p-value ≤ 0.001). Even when thresholding results further to include only the top 90% of the high-centrality voxels, there was still a 39.00% overlap. Collectively, these results suggest that the word hubs only appear in the aggregate over time.

Finally, it might be the case that highly central word hubs are not connected to the sensorimotor periphery, e.g., if the periphery is 'extraneous' and processing is somehow proceeding there independently. To address this, on an individual participant basis, we calculated the connectivity profiles of windows in which the mean cluster assignment of the voxels in the word mask was ≥ 1.8 (i.e., ≥ 90% high-centrality, returning to the more categorical high/low windows), with the periphery defined, again, as the sensorimotor maps (i.e., the connectivity between the regions within the black outline and those in colour in Figures 1-4). This analysis revealed that when the word mask was a hub, it shared an average of 43.99% (SD = 1.33) of its connections with peripheral voxels across all participants. This helps explain the low spatial correlation between the thresholded windows and the word map and the relatively low percentage of word masks acting as hubs at any given time window above. That is, when word voxels are hubs at any given moment, they are connected to a periphery, driving down correlations with the word map, which has been aggregated over time and thus excludes the less central and more variable periphery.
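To make the periphery-connection computation concrete, here is a hypothetical sketch under stated assumptions: a binarised connectivity matrix for one time window, boolean word-mask and periphery masks, and the 1/2 cluster coding from the Ward step. None of these names come from the paper's code.

```python
# Sketch: share of a word-hub's connections landing in the sensorimotor
# periphery, for windows in which the word mask is acting as a hub.
import numpy as np

def periphery_share(adj, word_mask, periphery_mask, cluster_labels):
    # Only evaluate windows where the word mask's mean cluster assignment is
    # >= 1.8 (clusters coded 1 = low, 2 = high centrality, as described above).
    if cluster_labels[word_mask].mean() < 1.8:
        return None
    hub_edges = adj[np.ix_(word_mask, ~word_mask)]                 # edges leaving the hub
    to_periphery = adj[np.ix_(word_mask, periphery_mask & ~word_mask)]
    return to_periphery.sum() / hub_edges.sum()                    # fraction to periphery
```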

Summary
Classical and contemporary models of the neurobiology of language suggest that there are a small number of fixed 'language regions'. Yet this claim is juxtaposed with substantial evidence that language processing is distributed throughout the entire brain. To reconcile these differences, we tested the hypothesis that 'language regions' result from using measures of central tendency and thresholding across heterogeneous language representations and processes. Such averaging minimises the activation in regions with greater spatial variance while leaving activation in auditory input regions (in the case of vocal/heard languages) and connectivity hubs that coordinate those more variable peripheral regions.
Indeed, using both neuroimaging meta-analyses and film-fMRI, we show that language processing seems at first blush to occur in a very circumscribed set of fixed brain regions (Figures 1 and 4; see also Supplementary Figure S1).
This consistency in localisation disappears when words are no longer treated homogeneously (Figures 2 and 4). That is, meta-analyses of even the gross distinction between verbs and nouns suggest that large swaths of brain tissue participate in language processing (Figure 2). When making even finer distinctions in the film-fMRI data, the distributed nature of processing becomes even more salient (Figure 4, red and yellow). Specifically, individual sensorimotor properties of words (e.g., foot and leg, gustatory, interoceptive, and olfactory) produce distributed patterns of activity, with each of the 11 sensorimotor dimensions activating, on average, about 4% of the brain beyond the word processing regions (over which those properties are averaged). Words comprise multiple sensorimotor properties, and their conjoined activity encompasses up to 67% of the remainder of the brain. This includes regions important for processing action, emotions, interoception, movement, somatosensation, and vision (among others).
Thus, averaging and thresholding lead to a misleading minimisation of activity in regions sometimes regarded as not language related or 'extraneous' (compare Figures 1 and 2; Figure 4). We hypothesised that this leaves only auditory processing regions centred around primary auditory cortex and regions of high connectivity that coordinate those distant but no longer observable regions. Indeed, most 'language regions' are connectivity hubs as defined by degree centrality in a large-scale meta-analytic connectivity analysis (Figure 3) or by the average of four different measures of centrality in the NNDb data (Figure 5; Supplementary Figures S2 and S3). However, additional analysis shows that these connectivity hubs are not fixed. Rather, they are spatiotemporally dynamic, with any individual hub occupying only some of what have become known as 'language regions' at any given moment and appearing together only when aggregating over time. On this moment-by-moment basis, language-related hubs are connected to a periphery of less central and more variable brain regions that are typically averaged away.

Local vs. Distributed
Collectively, these results call for a reconsideration of localizationist accounts of the biological basis of language processing in the brain. Whether implicitly or explicitly, the localizationist view of the 19th century, reconstituted in the imaging era of recent years, derives from confirmatory though potentially misleading results from thousands of neuroimaging studies. This is reflected in the widespread use of 'language localisers'. These are ostensibly used for good reasons, to account for individual variation in the location of language regions [79][80][81], despite arguments against their usefulness [82][83][84]. Regardless, a localiser task typically involves listening to intelligible language and to less intelligible language (among other variants), with analysis that requires averaging over many different kinds of linguistic representations and processes in both conditions, subtracting them, and thresholding. It is therefore not surprising, based on the arguments advanced here, that across 45 languages and even constructed languages like Klingon and Dothraki, localisers simply and repeatedly show the same regions as seen in Figure 1 85,86. The inappropriate adoption of the localizationist view is also reflected in contemporary model building, in which language processing is said to be supported by only a small number of fixed regions 5,7,8,87,88. The implication of our results is that these models are incomplete, in that they do not account for all participating regions, or wrong, in that they claim only 'language regions' process language.
There are several counterarguments that might be proffered in favour of maintaining more localizationist norms. These include that the more distributed results that we observe, particularly those outside of 'language regions', are 1) not representative of language processing but some other process like imagery or 2) limited to only 'sensorimotor' representations and 'semantic' or 'conceptual' processing. We next address these counterarguments, demonstrating that our distributed results are not 'non-linguistic' and that they generalise to other language representations and processes. In the subsequent section, we further argue that a whole-brain model of the neurobiology of language is necessary for language comprehension and sketch a corresponding network architecture that might better account for the neurobiology of language than current models.

Doing Language?
A sceptic might argue that the 'language regions' we observe do in fact only process language and that the other more distributed regions are doing something possibly independent and nonlinguistic, like post-perceptual imagery, conceptual processing, or thought. Taken collectively, the methods and stimuli that we used in the NNDb study help ensure that this is not the case. In the regression analysis, the sensorimotor properties were included as individual word amplitude modulators, meaning they are time-locked to word processing activity and thus unlikely to be related to other processes. The model also included auditory and visual nuisance regressors to control for other features of the films that might covary with those words (Supplementary Figure S1). Moreover, speech in films is continuous, without pauses between words, making it unlikely that participants spent time imagining or conceptualising those words.
Further bolstering this argument, previous studies support the claim that those more distributed regions are doing something linguistic. Specifically, 'language regions' and other sensorimotor regions form networks during word processing and the latter are active within 50-150 ms after the word onset [29][30][31][32] .
Together, such results suggest that the sensorimotor component of the activation is part of and inseparable from the distributed representation of the words themselves 24 . Our results support this view in that regions associated with words are connected to a large sensorimotor periphery on a moment-to-moment basis. This view is also consistent with results in other domains like vision and memory showing that representations are not confined to but are maintained across reciprocally interconnected neurons and regions [89][90][91] .
In contrast, the strongest claims involving 'language localisers' suggest that the putatively domain-general 'multiple demand network' (MDN) 'shows no sensitivity to linguistic variables' 92 .
This argument creates a false and sterile dichotomy 93. First, the MDN is often sensitive to 'linguistic information', albeit at reduced levels 10,11,92,[94][95][96][97]. Second, analyses are done by averaging over large regions with different connectivity and cytoarchitecture profiles 98, comprising about 25% of the grey matter voxels in the brain. This would lead to less sensitivity for detecting 'linguistic information', particularly given the demonstrations herein, e.g., that this activity is dynamic and variable. Finally, MDN regions are appropriately named, as they have some of the highest measures of functional diversity and 'neural reuse' in the human brain 93,[99][100][101][102][103]. These functions are most certainly not limited to domain-general processes. For example, premotor cortices play clear roles in speech perception and language comprehension 104. Indeed, as we expand on in the next section, these putatively non-language regions contribute specifically to language processing when analyses do not average over individual language representations and processes.

More Than Sensorimotor?
It is also unlikely that our argument that 'language regions' are the result of averaging applies only to 'sensorimotor' representations and 'semantic' or 'conceptual' processing. Many other examples can be given of non-'language regions' that are involved in language beyond their role in sensorimotor representations. To give a non-exhaustive overview, orbitofrontal cortex, among other regions, plays a role in indirect speech act comprehension 105. The dorsolateral prefrontal cortex (part of the putative MDN) plays numerous specific roles in 'discourse management, integration of prosody, interpretation of nonliteral meanings, inference making, ambiguity resolution, and error repair' 106. Dorsal medial prefrontal regions are involved in understanding speech acts, nonliteral meanings, and emotionally and socially laden language 107. These regions and the precuneus are involved in various aspects of generating and updating language-based situation models 108,109. More generally, these regions comprise aspects of the 'default mode network', which has been directly linked to 'language regions', inner speech, and language comprehension [110][111][112][113][114].
Other regions, like the anterior cingulate and parietal cortices, help implement necessary adaptive language control processes 115 . Entire distributed motor systems play many roles in supporting speech perception 104 . The insula and limbic structures like the amygdala are involved in affective prosody processing [116][117][118] . Other subcortical structures like the basal ganglia 38,119 , thalamus [120][121][122] , and cerebellum 123 play numerous linguistic roles, including the processing of speech, semantics, and syntax. Visual regions like the putative 'visual word form area' 124 and face and motion processing regions 125,126 play roles in audio-only speech perception, even when there is no visual information available. Indeed, it has long been recognized that there is a 'basal temporal language area' in the fusiform gyri in occipital cortices associated with receptive aphasia in the auditory modality [127][128][129][130][131][132] .

NOLB Model
Claiming that these various distributed language representations and processes are somehow non- or extra-linguistic or, worse, 'merely' pragmatic is problematic. It threatens to reduce our real-world understanding of language to some core elements, e.g., syntactic recursion 133,134 or semantic compositionality 135, that in themselves cannot explain how the human brain manages to generate and extract information through language. However, we suggest there is an even more fundamental reason for rejecting linguistic/nonlinguistic dichotomies in favour of a whole-brain neurobiological model. That is, doing so is part of an account that helps resolve a fundamental problem in speech perception and language comprehension, namely, how the brain overcomes linguistic ambiguity. Next, we elaborate on the mechanistic model (sketched in the Introduction) by which the brain uses both linguistic and nonlinguistic contextual information to predict language representations, constraining interpretation 14.
Language representations and their associated processes are ambiguous at all levels (for a review, see 14). For example, there are no known acoustic features that unambiguously distinguish phonemes (the 'lack of invariance problem'). Words are not only homonymous (e.g., 'bat' and 'bank') but most are polysemous (e.g., the word 'set' has more than 450 meanings). Similarly, sentences and discourse can be syntactically and/or semantically ambiguous. How does the brain resolve all these ambiguities? We and others have taken proposals from Helmholtz in vision ('unconscious inference', circa the 1860s) and Stevens and Halle in speech ('analysis-by-synthesis', circa the 1960s) and proposed that language ambiguity necessitates, in more modern parlance, prediction 14,104,136.
Specifically, the brain uses the abundant internal context (in the form of memories) and external or observable context available in the real world to make predictions about forthcoming language representations. To give an example, observable speech-associated mouth movements precede the associated auditory information by about 100-300 ms 137. These can be used to make predictions about the acoustic patterns subsequently arriving in primary and surrounding auditory cortices and to constrain the interpretation of which phonemes are intended. We have shown that this is how audiovisual speech perception works, involving feedforward and feedback network interactions between visual, ventral pre- and primary motor cortices, and posterior superior temporal cortices (among others) 15,17.
Because language is ambiguous and requires continual predictions derived from context, and context is always variable, the brain regions involved in speech perception and language comprehension will necessarily be highly variable. As these distributed regions make predictions, they form connected networks. The only common denominators in these spatially variable and distributed networks are the auditory input regions and the connectivity hubs coordinating them. This has two implications relevant here. First, because, e.g., visual and motor regions are predicting intended phonemes, the distinction between 'linguistic' and 'nonlinguistic' is fuzzy at best. Second, averaging over the various representations and processes occurring in different and more variable distributed networks leaves only the aforementioned auditory and hub regions after thresholding.
What kind of network architecture can accommodate such a model? A classically modular architecture is not capable of achieving the flexibility needed to accommodate continually changing contexts and associated networks. An alternative architecture incorporates core-periphery structures that combine two network dynamics: The core is a collection of stable hubs that control a set of flexible regions, or periphery, that manifest remarkable temporal variability 138,139 . Core-periphery networks allow for high complexity, robustness to perturbations and rewiring of connections, and recovery following injury 139,140 . Our data appear to fit into a core-periphery framework with the identified 'language regions' corresponding to dynamic cores and the rest of the distributed activity (that is typically averaged away) constituting a dynamic periphery. It is left to future work to test this architecture with a whole-brain, voxel-wise core-periphery algorithm 141 .
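As an illustration of why such an architecture would produce our averaging results, consider the following toy simulation (ours, not an implementation of any published core-periphery algorithm): a stable core reconnects to a different random subset of the periphery in every window, so only core edges survive time-averaging and thresholding. All parameters are arbitrary assumptions.

```python
# Toy core-periphery dynamics: stable core, shifting periphery.
import numpy as np

rng = np.random.default_rng(1)
n_core, n_peri, n_windows = 10, 90, 200
n = n_core + n_peri
mean_adj = np.zeros((n, n))
for _ in range(n_windows):
    adj = np.zeros((n, n))
    adj[:n_core, :n_core] = 1                        # stable core-core edges
    active = n_core + rng.choice(n_peri, size=40, replace=False)
    adj[np.ix_(range(n_core), active)] = 1           # shifting core-periphery edges
    mean_adj += adj / n_windows

print((mean_adj[:n_core, :n_core] > 0.9).mean())     # ~1.0: core survives averaging
print((mean_adj[:n_core, n_core:] > 0.9).mean())     # ~0.0: periphery averaged away
```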

Implications
The studies and results here suggest that the historical neo-localizationist view of the neurobiology of language needs radical revision 14,142. By obscuring the distributed and whole-brain nature of language processing systems, static neuroimaging studies have inadvertently reinforced inferences made from lesion analyses in the 19th century. The latter suggest that gross language dysfunctions are caused by damage to a small set of fixed regions. Similarly, neuroimaging analyses based on averaging over linguistic 'apples and oranges' and thresholding give the false impression that language processing occurs in a small set of fixed regions. Results here suggest that both the language problems caused by injury and the localizationist inferences that can be drawn from neuroimaging can be explained by considering that 'language regions' and 'the language network' are actually auditory input regions and connectivity hubs. This implies that aphasia is the result of damaging the core regions that coordinate a whole-brain distribution of regions that process language and not the result of damaging a putative 'language network'. Indeed, empirical evidence demonstrates that damage to connectivity hubs is more likely to underlie various disorders, including aphasia [143][144][145][146][147][148]. This implies that a different (perhaps core-periphery) model of the neurobiology of language is needed to replace contemporary models to help advance treatment for aphasia (where effect sizes are low on average) 149. This might involve focusing therapy around individualised and preserved linguistic representations and processes in the periphery.
More generally, it is perhaps obvious that our conclusions about averaging and thresholding can be extended to every domain in which psychological ontologies are probed with neuroimaging data 44,150 .
It is hard to think of any neuroimaging study in which stimuli do not contain multiple representations or processes that are mathematically integrated before thresholding. Therefore, what is revealed by most studies and neuroimaging meta-analyses is more likely to reflect underlying connectivity hubs rather than the full distribution of regions involved. This conclusion probably scales with the complexity of the behaviour under investigation. For example, visual object processing studies average over many different stimuli whose sensorimotor properties and affordances greatly vary, leaving only core regions involved in complex visual processing (like the fusiform gyrus). In contrast, language and social processing are at a pinnacle of human functioning and likely require more dynamic hubs orchestrating far more regions given the complexity of these processes. This implies that we need much more consideration of the representations and processes engaged by the stimuli and tasks we give participants and considerably more emphasis on individual differences [150][151][152][153] .

Averaging
To conduct the neuroimaging 'meta-meta-analysis', we manually searched the Neurosynth 154 and BrainMap 155 databases (accessed in August 2022) for available terms and studies related to language representations (e.g., 'phonological', 'words', 'sentences') and associated processes (e.g., 'speech', 'semantics', 'syntax'). We excluded terms and study descriptions pertaining to music and reading or having an obvious visual element. This resulted in 57 Neurosynth terms and 28 BrainMap searches that were used to assemble the meta-meta-analysis (see Table 1). Each BrainMap search returned a set of studies and corresponding coordinates, and the 'GingerALE' application (version 3.0.2) was used to perform activation likelihood meta-analyses on those coordinates 155,[158][159][160]. These were done in MNI space and thresholded using a false discovery rate correction of q ≤ 0.01 to match the meta-analyses done in Neurosynth.
Each of the resulting 85 meta-analyses was further preprocessed to have a minimum cluster size of 50 voxels (400 microliters). The spatial pattern of activity for the combined Neurosynth meta-analyses was highly correlated with the BrainMap meta-analyses (r = 0.76). Thus, the results were combined into one neuroimage by count using '3dmerge' available in the AFNI software package 161 .
Additionally, we conducted a one-sample 'group level' statistical analysis with 'software package' as a covariate (i.e., Neurosynth or BrainMap) and a minimum of five contributing meta-analyses using '3dttest++', also available in AFNI 161 . Results were thresholded using a false discovery rate correction of q ≤ 0.01 and a minimum cluster size of 50 voxels (400 microliters). We used the online database at neurosynth.org to provide functional descriptions of the resulting clusters of activation (from dataset version 0.7, released July 2018) 154 . Specifically, we give the top 10 significant functional terms at the centre of mass coordinates, excluding anatomical terms like 'heschl gyrus', 'inferior', and 'superior temporal'.
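As a minimal Python stand-in for the combine-by-count step described above (the analysis itself used AFNI's '3dmerge'; array names and file handling here are assumed), the count map can be sketched as:

```python
# Combine thresholded meta-analysis maps by per-voxel count.
import numpy as np

def count_map(thresholded_maps):
    """Stack binarised, FDR-thresholded meta-analysis maps and count, per
    voxel, how many meta-analyses report significant activation."""
    stacked = np.stack([(m != 0).astype(int) for m in thresholded_maps])
    return stacked.sum(axis=0)

# e.g., counts = count_map(neurosynth_maps + brainmap_maps)   # 57 + 28 = 85 maps
# The spatial agreement between databases reported above (r = 0.76) is then a
# Pearson correlation over voxels:
# r = np.corrcoef(count_map(neurosynth_maps).ravel(),
#                 count_map(brainmap_maps).ravel())[0, 1]
```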

Distributed
To determine the extent of language processing in the brain when not averaging over all representations, the NeuroQuery database was queried for 'verbs' and 'nouns' as an example of differentiable representations that are often averaged over, e.g., as would necessarily be the case in the 'sentences' meta-analysis (accessed August 2022) 162. Unlike Neurosynth, NeuroQuery produces meta-analyses that are predictions of where in the brain a study about the input terms is likely to report activity. We used this approach over Neurosynth because it can make use of coordinates from nearly 10 times more articles, which would presumably make results more robust for relatively less frequently studied topics. In particular, NeuroQuery produced a meta-analysis from 662 articles for the term 'verbs' (compared to 107 in Neurosynth) and 889 for 'nouns' (compared to 100 in Neurosynth).
Results are presented unthresholded in Figure 2. To determine the significant regions of activation reported in the Results section, the meta-analyses were subtracted from each other and thresholded at z = 3.89 (p ≤ 0.0001), excluding any overlap and using a minimum cluster size of 50 voxels (400 microliters). The patterns of activity for these thresholded results were similar to, but more robust than, the q ≤ 0.01 false discovery rate corrected 'verbs' and 'nouns' meta-analyses from Neurosynth.

Hubs
We developed a new meta-analytic network connectivity hub metric to examine whether the 'language regions' resulting from the meta-meta-analysis are hubs 66,67. First, we used the Neurosynth software package to conduct 165,953 voxel-wise co-activation meta-analyses across 14,371 studies. Following the logic of functional connectivity, the regular co-activation of two or more regions suggests that those regions form functional connections or a network. Each meta-analysis was thresholded at q ≤ 0.01, false discovery rate corrected, and combined by count using '3dmerge', available in the AFNI software package 161. This count corresponds to the number of connections (edges) that a voxel (node) has and is thus a meta-analytic equivalent of degree centrality. The resulting map was converted to z-scores computed over grey matter, using a white matter mask generated by Freesurfer to limit results to grey matter 163,164. This is presented unthresholded in Figure 3. Hubs were defined as voxels with 3.89 standard deviations more connections than the mean (equivalent to p ≤ 0.0001). Overlaps with the meta-meta-analysis results were tested at this z-value as well as at z = 3.29 (p ≤ 0.001), z = 2.58 (p ≤ 0.01), and z = 1.96 (p ≤ 0.05).
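The metric can be sketched as follows, assuming binarised co-activation maps and a grey matter mask are already in memory (a simplified stand-in, not the Neurosynth/AFNI implementation):

```python
# Meta-analytic degree centrality: per-voxel count of significant
# co-activation meta-analyses, z-scored within grey matter.
import numpy as np

def metaanalytic_degree(coactivation_maps, grey_matter):
    # coactivation_maps: (n_meta_analyses, n_voxels) binarised,
    # FDR-thresholded maps; grey_matter: boolean (n_voxels,) mask.
    degree = coactivation_maps.sum(axis=0).astype(float)   # edges per voxel
    mu, sd = degree[grey_matter].mean(), degree[grey_matter].std()
    z = (degree - mu) / sd
    hubs = grey_matter & (z >= 3.89)                       # p <= 0.0001
    return z, hubs
```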

Neuroimaging
Our fMRI data was derived from 86 participants who watched one of 10 previously unseen full-length films from 10 different genres. Unless otherwise noted, all preprocessing and data visualisation was done with the AFNI software package, and specific programs are indicated where appropriate 161. First, anatomical images were corrected for intensity non-uniformity, deskulled, and nonlinearly aligned to an MNI template (i.e., MNI152_2009_template_SSW.nii.gz; using '@SSwarper') 156,157. Next, Freesurfer's 'recon-all' was run with default parameters (version 7.0) 163,164. This allowed us to segment anatomical images into ventricle and white matter regions of interest to be used to make nuisance regressors.

Regression
We used the AFNI program '3dDeconvolve' to conduct a single duration- and amplitude-modulated multiple linear regression on the preprocessed time series to determine the distribution of brain activity for words and their associated sensorimotor properties. This approach allows us to find a 'main effect' for words as well as to determine how the sensorimotor properties of those words modulate the amplitude of the response to them. To do this, films were first annotated for word onset and duration using a combination of machine-learning-based speech-to-text translation, dynamic time warping to the subtitles, and manual correction (see 165). Next, we found the intersection of each word annotation with the available sensorimotor ratings; words without ratings were modelled with a separate regressor.
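To make the modulated-regression logic concrete, here is a minimal sketch (ours, not the '3dDeconvolve' implementation) of how one duration- and amplitude-modulated regressor can be built; the HRF shape and all names are assumptions:

```python
# One duration- and amplitude-modulated regressor: each word contributes a
# boxcar of its own duration, scaled by a mean-centred sensorimotor rating,
# then convolved with a canonical-style HRF.
import numpy as np
from scipy.stats import gamma

def modulated_regressor(onsets, durations, ratings, n_scans, tr=1.0):
    t = np.arange(0, 30, tr)
    hrf = gamma.pdf(t, 6) - 0.35 * gamma.pdf(t, 12)
    ratings = np.asarray(ratings, float)
    ratings -= ratings.mean()                          # mean-centre the modulator
    boxcars = np.zeros(n_scans)
    for onset, dur, r in zip(onsets, durations, ratings):
        i, j = int(onset / tr), int((onset + dur) / tr) + 1
        boxcars[i:j] += r                              # duration-scaled events
    return np.convolve(boxcars, hrf)[:n_scans]

# One such regressor per sensorimotor dimension (11 in total) would sit in the
# design matrix alongside the unmodulated word regressor ('main effect').
```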

LME
The resulting word and 11 sensorimotor amplitude-modulated beta coefficient maps from the multiple linear regression were input into a linear mixed-effects model using '3dLME' 172. Age, gender, and film watched were included as covariates for each participant, with age centred around the mean. We set participant as a random effect, whereby the intercept was allowed to vary by a small random amount around the group average for each participant. We computed 'baseline' contrasts for words and all 11 sensorimotor norms, as well as pairwise contrasts between norms.
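For concreteness, a hedged per-voxel analogue of this model in Python's statsmodels (the analysis itself used AFNI's '3dLME', which is R-based; the DataFrame layout and the explicit 'condition' term for the word/modulator beta maps are our assumptions):

```python
# Per-voxel linear mixed-effects sketch: fixed effects for condition (word or
# sensorimotor modulator), age, gender, and film; random intercept per
# participant. df holds one row per participant x condition for this voxel.
import statsmodels.formula.api as smf

def voxel_lme(df):
    df = df.assign(age=df["age"] - df["age"].mean())   # centre age on the mean
    model = smf.mixedlm("beta ~ condition + age + gender + film",
                        df, groups="participant")
    return model.fit()
```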
The results of these contrasts were corrected for multiple comparisons using a multi-thresholding approach (modelled after 173). First, we estimated the smoothness and autocorrelation function of neighbouring voxels using the '3dFWHMx' command in AFNI 174. Then we ran '3dClustSim' over six uncorrected individual voxel p-values (0.05, 0.02, 0.01, 0.005, 0.002, and 0.001) to achieve an alpha (α) threshold of 0.01. Using the resulting significant cluster sizes, whereby voxels are contiguous if they touch at faces or edges and share sign (positive or negative) at each p-threshold, we merged the thresholded maps across p-thresholds to obtain significant voxels (α = 0.01). We discuss and display all results using a minimum cluster size of 20 voxels (540 microliters).
We conducted a second duration- and amplitude-modulated regression and group linear mixed-effects model. This additional analysis was intended to help understand the lack of inferior frontal gyrus activity for words in the analysis described in this and the prior section. We reasoned that word frequency likely accounts for this lack of activity, given that activity in these regions has been directly demonstrated to increase with decreasing word frequency [69][70][71]. The analyses were the same as described but did not include the sensorimotor modulators. To have a comparison, we contrasted word frequency directly with sound energy. This also allowed us to determine whether sound energy accounted for activity in 'lower-level' auditory-related brain regions, providing reassurance that it was serving its role as a nuisance regressor.

Hubs
We next sought to determine whether the voxels corresponding to the main effect of words in the 3dLME model correspond to hubs. We did this using measures of centrality, which index how important a node is for the integrity of, and information flow within, a network. Centrality can be determined using various metrics that provide different information on the role of the node of interest in the network. We measured degree, eigenvector, closeness, and betweenness centrality to ensure that we robustly characterised centrality across metrics 46,[175][176][177].
Degree centrality is the sum of inward and outward connections of a node. Eigenvector centrality is a measure of influence on a network, meaning that a high-connectivity node linked to nodes of high connectivity will have higher eigenvector centrality (i.e., be more influential) than a high-connectivity node linked to low-connectivity nodes. Betweenness centrality measures the number of shortest paths that pass through a given node, and closeness centrality measures the inverse of the total distance of the shortest paths between a node and all other nodes. Although these centrality metrics provide different details on the importance of a node, they are correlated with one another 177,178.
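The four measures can be illustrated on a toy graph with networkx (the actual analysis computed them voxel-wise on windowed fMRI networks, as described next):

```python
# Four node-centrality measures on a small example graph.
import networkx as nx

G = nx.karate_club_graph()                 # stand-in for a voxel network
centrality = {
    "degree": nx.degree_centrality(G),
    "eigenvector": nx.eigenvector_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "closeness": nx.closeness_centrality(G),
}
# Each dict maps node -> score; in the analysis, the four per-voxel scores
# are later clustered together into high- vs low-centrality groups.
```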
We calculated these four measures of centrality on a voxel-wise basis and in a dynamic manner. To do this, we constructed time-varying connectivity matrices using a sliding-window approach. First, the time series was resampled from three to four mm³ resolution (using '3dresample') because analyses at the native three mm resolution were prohibitively slow given the available computational resources. Resampled time series were then divided into windows of one minute in length, sliding every five seconds to allow a 55-second overlap between one window and the next. Next, pairwise Pearson's product-moment correlation coefficients were computed on a voxel-wise basis in each window using the AFNI program '3dDegreeCentrality' 174. Finally, the resulting correlation matrices were proportionally thresholded to obtain 5% sparsity in each time window. Suprathreshold values were considered connections between voxels and were used to build the connectivity matrices from which the centrality values described above were calculated.
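A compact sketch of this windowing-and-thresholding step, under stated assumptions (the actual computation used AFNI's '3dDegreeCentrality'; variable names are ours):

```python
# Sliding-window voxel-wise correlation with proportional thresholding.
import numpy as np

def windowed_adjacency(ts, tr, win_s=60, step_s=5, sparsity=0.05):
    # ts: (n_timepoints, n_voxels) time series for one participant.
    win, step = int(win_s / tr), int(step_s / tr)
    for start in range(0, ts.shape[0] - win + 1, step):
        corr = np.corrcoef(ts[start:start + win].T)    # voxel x voxel
        np.fill_diagonal(corr, -np.inf)                # ignore self-connections
        cutoff = np.quantile(corr[np.isfinite(corr)], 1 - sparsity)
        yield corr >= cutoff                           # binary adjacency, 5% dense
```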
Despite some differences, these four centrality metrics had significant Spearman's rank correlations with one another at the group level (i.e., across time windows and participants; M ρ = 0.94, SD ρ = 0.02; p ≤ 0.001). Thus, to simplify analyses and conform to our hub/periphery hypotheses, we clustered the four centrality values using Ward's minimum variance 78, which consistently divided voxels in each time window into two groups (M = 2.00, SD = 0.02) across all participants, one low-centrality (labelled 1) and one high-centrality cluster (labelled 2). In rare cases, the clustering method detected more than two clusters of centrality; because the vast majority of windows divided centrality values into two groups, we recomputed these few outlier time windows by forcing the data into two clusters in order to investigate spatial variations in connectivity over time.
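The high/low split can be sketched as follows (scikit-learn's Ward linkage is shown; the text does not specify the clustering implementation actually used):

```python
# Split voxels into high- (2) and low- (1) centrality clusters per window.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def split_centrality(values):
    # values: (n_voxels, 4) array of degree, eigenvector, betweenness, and
    # closeness centrality for one time window.
    labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(values)
    means = [values[labels == k].mean() for k in (0, 1)]
    high = int(np.argmax(means))
    return np.where(labels == high, 2, 1)   # 2 = high, 1 = low centrality
```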
Next, we divided the resulting windowed centrality time series into a high and a low centrality time series for each participant and averaged across all windows. The resulting maps were used to conduct a linear mixed effects model with '3dLME' (described above) 172 . The fixed effects were centrality (high and low), age, gender, and film with participant as a random effect. A correction for multiple comparisons using a cluster-size correction with multiple voxel-wise thresholds was applied (again, as described above, using α = 0.01).
We then submitted the full windowed high and low centrality time series from all participants to a group spatial independent component analysis (ICA) using the default parameters of 'MELODIC' (version 3.15), reducing to a 100-dimensional subspace 179. These results were then used in a dual regression to estimate a version of each of the 100 resulting components for each participant so that group statistics could be calculated. Specifically, this approach first regresses the group spatial maps into each participant's time series to give a set of timecourses. These timecourses are then regressed onto the time series to get 100 participant-specific spatial maps for both high and low centrality. These were then directly contrasted using a t-test and thresholded using a voxel-wise z-value equivalent to p ≤ 0.0001 (i.e., 0.01 divided by the number of components) and a minimum cluster threshold of 20 voxels (540 microliters) to protect against multiple comparison issues. Finally, we computed the spatial correlation of each of these contrasts with words from the above 3dLME analysis to locate a hub or hubs corresponding to words.
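The two regression stages can be sketched as follows (a minimal NumPy illustration; the analysis used FSL's MELODIC and dual-regression tooling, and normalisation details are omitted here):

```python
# Minimal dual-regression sketch: group maps -> timecourses -> subject maps.
import numpy as np

def dual_regression(group_maps, subject_ts):
    # Stage 1: regress group spatial maps (n_components x n_voxels) into the
    # subject's time series (n_timepoints x n_voxels) -> component timecourses.
    tcs = np.linalg.lstsq(group_maps.T, subject_ts.T, rcond=None)[0].T
    # Stage 2: regress those timecourses onto the time series -> one
    # subject-specific spatial map per component.
    maps = np.linalg.lstsq(tcs, subject_ts, rcond=None)[0]
    return tcs, maps    # (n_timepoints x k), (k x n_voxels)
```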
We also averaged each time window across participants for '500 Days of Summer' and 'Citizenfour' separately to conduct two additional MELODIC ICAs, again with 100 dimensions, with resulting components thresholded using a mixture modelling approach 179 . We did not do this on the other eight movies because they had only six participants each. We also conducted a similar ICA analysis on the centrality maps directly thresholded at 90% to define high centrality regions across all 86 participants.
This was done to provide evidence that results were not dependent on the more categorical high/low centrality clustering method used for all of the primary analyses. All other analyses are as described in the Results section. For display purposes, we also conducted affinity propagation clustering as described in the Supplementary Figure S2 and S3 captions 180,181 .