Limits on prediction in language comprehension: A multi-lab failure to replicate evidence for probabilistic pre-activation of phonology

Mante S. Nieuwland; Stephen Politzer-Ahles; Evelien Heyselaar; Katrien Segaert; Emily Darley; Nina Kazanina; Sarah Von Grebmer Zu Wolfsthurn; Federica Bartolozzi; Vita Kogan; Aine Ito; Diane Mézière; Dale J. Barr; Guillaume Rousselet; Heather J. Ferguson; Simon Busch-Moreno; Xiao Fu; Jyrki Tuomainen; Eugenia Kulakova; E. Matthew Husband; David I. Donaldson; Zdenko Kohút; Shirley-Ann Rueschemeyer; Falk Huettig

doi:10.1101/111807

Abstract

In the last few decades, the idea that people routinely and implicitly predict upcoming words during language comprehension has turned from a controversial hypothesis into a widely accepted assumption. Current theories of language comprehension^1–3 posit prediction, or context-based pre-activation, as an essential mechanism occurring at all levels of linguistic representation (semantic, morpho-syntactic and phonological/orthographic) and facilitating the integration of words into the unfolding discourse representation. The strongest evidence to date for phonological pre-activation comes from DeLong, Urbach and Kutas⁴, who monitored participants’ electrophysiological brain responses as they read sentences, presented one word at a time, with expected/unexpected indefinite article + noun combinations such as “The day was breezy so the boy went outside to fly a kite/an airplane”. The sentences varied expectations (‘cloze’ probability) for a consonant- or vowel-initial noun, as determined in a sentence-completion task with a separate group of participants. As expected, the amplitude of the N400 event-related potential (ERP) decreased (became less negative) with increasing cloze, reflecting ease of processing^5–6. The decreased N400 at the noun could be due either to its pre-activation or to high-cloze nouns being easier to integrate. Crucially, however, N400s at the immediately preceding article a or an showed the same relationship with cloze, i.e., encountering an indefinite article that mismatched a highly expected word (e.g., an when expecting kite) also elicited a larger N400. This led to the claim that participants pre-activated highly expected nouns, including their initial phonemes, based on the preceding context, with larger N400s on mismatching articles reflecting disconfirmation of this prediction.

The DeLong et al. study warranted stronger conclusions than related results available at the time. Unlike previous work, it did not rely on prior visual depiction of upcoming nouns, it clearly de-confounded prediction and integration effects, and it tested for graded phonological pre-activation of specific word form. Correspondingly, the study has been enthusiastically received as strong evidence for probabilistic phonological pre-activation, receiving over 650 citations to date and featuring in authoritative reviews^2–3. However, there is good cause to question the soundness of the original finding (and the appropriateness of the analysis used). Attempts to replicate the critical article-effect have failed⁷. Moreover, an earlier, alternative analysis of the same data by the authors⁸ failed to reach statistical significance, but was omitted from the published report.

To obtain more definitive evidence, we conducted a direct replication study spanning 9 laboratories (N_total = 334). We pre-registered one replication analysis that was faithful to the original, and one single-trial analysis that modeled subject- and item-level variance using linear mixed-effects models. Applying the replication analysis to our article data (Figure 1a), the original finding did not replicate: no laboratory observed a significant negative relationship between cloze and N400 at central-parietal electrodes. In contrast, the negative relationship was successfully replicated for the nouns: 6 laboratories observed such an effect and 2 laboratories observed relatively strong but non-significant effects in the expected direction (range r = .30 to .50). In the single-trial analysis (Fig. 1b-c), there was no statistically significant effect of cloze on article-N400s, also with stricter control for pre-article voltage levels (Supplementary Fig. 1). Crucially, there was a strong and significant cloze effect on noun-N400s (in all laboratories), which was significantly different from that on article-N400s. We observed no significant differences between laboratories for article or noun effects. Exploratory Bayesian analyses with priors based on DeLong et al. further support our conclusions (Fig. 1d, Supplementary Fig. 2). Finally, a control experiment confirmed our participants’ sensitivity to the a/an rule during online language comprehension (Supplementary Fig. 3).
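To illustrate the logic of the correlation-based replication analysis described above, the following minimal Python sketch computes Pearson’s r between cloze and mean ERP amplitude at a single channel. It is not the authors’ analysis code. The ten cloze bins and all amplitude values are invented for illustration (the binning scheme itself is an assumption not stated in this text), and the names pearson_r, cloze_bins, noun_amplitude and article_amplitude are hypothetical. Because a smaller N400 corresponds to a less negative voltage, a cloze effect of the kind reported for nouns appears here as a positive r.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# ASSUMPTION: ten cloze bins (midpoints, in %) for one hypothetical channel.
cloze_bins = [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]

# ASSUMPTION: invented mean amplitudes (microvolts) in the N400 time window.
# Noun amplitudes become less negative as cloze increases.
noun_amplitude = [-1.0, -0.8, -0.9, -0.5, -0.4, -0.2, 0.1, 0.0, 0.4, 0.6]
# Article amplitudes show no systematic relationship with cloze.
article_amplitude = [0.2, -0.1, 0.1, 0.0, 0.2, -0.2, 0.1, 0.0, -0.1, 0.1]

print("noun r:", round(pearson_r(cloze_bins, noun_amplitude), 3))
print("article r:", round(pearson_r(cloze_bins, article_amplitude), 3))
```

In this toy example the noun amplitudes track cloze closely while the article amplitudes do not, which mirrors the qualitative pattern reported above but says nothing about the actual data.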

Figure 1

A multi-lab failure to replicate evidence for probabilistic pre-activation of phonology. (a) Pre-registered replication analysis: Pearson’s r correlations between ERP amplitude and article/noun cloze probability per EEG channel (* P < 0.05) and per laboratory. (b, c) Pre-registered single-trial analysis: (b) Grand-average ERPs elicited by relatively expected and unexpected words (cloze higher/lower than 50%) at electrode Cz, with standard deviations shown as dotted lines, and (c) the relationship between cloze and N400 amplitude as illustrated by the mean ERP values per cloze value (number of observations reflected in circle size), along with the regression line and 95% confidence interval. A change in article cloze from 0 to 100 is associated with a change in amplitude of 0.296 µV (95% confidence interval: −.08 to .67), χ²(1) = 2.31, p = .13. A change in noun cloze from 0 to 100 is associated with a change in amplitude of 2.22 µV (95% confidence interval: 1.75 to 2.69), χ²(1) = 56.5, p < .001. The effect of cloze on noun-N400s was statistically different from its effect on article-N400s, χ²(1) = 31.38, p < .001. (d) Bayes factor analysis associated with the replication analysis, quantifying the obtained evidence for the null hypothesis (H₀) that N400 is not impacted by cloze, or for the alternative hypothesis (H₁) that N400 is impacted by cloze with the size and direction of effect reported by DeLong et al. Scalp maps show the common logarithm of the replication Bayes factor for each electrode, capped at log(100) for presentation purposes. Electrodes that yielded at least moderate evidence for or against the null hypothesis (Bayes factor of ≥ 3) are marked by an asterisk. At posterior electrodes where DeLong et al. found their effects, our article data yielded strong to extremely strong evidence for the null hypothesis, whereas our noun data yielded extremely strong evidence for the alternative hypothesis (upper graphs). These results were also found when applying a 500 ms pre-word baseline correction (lower graphs).
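As a rough arithmetic check on the statistics reported in the caption, the sketch below converts each χ²(1) value into an upper-tail p-value using the identity p = erfc(√(χ²/2)), which holds for one degree of freedom, and expresses the reported 0-to-100 amplitude changes as a change per cloze percentage point. The function name chi2_df1_p_value and the variable names are hypothetical, and this is not the authors’ analysis code.

```python
import math


def chi2_df1_p_value(statistic):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    if statistic < 0:
        raise ValueError("chi-square statistic must be non-negative")
    # For df = 1, P(X > x) = 2 * (1 - Phi(sqrt(x))) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))


# Chi-square values as reported in the Figure 1 caption.
reported = {"article": 2.31, "noun": 56.5, "article vs. noun": 31.38}
for label, stat in reported.items():
    print(f"{label}: chi2(1) = {stat}, p = {chi2_df1_p_value(stat):.3g}")

# Reported amplitude change (microvolts) for a cloze change from 0 to 100,
# rescaled to a change per cloze percentage point.
article_slope_per_point = 0.296 / 100
noun_slope_per_point = 2.22 / 100
print("article µV per cloze point:", article_slope_per_point)
print("noun µV per cloze point:", noun_slope_per_point)
```

The article statistic maps onto a p-value that rounds to the reported .13, and both other statistics fall well below .001, consistent with the caption.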

Despite a sample size 10 times larger than the original and improved statistical analysis, we observed no statistically significant effect of cloze on article-N400s, while replicating the strong and statistically significant effect of cloze on noun-N400s^4,6. The effect of cloze on article-N400s, if existent, must be very small to evade detection given our expansive approach. Whether such an effect would constitute convincing evidence for routine phonological pre-activation as assumed in theories of language comprehension³ can be questioned, but, more generally, such an effect cannot be meaningfully studied in typical small-scale studies. Consequently, current theoretical positions may be based on potentially unreliable findings and require revision. In particular, the strong prediction view that claims that pre-activation routinely occurs across all – including phonological – levels³, can no longer be viewed as having strong empirical support.

Our results do not constitute evidence against prediction in general. We note a lack of convincing evidence specifically for phonological pre-activation, which would have to be measured before a noun appears and unobscured by processes instigated by the noun itself.

However, our results neither support nor necessarily exclude phonological pre-activation. Unlike gender-marked articles⁹ (e.g., in Dutch or Spanish) that agree with nouns irrespective of intervening words, English a/an articles index the subsequent word, which is not always a noun. Maybe our participants did not use mismatching articles to disconfirm predicted nouns, possibly because it was not a viable strategy (American and British English corpus data show a mere 33% chance that a noun follows such articles). Perhaps a revision of the predicted meaning is required to trigger differential ERPs.

DeLong et al. recently described filler-sentences in their experiment^{10, cf. 7}, which were omitted from their original report, and were neither provided nor mentioned to us upon our request for their stimuli. DeLong et al. used the existence of these filler-sentences to dismiss an alternative explanation of their results, namely that an unusual experimental context wherein every sentence contains an article-noun combination leads participants to strategically predict upcoming nouns. Importantly, we failed to replicate their article-effects despite an experimental context that could inadvertently encourage strategic prediction. Therefore, the difference between their experiment and ours cannot explain the different results, and may even strengthen our conclusions.

In sum, our findings do not support a strong prediction view involving routine and probabilistic pre-activation of phonological word form based on preceding context.

Moreover, our results further highlight the importance of direct replication, large-sample studies, transparent reporting and pre-registration for advancing reproducibility and replicability in the neurosciences.
