
Brain and Language

Volume 162, November 2016, Pages 42-45

Short Communication
A common misapplication of statistical inference: Nuisance control with null-hypothesis significance tests

https://doi.org/10.1016/j.bandl.2016.08.001

Highlights

  • Researchers use statistical tests of stimulus or subject characteristics to “control for confounds”.

  • This practice is conceptually misguided and pragmatically useless.

  • We discuss the problem and alternatives.

Abstract

Experimental research on behavior and cognition frequently rests on stimulus or subject selection where not all characteristics can be fully controlled, even when attempting strict matching. For example, when contrasting patients to controls, variables such as intelligence or socioeconomic status are often correlated with patient status. Similarly, when presenting word stimuli, variables such as word frequency are often correlated with primary variables of interest. One procedure very commonly employed to control for such nuisance effects is conducting inferential tests on confounding stimulus or subject characteristics. For example, if word length is not significantly different for two stimulus sets, they are considered matched for word length. Such a test has high error rates and is conceptually misguided. It reflects a common misunderstanding of statistical tests: interpreting significance as referring not to inference about a particular population parameter, but to (1) the sample in question, or (2) the practical relevance of a sample difference (so that a non-significant test is taken to indicate evidence for the absence of relevant differences). We show inferential testing for assessing nuisance effects to be inappropriate both pragmatically and philosophically, present a survey showing its high prevalence, and briefly discuss an alternative in the form of regression including nuisance variables.

Introduction

Methods sections in many issues of Brain & Language and similar journals feature sentences such as

Animate and inanimate words chosen as stimulus materials did not differ in word frequency (p > 0.05).

Controls and aphasics did not differ in age (p > 0.05).

In the following, we discuss the inappropriateness of this practice. A common problem in brain and behavioral research, where the experimenter cannot freely determine every stimulus and participant characteristic, is the control of confounding/nuisance variables. This is especially common in studies of language. Typically, word stimuli cannot be constructed out of whole cloth, but must be chosen from existing words (which differ in many aspects). Stimuli are processed by subjects in the context of a rich vocabulary, and subject populations have usually been exposed to very diverse environments and events in their acquisition of language. A similar problem exists, for example, when comparing controls to specific populations, such as bilingual individuals or slow readers. The basic problem researchers face is thus to avoid reporting, e.g., an effect of word length or bilingualism when the effect truly stems from differences in word frequency or socioeconomic status, which may be correlated with the variable of interest. A prevalent method we find in the literature, namely inferential null-hypothesis significance testing (NHST) of stimuli, fails to perform the necessary control.

Often, researchers attempt to demonstrate that stimuli or participants were selected so as to concentrate their differences on the variable of interest, i.e. to reduce confounds, by conducting null-hypothesis tests such as t-tests or ANOVAs on the potential confound, in addition to or even instead of reporting descriptive statistics in the form of measures of location and scale. The underlying intuition is that these tests establish whether two conditions differ in a given aspect and thus serve as proof that the conditions are “equal” on it. This rests, in turn, on the related but equally incorrect intuition that significance in NHST establishes that a contrast shows a meaningful effect, and on its converse, that non-significant tests indicate the absence of meaningful effects.

In practice, we find non-significant tests are used as a necessary (and often sufficient) condition for accepting a stimulus set as “controlled”. This approach fails on multiple levels.

  • Philosophically, these tests are inferential tests being performed on closed populations, not random samples of larger populations. Statistical testing attempts to make inferences about a larger population based on randomly selected samples. Here, the “samples” are not taken randomly, and we are not interested in the population they are drawn from, but in the stimuli or subjects themselves. For example, in a study on the effects of animacy in language processing, we do not care whether the class of animate nouns in the language is on average more frequent than the class of inanimate nouns. Instead, we care whether the selection of animate nouns in our stimuli is on average more frequent than the selection of inanimate nouns in our stimuli. But inferential tests answer the former question, not the latter: they refer to the population of stimuli that will largely not be used, or to the population of subjects that will not be investigated in the study.

  • Pragmatically, beyond being inappropriate, this procedure does not test a hypothesis of interest. It tests the null hypothesis that “the populations these stimuli were sampled from do not differ in this feature”, whereas what we are actually interested in is “the difference in this feature between conditions is not responsible for any observed effects”. In other words, this procedure tests whether the conditions differ in a certain respect to a measurable degree, but not whether that difference has any meaningful influence on the result.

  • Additionally, these tests carry all the usual problems of null-hypothesis significance testing (cf. Cohen, 1992), including its inability to “accept” the null hypothesis directly. This means that even if the conditions do not differ significantly, we cannot accept the hypothesis that they do not differ; we can only say that there is not enough evidence to exclude this hypothesis (which we are not actually interested in). In typical contexts (e.g. setting the Type I error rate to the conventional 5% level), the power to reject the null hypothesis of no differences is low (Button et al., 2013) due to a small number of items, meaning that even comparatively large differences may go undetected, while in large stimulus sets, even trivially small differences may lead to rejection. Especially with small samples (e.g., 10 subjects per group, or 20 items per condition), the probability of detecting moderate confound effects is thus low: even if there are substantial differences, the tests will often not reject the null hypothesis, and stimulus sets may be accepted as balanced on the basis of a test with little chance of flagging even moderately imbalanced samples of that size (see the sketch following this list).
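A minimal sketch of this power problem, assuming 20 items per condition and a confound that genuinely differs by 0.5 standard deviations between conditions (illustrative values, not taken from any particular study):

    # Illustrative sketch: power of an item-level t-test to detect a moderate
    # confound difference with 20 items per condition (assumed values).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_items = 20       # items per condition (assumed)
    true_diff = 0.5    # confound differs by 0.5 SD between conditions (assumed)
    n_sims = 10_000

    rejections = 0
    for _ in range(n_sims):
        cond_a = rng.normal(0.0, 1.0, n_items)        # e.g. log word frequency, condition A
        cond_b = rng.normal(true_diff, 1.0, n_items)  # condition B, genuinely shifted
        _, p = stats.ttest_ind(cond_a, cond_b)
        rejections += p < 0.05

    print(f"Power to flag the 0.5 SD confound difference: {rejections / n_sims:.2f}")
    # Roughly 0.33 under these assumptions: about two thirds of genuinely
    # imbalanced stimulus sets would pass such a "matching" test.

Under these assumptions the test misses the imbalance about two times out of three; conversely, with hundreds of items per condition, the same test would flag differences far too small to matter.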

In other words, these tests are incapable of actually informing us about the influence of potential confounds, but may give researchers a false sense of security. This inferential stage offers no benefit beyond examining the descriptive measures of location and scale (e.g. mean and standard deviation) and determining whether the stimulus groups are “similar enough”. For perceptual experiments, there may even be established discrimination thresholds below which differences are considered indistinguishable. A preferred approach is to examine directly to what extent these potential confounds influence the results, for example by including them in the statistical model. This is often readily implemented via multiple regression, particularly “mixed-effect” approaches (Fox, 2016, Gelman and Hill, 2006).
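As a minimal sketch of what such a model can look like (the data file and the column names rt, animacy, log_freq and subject are hypothetical), the nuisance variable simply enters the regression alongside the manipulation of interest:

    # Illustrative sketch: account for a nuisance variable in the analysis model
    # instead of "testing it away" beforehand. Column names are hypothetical.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("trial_data.csv")  # hypothetical long-format trial data

    # Random-intercept mixed model: the animacy effect is estimated while
    # word frequency is explicitly modeled, rather than assumed to be "matched".
    model = smf.mixedlm("rt ~ animacy + log_freq", data=df, groups=df["subject"])
    result = model.fit()
    print(result.summary())

Fully crossed random effects for subjects and items, as is standard in psycholinguistics, are more conveniently fit with dedicated mixed-model packages such as lme4 in R; the sketch above only illustrates the principle of modeling the confound rather than testing it.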

In the context of baseline differences between treatment and control groups in clinical trials, a similar debate has been waged (e.g. Senn, 1994) under the term “randomization check”, referring to checking whether the assignment of subjects to treatments has truly been performed randomly. In interventional clinical trials, assignment can indeed be truly random (unlike in the kind of study in the brain and behavioral sciences we are referring to here). Yet even there, inferential tests have been judged inappropriate for achieving their intended aims. Nonetheless, the clinical trial literature provides important considerations for experimental design choices, e.g. the proper way of blocking and matching (Imai, King, & Stuart, 2008), and can thus inform the preparation of stimulus sets or participant groups even for non-clinical experiments.

Section snippets

Prevalence

We performed a literature survey of neurolinguistic studies to estimate the prevalence of inferential tests of nuisance variables (see below for further details).

Simulation

We performed a simulation to investigate the impact of inferential tests of confounding variables. In particular, we find that when the correlation between the confounding covariate and the outcome measure is not perfect, testing covariates (instead of their effect on the outcome variable) can lead to unnecessary rejections of manipulations as “confounded” in 50% or more of studies, even for large effects.
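One way such a simulation could be sketched (all parameter values below are illustrative assumptions, not necessarily the settings used here): the covariate differs moderately between conditions but is only weakly related to the outcome, while the manipulation itself has a large effect.

    # Illustrative sketch of a simulation in this spirit (assumed parameters).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    n_items, n_sims = 40, 5_000   # items per condition, simulated studies (assumed)
    cov_shift = 0.5               # covariate differs by 0.5 SD between conditions
    beta_cov = 0.2                # weak covariate-outcome relation (imperfect correlation)
    beta_cond = 1.0               # large, genuine manipulation effect

    rejected_as_confounded = 0
    adjusted_estimates = []
    cond = np.repeat([0, 1], n_items)
    for _ in range(n_sims):
        x = rng.normal(cov_shift * cond, 1.0)                               # nuisance covariate
        y = beta_cond * cond + beta_cov * x + rng.normal(0.0, 1.0, 2 * n_items)
        # The criticized practice: test the covariate itself and discard the
        # stimulus set if the difference is "significant".
        _, p = stats.ttest_ind(x[cond == 0], x[cond == 1])
        rejected_as_confounded += p < 0.05
        # The recommended alternative: estimate the manipulation effect while
        # adjusting for the covariate (ordinary least squares here for brevity).
        X = np.column_stack([np.ones_like(x), cond, x])
        adjusted_estimates.append(np.linalg.lstsq(X, y, rcond=None)[0][1])

    print(f"Studies discarded as 'confounded': {rejected_as_confounded / n_sims:.2f}")
    print(f"Mean adjusted estimate of the manipulation effect: {np.mean(adjusted_estimates):.2f}")
    # Around 0.6 of the designs would be discarded under these assumptions, even
    # though the adjusted analysis recovers the true effect (1.0) essentially unbiased.

Note that the covariate test never consults the outcome at all, which is exactly why it cannot tell us whether the confound matters for the results.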

Discussion and recommendation

In sum, NHST control of nuisance variables is prevalent and inappropriate, based on a flawed application of statistics to an irrelevant hypothesis. Proper nuisance control (of known and measurable variables) is not complex, although it can require more effort and computer time.

Researchers should still use descriptive statistics to demonstrate the success of balancing. That is, quantifying e.g. differences between stimuli via variances, raw means and standardized mean differences (Cohen’s d), and …
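A minimal sketch of such a descriptive summary (file name, condition labels and nuisance variables below are hypothetical):

    # Illustrative sketch: descriptive matching summary per nuisance variable,
    # instead of a significance test on the nuisance variable itself.
    import numpy as np
    import pandas as pd

    stimuli = pd.read_csv("stimuli.csv")  # hypothetical: one row per word

    def cohens_d(a, b):
        """Standardized mean difference using the pooled standard deviation."""
        n1, n2 = len(a), len(b)
        pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
        return (a.mean() - b.mean()) / np.sqrt(pooled_var)

    for var in ["log_freq", "length", "concreteness"]:  # hypothetical nuisance variables
        a = stimuli.loc[stimuli.condition == "animate", var]
        b = stimuli.loc[stimuli.condition == "inanimate", var]
        print(f"{var:>12}: M = {a.mean():.2f} / {b.mean():.2f}, "
              f"SD = {a.std(ddof=1):.2f} / {b.std(ddof=1):.2f}, d = {cohens_d(a, b):.2f}")

Whether a given standardized difference is “small enough” is then a substantive judgment, not the outcome of a significance test.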

Survey

The analysis was restricted to current volumes. For all articles published by B&L from 2011 to the 3rd issue of 2013, three raters (not blinded to the purpose of the experiment) investigated all published experimental papers (excluding reviews, simulation studies, editorials etc.). For each experiment reported in a study, the stimulus/materials sections were investigated for descriptive and inferential statistics derived from populations that were exhaustively sampled without error. If a …

Acknowledgements

We thank Sarah Tune for helpful discussion and Tal Linzen for bringing to our attention the randomization check literature; Katja Starikova, Miriam Burk and Antonia Götz are to be thanked for collecting the survey data. This work was supported in part by the German Research Foundation (BO 2471/3-2) and by the European Research Council (ERC) Grant 617891.


Cited by (54)

  • Differential temporo-spatial pattern of electrical brain activity during the processing of abstract concepts related to mental states and verbal associations

    2022, NeuroImage
    Citation Excerpt:

    Firstly, although MST and VAS abstract concepts were closely matched for a variety of conceptual and linguistic variables, there were some slight non-significant differences in particular with regard to lemma frequency and number of letters. Note that the use of statistical tests to evaluate the matching of word sets, a common procedure in psycholinguistic research, has been recently criticized for various reasons, in particular because it rests on the statistically problematic acceptance of the null hypothesis (Sassenhagen and Alday, 2016). Furthermore, the variation of feature content in the present word sets with regard to mental states and verbal associations is gradual or subtle, and the number of stimuli per category (n = 30) is relatively low.

  • Anticipating words during spoken discourse comprehension: A large-scale, pre-registered replication study using brain potentials

    2020, Cortex
    Citation Excerpt:

    In the original studies, which used ANOVAs, the absence/presence of inflection was approximately balanced across items. In the current study, however, this factor is explicitly accounted for in the model (see Sassenhagen & Alday, 2016), using a more powerful analysis that simultaneously takes into account sources of variance (subjects, items, presence of inflection) that were not included in the original study's ANOVAs. This is important because, not only may the effect of match differ for different-gender adjectives (see the pre-registered exploratory analysis), unaccounted-for variation that is orthogonal to the effect of interest (e.g., random intercept variation) can reduce power, while unaccounted-for variation that is confounded with our effect of interest (e.g., random slope variation) can drive differences between means, with increased risk of false positives and overestimation of effect size (for discussion, see Barr, Levy, Scheepers, & Tily, 2013).
