## Abstract

We have empirically assessed the distribution of published effect sizes and estimated power by extracting more than 100,000 statistical records from about 10,000 cognitive neuroscience and psychology papers published during the past five years. The reported median effect size was d=0.93 (inter-quartile range: 0.64-1.46) for nominally statistically significant results and d=0.24 (0.11-0.42) for non-significant results. Median power to detect small, medium and large effects was 0.12, 0.44 and 0.73, reflecting no improvement over the past half-century. Power was lowest for cognitive neuroscience journals. 14% of papers reported some statistically significant results although the respective F statistic and degrees of freedom showed that these were non-significant; p value errors positively correlated with journal impact factors. False report probability is likely to exceed 50% for the whole literature. In light of our findings, the recently reported low replication success in psychology is realistic, and worse performance may be expected for cognitive neuroscience.

## Introduction

Low power and selection biases, questionable research practices and errors favouring the publication of statistically significant results have been proposed as major contributing factors to the reproducibility crisis heavily debated in many scientific fields (Nosek et al. 2015a,b; Ioannidis et al. 2014a,b; Ioannidis 2005). Here, we aimed to characterize recent publication practices in the closely related cognitive neuroscience and (mostly experimental) psychology literature. To this end we extracted more than 100,000 records of degrees of freedom, t values, F values and p values from papers published between Jan 2011 and Aug 2014 in 18 journals. Journal impact factors ranged from 2.367 (Acta Psychologica) to 17.15 (Nature Neuroscience). The data allowed us to assess the distribution of published effect sizes, to estimate the power of studies, to estimate the lower limit of false report probability and to confirm recent reports about p value reporting errors. The text-mining approach we used enabled us to conduct a larger power survey than classical studies.

Low power is usually only associated with failing to detect existing (true) effects and therefore with wasting research funding on studies which a priori have a low chance of achieving their objective. However, low power also has two other serious negative consequences: it results in the exaggeration of measured effect sizes and it boosts false report probability, the probability that statistically significant findings are false (Ioannidis 2005; Button et al. 2013; Pollard and Richardson, 1987; Berger and Sellke, 1987).

First, if we use Null Hypothesis Significance Testing (NHST), then published effect sizes are likely to be, on average, substantially exaggerated when most studies in a given scientific field have low power (Button et al. 2013; Sedlmeier and Gigerenzer, 1989; Rossi 1990; Cohen 1962; Hallahan and Rosenthal, 1996; note that NHST is an amalgamation of Fisher's significance testing method and the Neyman-Pearson theory; however, the concept of power is only interpreted in the Neyman-Pearson framework; for extended discussion see Gigerenzer and Marewski, 2015; Gigerenzer et al. 1990). This is because even if we assume that there is a fixed true effect size, the effect sizes actually measured in studies will vary due to sampling error and measurement noise. Underpowered studies will classify as statistically significant only the occasional large upward deviations from the real effect size; conversely, most measured effects will remain under the statistical significance threshold even when they reflect true relationships (Schmidt, 1992; Sterling et al. 1995; Ioannidis, 2008; see **Supplementary Figure 1A**). Because many “negative” findings are not published, even meta-analyses may exaggerate effects. Effect size inflation is greater the more underpowered the studies are.
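The inflation mechanism described above can be illustrated with a small simulation (a Python sketch of our own; the true effect size, group size and number of simulated studies are arbitrary choices): when a fixed true effect of d=0.3 is studied with 20 participants per group, only unusually large measured effects cross the significance threshold, so the significant effects are strongly exaggerated on average.

```python
import math
import random

random.seed(1)

def simulate_inflation(true_d=0.3, n=20, reps=20000, t_crit=2.024):
    """Simulate many two-group studies with a fixed true effect and return
    the mean observed Cohen's d among 'significant' studies, plus the
    rejection rate. t_crit = 2.024 is the two-sided 5% cutoff for df=38."""
    sig_d = []
    for _ in range(reps):
        g1 = [random.gauss(true_d, 1.0) for _ in range(n)]
        g2 = [random.gauss(0.0, 1.0) for _ in range(n)]
        m1, m2 = sum(g1) / n, sum(g2) / n
        v1 = sum((x - m1) ** 2 for x in g1) / (n - 1)
        v2 = sum((x - m2) ** 2 for x in g2) / (n - 1)
        d = (m1 - m2) / math.sqrt((v1 + v2) / 2)   # observed effect size
        t = d * math.sqrt(n / 2)                   # t statistic, equal groups
        if t > t_crit:                             # significant in the upper tail
            sig_d.append(d)
    return sum(sig_d) / len(sig_d), len(sig_d) / reps

mean_sig_d, sig_rate = simulate_inflation()
```

With these settings only a small fraction of simulated studies reach significance, and the mean significant effect is more than double the true d=0.3, because only measured effects above roughly d=0.64 can clear the t cutoff.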

Second, within the NHST framework the long-run False Report Probability (FRP) can be defined as the probability that the null hypothesis is true when we obtain a statistically significant finding. The long-run True Report Probability (TRP) can analogously be defined as the probability that the alternative hypothesis is true when we obtain a statistically significant finding. Computationally, FRP is the number of statistically significant false positive findings divided by the total number of statistically significant findings; TRP is the number of statistically significant true positive findings divided by the same total. FRP and TRP can be computed by applying Bayes' theorem (see **Supplementary Material Section 4** for details; note that the concepts of FRP and TRP do not exist within the NHST framework itself). In a simple case FRP can be computed as

FRP = αO / (αO + (1 − β))

where O stands for the H_{0}:H_{1} odds, α denotes the statistical significance level (which is nearly always α=0.05) and 1 − β denotes power. So, for given values of O and α, FRP is higher when power is low. As O is very difficult to ascertain in practice, high power provides the most straightforward ‘protection’ against excessive FRP (Ioannidis 2005; Button et al. 2013; Pollard and Richardson, 1987).
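This simple-case relation can be expressed in a few lines of code (a Python sketch of our own; the function name is ours):

```python
def frp(power, odds=1.0, alpha=0.05):
    """False report probability: the share of statistically significant
    findings that are false positives, given H0:H1 odds and alpha.
    Simple case with no experimenter bias."""
    return alpha * odds / (alpha * odds + power)

# FRP rises as power falls and as the H0:H1 odds rise
low = frp(0.8)             # well-powered study, even hypothesis odds
high = frp(0.2, odds=5.0)  # underpowered, exploratory setting
```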

Because published effect sizes are likely to be inflated, it is most informative to determine the power of studies to detect pre-defined effect sizes. Hence, we first computed power from the observed degrees of freedom (using supporting information from manually extracted records) to detect effect sizes traditionally considered small (d=0.2), medium (d=0.5) and large (d=0.8) (Sedlmeier and Gigerenzer, 1989; Rossi 1990; Cohen 1962). Second, we also computed power to detect the effect sizes actually reported across studies. Given that many of these effect sizes are likely to be inflated relative to the true ones (as explained above), this enabled us to estimate the lower limit of FRP (Ioannidis, 2005; Berger and Sellke, 1987). Further, we also examined p value reporting errors in the cognitive neuroscience literature, as recent studies found that 12.5% to 20% of psychology papers have major p value reporting errors that affect judgments about statistical significance (Bakker and Wicherts, 2011; Nuijten et al. 2015; Veldkamp et al. 2014).

## Results and Discussion

The highly satisfactory performance of the extraction algorithm is reflected by the similarity between the distributions of automatically and manually extracted degrees of freedom (which reflect the sample sizes of the studies; e.g. for an independent-sample t-test, the degrees of freedom are the total sample size minus 2) (**Fig. 1A**). During the validation process we assessed the proportion of one-sample, matched and two-sample t-tests, which enabled us to use a mixture model to compute published effect sizes and power. The distribution of the computed effect sizes showed a near perfect match to the effect size distribution determined from records where D values were available, which suggests that the mixture model we used approximates well the proportion of one-sample and two-sample t-tests (**Fig. 1B**). The computed D value distribution was slightly shifted to the right relative to the reported D value distribution, but both the medians (computed = 0.653; reported = 0.660) and means (computed = 0.938; reported = 0.889) were very similar.

**Fig. 1C.** shows the bivariate distribution of D values and degrees of freedom and marks the mean and median effect sizes for statistically significant and non-significant records and for the whole dataset. The large discrepancy between statistically significant and non-significant results is clear (medians, 25th and 75th quantiles for statistically significant and non-significant D values, respectively: d=0.932 [0.637-1.458]; d=0.237 [0.106-0.421]). **Fig. 1D.** shows the t value distribution associated with the effect sizes. It is evident that a very large proportion of non-significant results are not reported at all (compare the shape of the distribution to the expected shapes shown in **Supplementary Figure 1B**; see also **Supplementary Figure 2B** for the distribution of p values).

For a given effect size, power is determined by sample size, which in turn determines the degrees of freedom. The median of degrees of freedom was 20 for statistically significant and 19 for statistically non-significant results (mode=15 for both). Subfields showed large differences in degrees of freedom, with most records having much lower degrees of freedom in cognitive neuroscience than in psychology and medicine (**Fig 1E.** 25th and 75th centiles for all records for cognitive neuroscience journals: df=10-28; psychology: df=17-60; medical journals: df=15-54). Reported effect sizes also differed markedly by subfield (25th and 75th centiles for all records for cognitive neuroscience journals: d=0.34-1.22; psychology: d=0.29-0.96; medical journals: d=0.23-0.91). The larger reported effect sizes in cognitive neuroscience may well be the consequence of effect size inflation due to fewer degrees of freedom, as the following power analyses suggest.

Median degrees of freedom and median effect sizes for each journal are depicted in **Fig 2A.** The blue crosses on the left show values for non-significant records. The red (cognitive neuroscience), green (psychology) and orange (medical) crosses show effect sizes for statistically significant records. Taking into account the reported degrees of freedom, we computed power (at α=0.05) for effect sizes traditionally considered small (d=0.2), medium (d=0.5) and large (d=0.8) (Rossi, 1990; Sedlmeier and Gigerenzer, 1989; Cohen, 1962).
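As a rough illustration of such a power computation (our own Python sketch using a normal approximation; the actual analysis used the noncentral t distribution in Matlab, so exact values differ slightly):

```python
from statistics import NormalDist

def approx_power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample t-test with equal
    group sizes, using the normal approximation to the noncentral t."""
    z = NormalDist()
    ncp = d * (n_per_group / 2) ** 0.5      # noncentrality parameter
    z_crit = z.inv_cdf(1 - alpha / 2)
    # P(reject H0 | true effect d); the opposite tail is negligible here
    return 1 - z.cdf(z_crit - ncp)

# Cohen's classic benchmark: d=0.5 needs about 64 per group for power ~0.8,
# while a typical df of ~20 (11 per group) gives little chance to detect d=0.2
power_medium = approx_power_two_sample(0.5, 64)
power_small = approx_power_two_sample(0.2, 11)
```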

The cumulative probability of records reaching a certain level of power is shown in **Fig. 2B.** Median and mean power are shown in **Table 1.** It is apparent that cognitive neuroscience studies had the lowest level of power. For example, to detect a small effect (d=0.2), 90% of cognitive neuroscience records had power < 0.234, a much worse chance of detecting a true effect than a coin flip (Sedlmeier and Gigerenzer, 1989). Importantly, the somewhat higher power in the journals we classified as more medically oriented was driven by the Journal of Psychiatric Research (JPR in **Fig. 2A;** median power to detect small, medium and large effects: 0.23, 0.74, 0.86), which includes more behavioural studies than the other two journals we classified as ‘medical’. These other two journals, more focused on neuroimaging, still performed better than cognitive neuroscience journals and at about the same level as psychology journals (median power to detect small, medium and large effects: 0.14, 0.53, 0.78).

A comparison with prominent older power surveys from 25 and more than 50 years ago showed that median power in psychology journals has increased for detecting medium-sized effects but remained about the same for small and large effects (see **Table 1**; Rossi, 1990; Sedlmeier and Gigerenzer, 1989; Cohen, 1962). Power for cognitive neuroscience, and for all subfields together, was lower than the median and mean power reported in 1962, more than half a century ago (Cohen, 1962). Journal impact factors negatively correlated with median power (correlations for small, medium and large effect sizes, respectively, with 95% accelerated, bias-corrected bootstrap confidence intervals [10^{5} resamples]: r=-0.42 [-0.63; -0.09]; -0.46 [-0.71; -0.09]; -0.45 [-0.77; -0.02]) because cognitive neuroscience journals had the highest impact factors in our sample but the worst power estimates.
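The bootstrap confidence intervals reported above can be illustrated with the simpler percentile bootstrap (our own Python sketch on made-up data; the actual analysis used the accelerated, bias-corrected variant with 10^{5} resamples):

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def percentile_bootstrap_ci(xs, ys, n_boot=10000, alpha=0.05, seed=7):
    """Percentile bootstrap CI for Pearson's r: resample (x, y) pairs with
    replacement and take the empirical alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(xs)
    rs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rs.append(pearson_r([xs[i] for i in idx], [ys[i] for i in idx]))
    rs.sort()
    return rs[int(n_boot * alpha / 2)], rs[int(n_boot * (1 - alpha / 2)) - 1]

# Made-up journal data: impact factor vs. median power (illustration only)
impact = [2.4, 3.1, 4.0, 4.9, 5.5, 6.8, 8.2, 10.1, 14.0, 17.2]
med_power = [0.55, 0.50, 0.52, 0.44, 0.40, 0.42, 0.30, 0.33, 0.25, 0.20]
lo, hi = percentile_bootstrap_ci(impact, med_power)
```

Here the entire interval stays well below zero, mirroring the negative impact-factor/power correlation reported above.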

FRP depends on power, the odds of true H_{0} to H_{1} data and on reporting bias (Ioannidis, 2005; Berger and Sellke, 1987). The continuous lines in **Fig. 3** estimate lower limits for FRP, using the probably highly inflated effect sizes computed from published t values to calculate power, for various H_{0}:H_{1} odds and bias values and for α = 0.05. In the best but unrealistic case of H_{0}:H_{1} odds = 1:1 and zero bias, FRP is 13.5%. A 10% bias pushes this to 23%. Staying in the still optimistic zone where one in every two to six hypotheses works out (1 ≤ H_{0}:H_{1} odds ≤ 5), and with relatively modest 10-30% experimenter bias, FRP is 23-71% (median = 51%). That is, roughly one quarter to three quarters of statistically significant results will be false positives. If we now move into the domain of more exploratory research, where even more experimental ideas are likely to be false (5 < H_{0}:H_{1} odds < 20; bias = 10-30%), FRP grows to at least 60-91% (median = 77%).

When comparing reported with recomputed p values from F tests, we found that in 2.3% (n=1546 of 68,762) of all results with explicitly reported p values, spread across 14% of papers (879 papers), the reported p values were statistically significant while the p values recomputed from the provided degrees of freedom and F statistic were not (e.g. a study reported numerator and denominator degrees of freedom of 2 and 45 for an F statistic of 3.13 and gave the p value as p≤0.05, whereas these values yield p=0.0534). This finding is very similar to previous such estimates (Bakker and Wicherts, 2011; Nuijten et al. 2015; Veldkamp et al. 2014). In 10% of these erroneous reports the p value error was larger than 0.01, and in about 20% of cases it was larger than 0.006 (see **Supplementary Figure 2**). In 15% of papers with reporting errors affecting nominal significance, more than 25% of all p value records were erroneously reported as significant rather than non-significant. Conversely, in 0.7% (n=487) of all reported results, from 323 papers, the computed p values were significant but the reported p values were non-significant. All journals and all three science subfields were affected by reporting errors, and both the proportion of articles with more favourable p value records (r=0.635 [95% bootstrap confidence interval: 0.17-0.87]) and the proportion of more favourable p value records (r=0.572 [95% bootstrap confidence interval: 0.10-0.93]) positively correlated with journal impact factors (see further details and results in **Supplementary Material, Section 5**).
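The example above can be checked directly. For a numerator df of 2 the upper tail of the F distribution has a simple closed form, so the check needs only the standard library (a Python sketch of our own; general F tests would use the regularized incomplete beta function, e.g. scipy.stats.f.sf):

```python
def f_pvalue_numerator_df2(F, df2):
    """P(F(2, df2) > F): closed-form upper tail, valid for numerator df = 2."""
    return (1.0 + 2.0 * F / df2) ** (-df2 / 2.0)

# The example from the text: F(2, 45) = 3.13 was reported as p <= 0.05
p = f_pvalue_numerator_df2(3.13, 45)   # actually just above 0.05
```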

## General discussion

The probability of research hypotheses depends on H_{0}:H_{1} odds, experimenter bias and power (Pollard and Richardson, 1987; Berger and Sellke, 1987; Ioannidis, 2005). H_{0}:H_{1} odds are inherent to each research field, and some biases are more recalcitrant and difficult to control than others, while power can in principle be increased easily by increasing sample size. However, despite its importance for the economical use of research funding, for the accurate estimation of effect sizes and for minimizing FRP, our data suggest that power in cognitive neuroscience and psychology papers is stuck at an unacceptably low level. Overall power has not improved during the past half-century (Cohen 1962; Sedlmeier and Gigerenzer, 1989; Rossi, 1990). Our results are also similar to those in other fields, such as behavioural ecology, where power to detect small and medium effects was 0.13-0.16 and 0.4-0.47, respectively (Jennions and Moller, 2003).

Indeed, we found that cognitive neuroscience journals had much lower power than the more psychologically and medically oriented journals. This agrees with previous inferences that FRP is likely to be high in the neuroimaging literature (Button et al. 2013; Yarkoni 2009). This phenomenon can arise for a number of reasons. First, neuroimaging studies, and studies using complex and highly sophisticated measurement tools in general, tend to require more expensive instrumentation than behavioural studies, and both data acquisition and analysis may require more time and resources per participant. This keeps participant numbers low. A related issue is that science funders may be reluctant to fund properly powered but expensive studies. Second, data analysis is highly technical and can be very flexible, with many analytical choices to be made about how exactly to analyse the results. This allows for a very high number of undocumented, sometimes poorly understood and difficult-to-replicate idiosyncratic analyses, influenced by a large number of arbitrary ad hoc decisions, which, in their entirety, may generate statistically significant false positive results with high frequency (Carp 2012; Vul et al. 2009; Kriegeskorte, 2009; Uttal 2012), especially when participant numbers are low. It is also important to consider that complicated instrumentation and (black box) analysis software are now more widely available, but training may not have caught up with this wider availability. Third, relative to the more medical journals, the stakes are probably lower in cognitive neuroscience (no patients will die, at least not immediately), which may also allow for more biased publications. Nevertheless, generalizations need to be cautious, since there can be large variability in the extent of these potential biases within a given subfield.
Some teams and subfields may have superb, error-proof research practices, while others may have more frequent problems. The power failure of the cognitive neuroscience literature is even more notable as neuroimaging (‘brain based’) data is often perceived as ‘hard’ evidence lending special authority to claims even when they are clearly spurious (Weisberg, 2008).

The reporting of p values as significant when they are not (even when juxtaposed against the other statistics from the same analysis) is probably only a modest contributor to the greater drive to generate and report statistically significant results. Most significance chasing cannot be detected through such inconsistencies. The high prevalence of reported significant results (amounting to three quarters of the reported p values and related statistics) is unrealistically high and consistent with strong selective reporting biases. This has also been demonstrated in the distributions of p values reported in the abstracts and full texts of biomedical papers (Chavalarias et al. 2016). The positive correlation between p value reporting errors and journal impact factors suggests that such errors are more prevalent in ‘high impact’ journals; these journals also had lower power in the studies they reported.

Some limitations of our study need to be mentioned. First, given the large-scale automation, we cannot fully verify whether the extracted data reflect primary, secondary or even trivial analyses in each paper. In the absence of pre-registered protocols, however, this is extremely difficult to judge even when full papers are examined. Evaluation of biomedical papers suggests that many reported p values, even in abstracts, are not pertinent to primary outcomes (Ioannidis 2014). Second, some types of errors, such as non-differential misclassification, may lead to deflated effect sizes. However, in the big picture, with very low power, inflation of the statistically significant effects is likely to be more prominent than errors reducing the magnitude of effect sizes. Third, given the large-scale automated extraction, we did not record information about the characteristics of the published studies, e.g. study design. Studies of different designs (e.g. experimental versus observational) may have different distributions of effect sizes, degrees of freedom and power, even within the same sub-discipline.

In all, the combination of low power, selective reporting and other biases and errors that we have documented in this large sample of papers in cognitive neuroscience and psychology suggests that a high FRP is to be expected in these fields. The low reproducibility rate seen for psychology experimental studies in the recent Open Science Collaboration (Nosek et al. 2015a) is congruent with the picture that emerges from our data. Our data also suggest that cognitive neuroscience may have even higher FRP rates, a hypothesis worth evaluating with focused reproducibility checks of published studies. Regardless, efforts to increase sample size and to reduce publication and other biases and errors are likely to benefit the credibility of this important literature.

## Materials and Methods

### Data extraction and validation

We extracted statistical information from cognitive neuroscience and psychology papers published as pdf files. We sampled a range of 18 journals frequently cited in cognitive neuroscience and psychology. Our aim was to collect data on the latest publication practices. To this end, we analysed the regular issues of all journals published between Jan 2011 and Aug 2014.

We categorized 10 journals as focused more on (cognitive) neuroscience (Nature Neuroscience, Neuron, Brain, The Journal of Neuroscience, Cerebral Cortex, NeuroImage, Cortex, Biological Psychology, Neuropsychologia, Neuroscience) and 5 journals as focused more on psychology (Psychological Science, Cognitive Science, Cognition, Acta Psychologica, Journal of Experimental Child Psychology). We also included three more medically oriented journals which are nevertheless often cited in cognitive neuroscience papers, so as to increase the representativeness of our sample (Biological Psychiatry, Journal of Psychiatric Research, Neurobiology of Aging). Journal impact factors ranged from 2.367 (Acta Psychologica) to 17.15 (Nature Neuroscience). Five-year impact factors were considered as reported in 2014 (see **Supplementary Table 1**).

We extracted statistical information about t tests and F tests (t values, F values, degrees of freedom, p values and effect sizes). A computer algorithm searched through each paper for frequently occurring word and symbol combinations reporting degrees of freedom and effect sizes provided as Cohen's D. When there were fewer than 20 empirical papers in a journal issue, all empirical research reports with any t and F statistics were analysed. When there were more than 20 papers in an issue, a random sample of 20 papers was analysed, merely because this was the upper limit of papers accessible in one query. This procedure sampled most papers in most issues and journals. All algorithms and computations were coded in Matlab (version 2015b).

The whole database was checked for reporting errors and potential double entries so that only unique records were kept for analysis. Further, during a validation procedure we randomly selected 100 papers with t value, df and effect size reports and 100 papers with F value and df reports. We manually checked all records in these papers and compared them to the automatically compiled database. This was done to assess the accuracy of the computer algorithm and to gather supporting information about the data. Validation results showed that the automatic extraction mechanism performed very well: the papers selected for validation included 1478 records of data, and the algorithm missed only 76 records (5.14%), usually due to atypical punctuation or line breaks. We found that the overwhelming majority of records related to two-sample t-tests reported close to equal group sizes (median ratio of group sizes = 1). The ratio of the participant numbers in the larger group to those in the smaller group was below 1.15 in 77% of records. We also established that with a df value of 10 or less about 94% of tests were one-sample or matched t-tests, whereas about 72% of records with higher df values were one-sample or matched t-tests. The F test validation procedure for 100 randomly selected papers showed that the algorithm missed only 84 out of 1084 records (92.25% of records correctly found).

### Computing effect sizes from t tests

t-test data were used for the effect size, power and false report probability analyses, as it is straightforward to estimate effect sizes from published t values. After checks for reporting errors, 7 records with df>10,000 were excluded from analysis as clear outliers. This left 27,414 potential records; 26,841 of these, from 3801 papers, had both degrees of freedom and t values reported. We used these data for the effect size analysis. 17,207 t-test records (64.1%) were statistically significant (p≤0.05) and 9634 (35.9%) were statistically non-significant (p>0.05). 2185 t-test records also reported Cohen's D as a measure of effect size (1645 records with p≤0.05 [75.3%] and 540 records with p>0.05 [24.7%]).

As it is not possible to establish the exact participant numbers per group for such a large sample, making a few reasonable assumptions is inevitable. First, based on our validation data from 1478 records, we assumed that participant numbers in two-sample t-test groups were equal on average. The number of participants per group was approximated as the upwards-rounded value of half the potential total number of participants in the study, i.e.: N_{subgroup} = round_{upper}((df+2)/2). This formula slightly exaggerates participant numbers per group, so it can be considered generous when computing power. Second, for matched t-tests we assumed that the correlation between repeated measures was 0.5. In that case effect sizes can be approximated in the same way for both one-sample and matched t-tests. These assumptions allowed us to approximate the effect sizes associated with all t-test records in a straightforward way (Hunter and Schmidt, 1990; Fritz et al. 2002). Computational details are provided in **Supplementary Material, Section 2.**

Considering the validation outcomes, we assumed that each record with a df of 10 or less had a 93% chance of relating to a one-sample or matched-sample t-test and that other records had a 72% chance of relating to a one-sample or matched-sample t-test. Hence, we estimated the effect size for each data record with an equation assuming a mixture of t-tests, where the mixture probability depended on the df value:

D = pr(t_{1}|df) · D_{t1} + pr(t_{2}|df) · D_{t2}

where pr(t_{1}|df) and pr(t_{2}|df) refer to the respective probabilities of one-sample and matched t-tests (t_{1}) and independent-sample t-tests (t_{2}), and D_{t1} and D_{t2} refer to the respective effect sizes estimated for these tests.
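In code the mixture estimate reads as follows (a Python sketch of our own, using the group-size approximation and validation-based mixture weights described in this section; the original computations were in Matlab):

```python
import math

def cohens_d_from_t(t, df):
    """Mixture estimate of Cohen's d from a t value and its df, assuming
    equal group sizes for independent-sample tests and r = 0.5 for matched
    tests (so matched tests behave like one-sample tests)."""
    n_group = math.ceil((df + 2) / 2)        # N_subgroup approximation
    d_one = t / math.sqrt(df + 1)            # one-sample / matched t-test
    d_two = t * math.sqrt(2.0 / n_group)     # independent samples, equal n
    p_one = 0.93 if df <= 10 else 0.72       # pr(t1|df) from validation
    return p_one * d_one + (1 - p_one) * d_two
```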

The power of t-tests was computed from the non-central t distribution assuming the above mixture of one-sample, matched and independent sample t-tests (Harrison and Brady, 2004). Computational details are provided in **Supplementary Material, Section 3**.

Published effect sizes are likely to be highly exaggerated. Using these exaggerated effect sizes for power calculations will then overestimate power. Hence, if we calculate FRP based on power calculated from published effect size reports we are likely to estimate the lower limits of FRP. So, we estimated the lower limits for FRP, using the probably highly inflated effect sizes (computed from published t values) to calculate power for various H_{0}:H_{1} odds and bias values and with α = 0.05. (The computation of FRP is laid out in detail in **Supplementary Material, Section 4**.)

In order to obtain the expected value of FRP for the whole literature, we weighted the FRP computed for each df and D value combination by the probability of that particular (df, D) combination occurring in the research literature and summed the results over all (df, D) combinations:

E[FRP] = Σ_{(df,D)} pr(df, D) · FRP(df, D)
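The weighting can be sketched as follows (our own Python illustration; the power levels and weights in the toy grid are made up and merely stand in for the empirical (df, D) cells):

```python
def frp(power, odds=1.0, alpha=0.05):
    """Simple-case false report probability for given H0:H1 odds."""
    return alpha * odds / (alpha * odds + power)

# Toy (power, probability-of-occurrence) pairs standing in for (df, D) cells
cells = [(0.12, 0.40), (0.44, 0.35), (0.73, 0.25)]

# Expected FRP: weight each cell's FRP by how often that cell occurs
expected_frp = sum(pr * frp(power) for power, pr in cells)
```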

### Checking for p value reporting errors: F tests

F tests were used for examining p value reporting errors, as this is more difficult with t-tests, which may report either one- or two-tailed p values. We extracted 75,904 records of data. 73,532 F-test records from 6273 papers were used for examining p value reporting errors. 51,920 records (70.6%) were statistically significant and 21,612 records (29.4%) were not significant. 68,762 records also explicitly reported p values (significant: 51,338 [74.7%]; non-significant: 17,424 [25.3%]). These data allowed us to compare reported p values with exact p values computed from the degrees of freedom and F values.

## Acknowledgments

DS is supported by the James S McDonnell Foundation.