Abstract
After an experiment has been completed and analyzed, a trend may be observed that is “not quite significant”. Sometimes in this situation, researchers incrementally grow their sample size N in an effort to achieve statistical significance. This is especially tempting in situations when samples are very costly or time-consuming to collect, such that collecting an entirely new sample larger than N (the statistically sanctioned alternative) would be prohibitive. Such post-hoc sampling or “N-hacking” is denounced because it leads to an excess of false positive results. Here simulations are used to illustrate and explain how unplanned incremental sampling causes excess false positives. In a parameter regime representative of practice in many research fields, however, simulations show that the inflation of the false positive rate is surprisingly modest. The effect on the false positive rate is only half the story. What many researchers really care about is the effect of N-hacking on the likelihood that a positive result is a real effect: the positive predictive value (PPV). This question has not been considered in the reproducibility literature. The answer depends on the effect size and the prior probability of an effect. Although in practice these values are not known, simulations show that for a wide range of values, the PPV of results obtained by N-hacking is in fact higher than that of non-incremented experiments of the same sample size and statistical power. This is because the increase in false positives is more than offset by the increase in true positives. Therefore, in many situations, adding a few samples to shore up a nearly-significant result would in fact increase reproducibility, counter to current rhetoric. To strictly control the false positive rate on the null hypothesis, the sampling plan (and all other study details) must be prespecified. But if this is not the primary concern, as in exploratory studies, collecting additional samples to resolve a borderline p value can confer previously unappreciated advantages for efficiency and for the positive predictive value of the generated hypotheses.
Background
There has been much concern in recent years about the lack of reproducibility of results in some scientific literatures (1). The call for improved education in statistics and greater transparency in reporting is justified and welcome. But if we apply rules by rote, we as a community risk throwing out a lot of babies (good data, promising leads) with the statistical bath water. Experiments in biology often require substantial financial resources, scientific talent, and use of animal subjects. There is an ethical imperative to use these resources efficiently. To ensure both reproducibility and efficiency of research, experimentalists need to understand statistical issues rather than blindly apply rules.
The rule brought into question here is a cornerstone of null hypothesis significance testing: test exactly the predetermined sample size N, and then accept the verdict of the hypothesis test, whatever it is. Empirical scientists are accustomed to looking at data, so simulation is an excellent way to gain intuitions about the implications of statistical methods. Here I simulate the questionable research practice of “N-hacking” – incrementally adding more samples after the fact whenever a preliminary result is “almost significant”.
This study began with the intent of demonstrating the known dire consequences of this practice, but obtained an effect an order of magnitude smaller than previously reported (2-5). The discrepancy was traced to parameter choices: I had used parameters reflective of real-world practice in experimental biology, whereas published demonstrations had used unrealistic ones. After exploring a broad range of parameters bracketing most biology experiments, it emerged that in the relevant parameter regime for Biology, the elevation in false positive rate is quite modest and lawfully predictable. Moreover, the effects on reproducibility (PPV) – which have not been previously explored – turned out to be beneficial, not harmful. These results were both unexpected and robust. This parameter regime may be a “special case” of no interest to the field of theoretical statistics, but it is the only case of interest to experimentalists.
These simulations were meant to describe what researchers in fact do, not to prescribe what they should do. The goal is not to dismiss concerns about sampling procedures, but rather to clarify them in order to better inform choices. Readers will gain working intuitions about why N-hacking is a problem, and how the magnitude and direction of the resulting bias depend on the details of decision heuristics. The results show that in an exploratory study, judicious sample incrementation can be a better option than either starting over from scratch or abandoning a hypothesis after obtaining a nearly-significant outcome. The results also motivate why formal sequential sampling protocols could be a better choice for biology studies that require confirmatory p values.
Results
These simulations can be taken to represent a large number of independent studies, each collecting separate samples to test a different hypothesis. All simulations were performed in MATLAB 2018a. Formal definitions of terms and symbols are summarized in a side box.
Part I. Effect of incrementally growing sample size on the false positive rate
Experiments were simulated by drawing two independent samples of size N from the same normal distribution. An independent sample Student’s t-test was then used to accept or reject the null hypothesis that the samples came from distributions with the same mean. Because the samples always came from the same distribution, any positive result is a false positive. I will call the observed false positive rate when the null hypothesis is true FP0 (“FP null”), also known as the Type I Error Rate. I assume that the significance criterion α has been set in advance. By construction, the t-test produces false positives at a rate of exactly α, the significance threshold. The MATLAB code used for simulating the false positive rate on the null hypothesis (FP0) can be found in (6), along with the numeric results of all the simulations described in Figures 2-4.
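To make this setup concrete, a minimal sketch of the null simulation is shown below. This is an illustrative re-implementation, not the published code in (6); it assumes ttest2 from the MATLAB Statistics Toolbox, and the particular sample size and number of runs are arbitrary choices.

```matlab
% Minimal sketch: false positive rate FP0 under a fixed-N policy, when the
% null hypothesis is true (both groups drawn from the same distribution).
nSims = 10000;   % number of simulated experiments (arbitrary choice)
N     = 12;      % sample size per group
alpha = 0.05;    % significance criterion
pvals = zeros(nSims, 1);
for i = 1:nSims
    x = randn(N, 1);               % control group, N(0,1)
    y = randn(N, 1);               % experimental group, N(0,1): no true effect
    [~, pvals(i)] = ttest2(x, y);  % independent-sample t-test
end
FP0 = mean(pvals < alpha);         % observed false positive rate, ~alpha
fprintf('FP0 = %.4f (nominal alpha = %.2f)\n', FP0, alpha);
```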
A cautionary scenario
Suppose 10,000 separate labs each ran a study with sample size N=8, where in every case there was no true effect to be found. If all used a criterion of α = 0.05, we expect 500 false positive results. But suppose all the labs that got “nonsignificant” outcomes reasoned that their studies were underpowered, and responded by adding four more data points to their sample and testing again, repeating this as necessary until either the result was significant or the sample size reached N=1000. The interim p values would fluctuate randomly as the sample sizes are grown (Figure 1a). In two of the cases shown (red and blue curves) the p value crossed the significance threshold (α=0.05, black line) by chance. Had these studies ended as soon as p < α and reported significant effects, these would represent excess false positives, above and beyond the 5% we intended to accept. Dashed curves show how these “p values” would have continued to evolve if sampling had continued.
In a simulation of 10,000 such experiments, the p values for the initial N=8 samples were uniformly distributed between 0 and 1 (Figure 1b, blue), with 495 cases (∼5%) falling below 0.05 (red line). After N-hacking, there were 4262 false positives instead of the expected 500 (Figure 1b, black). Therefore, the final “p values” are not really p values – they do not reflect the probability of obtaining the result by chance if the null hypothesis is true. This alarming result has been pointed out by many others (1-5), and serves to illustrate that N-hacking can be a serious problem for anyone operating in this regime.
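This cautionary scenario amounts to the stopping loop sketched below. This is again an illustrative re-implementation (not the published code in (6)), using the parameters stated above: an initial N = 8 per group, increments of 4, and stopping as soon as the result is significant or the sample size reaches 1000.

```matlab
% Sketch of the cautionary scenario: whenever the result is nonsignificant,
% add 4 more samples per group and retest, until p < alpha or N = 1000.
nSims = 10000; alpha = 0.05; Ninit = 8; Nincr = 4; Nmax = 1000;
sig = false(nSims, 1);
for i = 1:nSims
    x = randn(Ninit, 1);  y = randn(Ninit, 1);     % no true effect exists
    [~, p] = ttest2(x, y);
    while p >= alpha && numel(x) < Nmax            % persist until significant or Nmax
        x = [x; randn(Nincr, 1)];
        y = [y; randn(Nincr, 1)];
        [~, p] = ttest2(x, y);
    end
    sig(i) = p < alpha;
end
fprintf('False positives: %d of %d\n', sum(sig), nSims);  % far exceeds alpha*nSims
```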
This scenario postulates extremely industrious researchers, however. Suppose the experimental units were mice. For the 5% of labs that obtained a false positive at the outset, the sample size was a reasonable N=8 mice. All other labs had larger final samples. Three quarters of the labs tested over 100 mice, and over half of the labs tested 1000 mice before giving up. This simulation also postulates extremely stubborn researchers: in 75% of the simulated runs, additional data were collected even after observing an interim “p value” in excess of 0.90. In my experience in experimental biology research, these choices are implausible.
A plausible scenario in experimental biology
Suppose instead that the sample size would be increased only for those tests that meet a criterion of “p close to α”. Furthermore, suppose that the maximum number of samples the study could or would add is no more than a few times greater than the original sample size. I simulated such an Asymmetric N-increasing policy as follows: every time a comparison yielded a p value that was “almost significant”, additional samples were added incrementally, and the t-test repeated. This was iterated until the p value was either significant, or no longer close, or the maximum number of samples was reached. The definition of “almost significant” was: α ≤ p < (1 + w) α, where 0 < w ≤ 1. For example, if α = 0.05 and w = 0.2, one would accept a hypothesis if p < 0.05, reject if p ≥ 0.06, and add samples for p values in between. Results of such a policy are shown in Figure 2.
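A sketch of this Asymmetric N-increasing rule is shown below, using the default parameters assumed for Figure 2 (Ninit = 12, Nincr = 6, Nmax = 24; see the next sections). The implementation details are illustrative rather than the author's published code.

```matlab
% Sketch of the Asymmetric N-increasing policy: add samples only while
% alpha <= p < (1+w)*alpha, up to a cap of Nmax samples per group.
nSims = 1e5; alpha = 0.05; w = 1.0; Ninit = 12; Nincr = 6; Nmax = 24;
sig = false(nSims, 1);
for i = 1:nSims
    x = randn(Ninit, 1);  y = randn(Ninit, 1);     % null hypothesis is true
    [~, p] = ttest2(x, y);
    while p >= alpha && p < (1 + w)*alpha && numel(x) < Nmax
        x = [x; randn(Nincr, 1)];                  % "almost significant": grow N
        y = [y; randn(Nincr, 1)];
        [~, p] = ttest2(x, y);
    end
    sig(i) = p < alpha;
end
fprintf('Asymmetric policy FP0 = %.4f\n', mean(sig));  % modestly above alpha
```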
As expected, this Asymmetric N-increasing policy yielded an increase in the rate of false positives FP0, and this was more severe as the eligibility window w increased (Figure 2a). Nevertheless, the overall elevation in false positives was rather modest. For example, with a policy of α = 0.05 and w = 1, sample size was grown whenever p was between 0.05 and 0.10, resulting in a realized false positive rate FP0 = 0.0625 instead of the nominal 0.05. Following this policy resulted in a negligible increase in the sample size on average (Figure 2b). Note that the false positives due to multiple comparisons are included in these reported false positive rates, i.e. these are the uncorrected false positive rates.
To many, it is counterintuitive that adding more observations could do anything but improve statistical rigor – more N is better, right? The main reason false positives are elevated is that experiments were selected for incrementation in a biased way. Only those true negatives in which the difference between experimental and control groups happened to be rather large (and thus nearly significant) were incremented, so even a small chance difference between groups in the added samples was often enough to push the overall result over the threshold for significance.
The problem is that the rule is asymmetric: it challenges a preliminary result when p is just above threshold, but not when it is just below threshold. To demonstrate this point I also simulated a Symmetric N-increasing policy, in which incremental sample growth occurred whenever a p value was close, whether below or above α: (1− w) α ≤ p < (1 + w) α. Making the policy symmetric more than overcomes the problem – it converts more false positives to true negatives than it converts true negatives to false positives, resulting in a net reduction in false positives (Figure 2c). This is because in addition to the effect noted above, the Symmetric policy also incremented the sample size in a biased subset of the false positives: ones in which the difference between experimental and control groups was rather small and thus barely significant. The Symmetric policy resulted in a slightly larger final sample size on average (Figure 2d).
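The Symmetric policy requires only changing the eligibility test in the loop above. A parallel sketch, with the same illustrative caveats, is:

```matlab
% Sketch of the Symmetric N-increasing policy: grow the sample whenever p is
% close to alpha on either side, i.e. (1-w)*alpha <= p < (1+w)*alpha.
nSims = 1e5; alpha = 0.05; w = 1.0; Ninit = 12; Nincr = 6; Nmax = 24;
sig = false(nSims, 1);
for i = 1:nSims
    x = randn(Ninit, 1);  y = randn(Ninit, 1);     % null hypothesis is true
    [~, p] = ttest2(x, y);
    while p >= (1 - w)*alpha && p < (1 + w)*alpha && numel(x) < Nmax
        x = [x; randn(Nincr, 1)];
        y = [y; randn(Nincr, 1)];
        [~, p] = ttest2(x, y);
    end
    sig(i) = p < alpha;                            % the final verdict still uses alpha
end
fprintf('Symmetric policy FP0 = %.4f\n', mean(sig));   % falls below alpha (cf. Figure 2c)
```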
In discussions of statistical malpractice, it is often asserted that an experimentalist would never add more samples after obtaining a significant p value, but interestingly there is evidence that some do (7), and my observations of real practice in biology concur with this. Therefore, the consequences of both policies will be explored further below.
Dependence on α and the eligibility window w
For the Asymmetric N-increasing policy, analysis of the simulated data reveals that for any given choice of w, the false positive rate depends linearly on α: FP0 = kα (Figure 3a). The slopes of these lines are in turn an increasing function of the decision window w (Figure 3b, symbols). For the Symmetric policy, the dependence of FP0 on α is also linear (Figure 3c), and the slope k declines with w (Figure 3d).
Dependence on initial sample size Ninit and increment size Nincr
Above I assumed an initial sample size of 12, adding 6 more samples at a time, up to a maximum of 24 samples. To determine if these results were a peculiarity of these assumptions, I repeated the simulations for Ninit ranging from 2 to 128 initial sample points, adding Nincr ranging from 1 to Ninit samples each time, and capping the maximum total sample size at Nmax = 256. These assumptions more than bracket the range of realistic sample sizes and ad-hoc sample growth that would be commonly used in many experimental biology fields.
Results for the Asymmetric policy with α = 0.05, w = 0.4 are shown in Figure 4a. The false positive rate FP0 is always elevated compared to α (black line), but this is more severe when the initial sample size is larger (curves slope upward) or the incremental sample growth is smaller (cooler colors are higher).
Nevertheless, the false positive rate didn’t exceed 0.06 for any condition. In this range of parameters, the dependence of k on w was approximately linear, so one can summarize the results for all combinations of α and w by linearly scaling them (Figure 4b-c). In the case of the Symmetric policy, the false positive rate FP0 is always lower than α; this beneficial effect is strongest when Nincr is large or Ninit is small (Figure 4d-f). In summary, the effect of uncorrected incremental sampling on the false positive rate is real, but it is modest in size and lawfully related to a handful of parameters.
The “p value” obtained after unplanned incremental sampling is still not a true p value. A number of methods are available for planned incremental sampling or p value correction (4, 8-13). If the sampling policy was not set in advance, however, a correction of the p value can only be an estimate, because you can never truly know (or prove) what you would have done if the data had been otherwise. The point here is that in exploratory studies, if one limits unplanned incrementation to cases where the initial p value is rather close to α, the bias introduced by incrementation is not very large. For example, if one’s cutoff for ad hoc sample incrementation is p < 2α (corresponding to w = 1), the false positive rate will never be elevated by more than a factor of 1.5 (see Appendix 2). Therefore, if one does a Bonferroni correction for the multiple comparisons involved (a factor of 2 or more, depending on how many times one incremented) one will have more than corrected for this deviation from the plan.
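To make that arithmetic concrete (an illustration based on the factor-of-1.5 bound just cited, not an additional result): with α = 0.05 and w = 1, the bound gives FP0 ≤ 1.5 × 0.05 = 0.075. Evaluating significance instead at a Bonferroni-corrected criterion of α′ = 0.05/2 = 0.025 (one increment, i.e. two looks at the data) gives FP0 ≤ 1.5 × 0.025 = 0.0375, which is below the originally intended 0.05.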
Part II. Trade-off between statistical power and positive predictive value
So far these simulations still make another unrealistic assumption: that the null hypothesis is always true. In real research, presumably at least some studies test for effects that in reality do exist. N-hacking increases the false positive rate expected on the null hypothesis because some true negative results will by chance be converted to false positives when a few samples are added. But the researchers’ motivation for adding samples is the hope of increasing sensitivity: some “almost-significant” effects are false negatives, which might be converted to true positives with added samples. How these effects balance depends on what fraction of the tested hypotheses are in fact true (prior probability of effect, P(H1)) and how large the effects are when present (effect size, E). The reason for this is nicely explained in (14).
To explore this in simulations, one must simulate some experiments with no effect (as above) and other experiments with real effects. In simulations, we know the ground truth about which experiments had real effects, so we can directly measure two important quantities: (1) the sensitivity or power, which is defined as the fraction of real effects for which the null hypothesis is rejected; and (2) the selectivity, or positive predictive value (PPV), which is defined as the fraction of all positive results that are real effects (as opposed to false positives).
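These quantities are linked by a standard relation: across a large population of experiments in which a fraction P(H1) test real effects,

PPV = Power × P(H1) / [ Power × P(H1) + FP0 × (1 − P(H1)) ],

where FP0 is the realized false positive rate on the null (equal to α for a fixed-N test). This makes explicit why the PPV of any policy must depend on both the prior probability P(H1) and, through Power, the effect size.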
The sensitivity-selectivity trade-off
Simulations were done exactly as described above, but now 1% of all experiments were simulated with a real effect of 1σ difference between the population means, such that rejecting the null is the correct conclusion. The remaining 99% of experiments had no real effect. The fixed-N policy was compared to either an Asymmetric or Symmetric N-increasing policy.
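A minimal sketch of this mixed simulation for the fixed-N case is shown below. The 1σ effect and 1% prior are as stated above; the choice of N = 12 (one of the sample sizes shown in Figure 5) and the other implementation details are illustrative assumptions.

```matlab
% Sketch: fixed-N experiments in which 1% of tested hypotheses have a real
% 1-sigma effect. Power = P(reject | real effect); PPV = P(real effect | reject).
nSims = 1e5; N = 12; alpha = 0.05; pH1 = 0.01; effect = 1.0;
isReal = rand(nSims, 1) < pH1;            % ground truth for each experiment
sig    = false(nSims, 1);
for i = 1:nSims
    x = randn(N, 1);
    y = randn(N, 1) + effect*isReal(i);   % shift the experimental group if H1 is true
    [~, p] = ttest2(x, y);
    sig(i) = p < alpha;
end
power = mean(sig(isReal));                % sensitivity: hits among real effects
PPV   = sum(sig & isReal) / sum(sig);     % selectivity: hits among all positives
fprintf('Power = %.2f, PPV = %.2f\n', power, PPV);
```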
First it is helpful to recall that in the standard fixed-N policy there is always a trade-off between sensitivity and selectivity, which is controlled by the choice of α. For a given sample size N, increasing the arbitrary cutoff for significance α increases sensitivity, at the expense of reduced PPV (e.g., Figure 5a, any red curve slopes downward). By varying α one can define a curve for the sensitivity-selectivity trade-off, which summarizes the options available for interpreting data sets acquired in this way. The choice of α is up to the investigator, depending on the relative priority one sets on avoiding missing real effects vs. avoiding believing false ones.
Simulating this for different choices of N further illustrates that in a fixed-N policy, a larger sample size N is always better: it increases both sensitivity and selectivity, moving the entire curve up and to the right (Figure 5a, compare any two red curves). Drawing on this intuition, the statistical quality of any two experimental policies can be compared by relating these curves. A higher curve is better – it means one could choose α for any desired PPV and achieve higher Power; or choose α for any desired Power and achieve higher PPV, compared to any curve that lies below it.
The curves for the standard fixed-N policy (red curves, Figure 5) thus provide the benchmark to which other sampling policies may be compared. An example Asymmetric N-increasing policy is shown (blue curves, Figure 5a). Because samples were added to only a few experiments, the average final sample size was negligibly greater than the fixed-N policy: ⟨Nfinal⟩ ≤ 1.02 Ninit for all parameter combinations tested (c.f. Figure 2b). Therefore, the overall sensitivity and selectivity of the policy can be reasonably compared to the fixed-N policy with N = Ninit (paired curves). For all choices of Ninit simulated, the curve for the Asymmetric N-increasing policy (blue) fell entirely above and to the right of the corresponding curve for the fixed-N policy (red). Thus the Asymmetric N-increasing policy resides entirely on a better frontier than the fixed-N policy: for any point on the fixed-N curve there exists some choice of α for which the Asymmetric policy curve has equal selectivity with higher sensitivity, and another choice of α for which the Asymmetric policy has equal sensitivity with higher selectivity.
Comparing the two policies with the same choice of α is also informative (symbols of same shape on the red vs. paired blue curves). For the parameter combinations with lower power (Ninit = 6 or 12 with any α, or Ninit = 24 with α < 0.01), using the same choice of α in an Asymmetric N-increasing policy – even without any correction for the false positive rate or multiple comparisons – yielded improvements in both statistical power and PPV relative to fixed-N. This was the case up to at least w = 1 (not shown). For the parameter combinations with higher power (Ninit = 48 with any α, or Ninit=24 with α ≥ 0.01), using the same α for the Asymmetric N-increasing policy led to a loss in selectivity relative to the fixed-N policy (the matched symbols are to the left of their fixed-N benchmarks). Still, this loss in selectivity was accompanied by a far greater improvement in statistical power than could be achieved by moving along the red curve (changing α) to obtain the same selectivity. In this sense, the Asymmetric policy represented a superior trade-off even in these cases.
The small subset of experiments for which sample size was increased had 2Ninit final samples. Is the whole effect due to the fact that un-incremented experiments lie on the fixed-N curve for N = Ninit and the incremented subset lie on the curve for N = 2Ninit? The answer is no. Considering the incremented subset of experiments separately (dotted blue curves) reveals that they live on a frontier above the curve for fixed-N experiments with a sample size of N = 2Ninit. The subset of experiments that were not incremented (the majority, which had a final sample size of exactly Ninit) lay on a curve that was slightly above or indistinguishable from the fixed-N benchmark in all cases examined (not shown).
The Symmetric N-increasing policy was superior to the fixed-N policy in every way (Figure 5b, compare red to blue), as well as beating the Asymmetric policy (compare blue curves in Figure 5a vs. 5b). Even using the same choice of α the Symmetric policy increased both selectivity and sensitivity relative to fixed-N for all conditions tested.
These simulations demonstrate that for an effect size of E = 1σ and prior probability of 0.01, N-hacking is a win-win scenario. Although the absolute numbers depend on the effect size E and fraction of experiments that had real effects P(H1), the relationships between the curves were the same for effect sizes ranging from E = 0.5 to 2 and prior P(H1) ranging from 0.001 to 0.1 (not shown). Additional simulations showed that this remained the case as either the prior probability or effect size approached 0 (although PPV approaches 0 in both cases), for a range of Ninit using Nincr = Ninit (not shown). In real experiments, E and P(H1) are not known, but this doesn’t prevent us from concluding that regardless of their values, N-hacking in this regime would improve reproducibility.
Dependence on the eligibility window w
In Part I, I showed that if one only adds samples when p is rather close to α, the false positive rate FP0 is only moderately elevated (Figure 2), but if one used a larger eligibility window w, the false positive rate could be quite high among experiments with no real effect (Figure 1). Does the benefit of N-hacking fall apart when w gets large? To test this, I further simulated results of the Asymmetric policy under this condition, for w ranging from 0.2 to 10, also varying α to define the power-PPV curves. As w increases, these curves move up and to the right (Figure 6, top row). This implies that even if one uses very loose criteria for adding samples, N-hacking has some benefits.
For a fixed choice of α, increasing w always increases sensitivity (warm colors are above cool colors along any curve, Figure 6 bottom row). This makes sense: the more willing one is to add a few more samples, the more false negatives one can rescue to true positives.
For larger sample sizes, however, uncorrected N-hacking (holding α constant) reduces positive predictive value (e.g. Ninit = 16, gray curves slant to the left) compared to a fixed-N policy (dark blue symbols). Nevertheless, the trade-off between PPV and Power is advantageous. For example, consider Ninit = 8, α = 0.05. In this case the fixed-N experiment has a PPV of 0.50 (not 0.95, as the experimenters might falsely believe), and a statistical power of 0.46. Asymmetric N-hacking with a window of w = 5 means that more samples would be added for any interim test result of 0.05 < p < 0.25. Without any correction for incremental sampling or multiple comparisons (as shown), this would erode the PPV from 0.50 down to 0.43 (in other words the False Positive Risk would be 57% instead of 50%). But in exchange for this, the statistical power would be increased from 0.46 to 0.78. The investigators would be slightly more likely to believe a result that is a fluke, but far more likely to find a real effect if it is there.
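As a worked check of these figures using the relation from Part II (the prior probability used here is an illustrative assumption, since it is not restated in this paragraph): with Power = 0.46, a false positive rate near α = 0.05, and P(H1) = 0.1, PPV ≈ (0.46 × 0.1) / (0.46 × 0.1 + 0.05 × 0.9) ≈ 0.51, in line with the quoted value of 0.50. The exact value depends on the prior and on the realized false positive rate of the policy.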
Discussion
Main conclusions
These simulations demonstrate that increasing the sample size incrementally whenever a result is “almost” significant will lead to a higher rate of false positives, if the null hypothesis is true. This has been said many times before, but most writers warn that this practice will lead to extremely high false positive rates (1-5). We can replicate those results if we use the same assumptions: that an experimentalist would add more samples after obtaining a non-significant result no matter how far from α the p value was, and would continue adding samples until N is quite large (Figure 1). If instead one considers circumstances in which the p value would have to be rather close to α for one to add samples (e.g., no more than twice α), and a limited number of total samples could be added before giving up (e.g., no more than five times the initial sample), the effects on the false positive rate are modest and bounded.
The magnitude of the increase in the false positive rate depends on the initial sample size Ninit, significance criterion α, closeness criterion w, increment size Nincr, and total sample cap Nmax. These simulations demonstrate in which direction and how steeply the false positive rate depends on these factors. Some rules of thumb emerge for how bad the effect could possibly be, given those parameters (Appendix 2). While this cannot be used to formally correct the p value, it could provide useful guidance to the researcher in an exploratory study.
Further simulations demonstrated that under many conditions this type of N-hacking is superior to a fixed-N policy in the sense that it increases the positive predictive value (PPV) achievable for any given statistical power, compared to studies that strictly adhere to the initially planned N. This has not been previously noted, and was unexpected. For experimental studies that use a small initial sample size (N ≤ 12) and α = 0.05, if one would only add samples post-hoc when p < 0.1 and would always quit before exceeding N = 50 samples, it is simply not true that N-hacking leads to an elevated risk of unreproducible results as often claimed. A verdict of “statistical significance” reached in this manner, far from being dubious, is more likely to be reproducible than results reached by fixed-N experiments with the same initial or final sample size – even if no correction is applied for sequential sampling or multiple comparisons.
With the noble motivation of improving reproducibility, researchers are now being told that if they have obtained a non-significant finding with a p value just above α, they must never add more samples to their data set to improve statistical power. They must either run a completely independent larger-N replication, or fail to reject the null hypothesis (which generally means relegation to the file drawer, in the current publishing climate). To dissuade researchers from unplanned sample incrementation, multiple didactic articles have shown that the resulting false positive rate would be wildly inflated (c.f. Figure 1). These demonstrations were unrealistic and misleading. To make informed choices, researchers need more relevant and nuanced information about the trade-offs they must negotiate.
So, is N-hacking ever OK?
Adding samples after completing the planned experiment violates a basic premise of null hypothesis significance testing (NHST), and forfeits control of the Type I Error rate. But if the goal is to generate hypotheses that are likely to be reproducible, many researchers might validly be willing to abandon having an exact p value in exchange for reducing the risk of false negatives, improving the positive predictive value, and conserving time, animal lives, and other resources. In an explicitly exploratory study, some statisticians might concede that unplanned sample incrementation is not even N-hacking.
For researchers conducting transparently exploratory studies, then, these simulations could inform better informal decision heuristics about sample growth. An exploratory study should be labeled as such, disclose that sample incrementation occurred, report the interim N and p values, and describe the decision heuristics used as honestly as possible. The simulations presented here would help a reader interpret the implications of those choices.
But if an exact p value is required, as in a confirmatory study, no deviation from the prospective experimental design is OK, including N-hacking. That doesn’t rule out incremental sampling, however. It would not be N-hacking if the incrementation policy were committed to in advance, because pre-specification makes it possible to determine the results expected when the null hypothesis is true, at least by simulation.
If one is going to pre-specify an incremental sampling plan, however, one could probably do better than the ad-hoc heuristics simulated here, which were meant merely to describe what I believe to be common lab practices. It is beyond the present scope to explain and compare sequential sampling methods, and others have ably done so (15, 16). Here I will just provide a brief indication of some options.
One option is a phased study. For example, one could prespecify a 2-phase protocol with an initial phase of N = 16 and α = 0.10, followed (if a “significant” effect is found in Phase I) by a second phase with N = 33 and α = 0.01. Compare this to a Symmetric N-increasing policy with Ninit = 16, Nincr = 1, α = 0.05, w = 1, Nmax = 128. In both cases, additional data will be collected whenever the initial sample yields p < 0.10. If prespecified (and no other deviations from the research plan occurred) both would have strictly interpretable p values. In a simulation with an effect of size 0.5 SD and a prior probability of 0.1 (10⁶ runs), both had an average sample size of about 20, a statistical power of about 26%, and a PPV of about 92%.
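The two-phase arm of this comparison can be sketched as follows. This is an illustrative re-implementation using the numbers stated above (effect 0.5 SD, prior 0.1), not the code used to generate the reported values, so its printed results will only approximate them.

```matlab
% Sketch of the prespecified 2-phase protocol: Phase I tests N = 16 per group
% at alpha1 = 0.10; if "significant", Phase II grows the sample to N = 33 per
% group and tests at alpha2 = 0.01.
nSims = 1e5; N1 = 16; N2 = 33; a1 = 0.10; a2 = 0.01;
pH1 = 0.1; effect = 0.5;                        % prior and effect size as in the text
isReal = rand(nSims, 1) < pH1;
sig = false(nSims, 1);  finalN = N1*ones(nSims, 1);
for i = 1:nSims
    x = randn(N1, 1);  y = randn(N1, 1) + effect*isReal(i);
    [~, p] = ttest2(x, y);
    if p < a1                                   % Phase I passed: collect Phase II data
        x = [x; randn(N2 - N1, 1)];
        y = [y; randn(N2 - N1, 1) + effect*isReal(i)];
        finalN(i) = N2;
        [~, p] = ttest2(x, y);
        sig(i) = p < a2;                        % confirmatory criterion in Phase II
    end
end
fprintf('<N> = %.1f, Power = %.2f, PPV = %.2f\n', ...
    mean(finalN), mean(sig(isReal)), sum(sig & isReal)/sum(sig));
```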
Another option is Wald’s Sequential Probability Ratio Test (17), which has been proven to be optimal in some respects. In Wald’s method, one sets in advance a threshold a to accept and another threshold b to reject the null hypothesis. Then one computes a test statistic S after each new sample point is added (the cumulative log likelihood ratio of the alternative hypothesis relative to the null). If S ≤ a the null hypothesis is accepted, if S ≥ b the alternative is accepted, and if a < S < b, one continues sampling. The thresholds a and b can be set analytically to obtain the desired statistical power (1 − β, where β is the Type II error rate) and false positive rate α. Superficially, the N-incrementing policies simulated here resemble Wald’s method in that there are two thresholds and an indeterminate range between them in which sampling continues, but Wald selects these two thresholds in an optimal way. A downside of Wald’s method is that one must commit to sampling until one or the other threshold is crossed, which puts one at risk of having to test a very large N.
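For reference, in Wald’s classical approximation the two thresholds on the cumulative log likelihood ratio are a ≈ ln[β / (1 − α)] (accept the null) and b ≈ ln[(1 − β) / α] (accept the alternative), where α is the desired Type I error rate and β the desired Type II error rate. These expressions come from Wald’s original treatment, not from the simulations reported here.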
A third option is Bayesian Sequential Sampling. This method sets a criterion c, and then sequentially computes the Bayes Factor for the hypothesis vs. the null hypothesis. If BF > c the hypothesis is accepted; if BF falls below the reciprocal criterion 1/c the hypothesis is rejected; and otherwise one keeps sampling (18). This is also closely related to Wald’s method and the drift diffusion model (DDM) of decision-making (19), and does not require knowledge of the prior probability.
Broader Implications
In the effort to promote rigor in science, we need to question “questionable” research practices more deeply. Some may be inevitably and severely misleading (20). Others may have small effects, or only in specific circumstances. The potential for abuse does not establish actual abuse; sometimes the same practice (e.g. “unplanned sample incrementation”) could either reduce or increase reliability of research, depending on exactly how it is deployed. A more realistic and nuanced exploration is far more instructive for researchers, and can lead to more useful suggestions for improved practice of science.
Many experimental studies in Biology are exploratory, involving not only unplanned incremental sampling but also iterative revisions of the experimental methods, analysis methods, and hypotheses. In such studies one cannot obtain a confirmatory p value, even if the sampling plan is prespecified. However, this flexibility may be essential to the success of the research in terms of making valid, novel discoveries efficiently. Therefore, science reforms that seek to turn all research projects into confirmatory research could backfire. Instead, we in Biology need to be more open about labeling exploratory studies as such (including refraining from reporting p values or telling null-hypothesis-testing stories), and work harder to articulate the methods and heuristics we routinely employ to ensure scientific rigor in the context of exploratory studies.
Definitions
Acknowledgements
The author acknowledges Hal Pashler, Casper Albers and Daniel Lakens for valuable discussions and helpful comments on earlier drafts of the manuscript.
Appendix 1 Extensions and limitations of these results
These simulations used a normal distribution for the source distributions and an independent sample t-test as the hypothesis test. But the analysis of the false positive rate FP0 only depends on the assumption that the statistical test used generates p values that are uniformly distributed between 0 and 1 when the null hypothesis is true. In other words, as long as the statistical test being used is valid for the distribution being sampled and the structure of the experiment, the dependence of the false positive rate on parameters in these simulations should generalize to any source distribution and statistical test. Power analysis may be affected by the shape of the source distribution, however, so generality of those results to other distributions should not be assumed.
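This uniformity assumption is easy to check by simulation for any combination of source distribution and test. A minimal sketch for the normal/t-test case used here (illustrative; substitute the distribution and test of interest):

```matlab
% Sketch: check that a given test yields uniformly distributed p values under
% the null, for the source distribution of interest (here, normal + ttest2).
nSims = 1e5; N = 12;
pvals = zeros(nSims, 1);
for i = 1:nSims
    [~, pvals(i)] = ttest2(randn(N, 1), randn(N, 1));
end
% Under the null, P(p < x) should equal x for every x in [0, 1].
x = (0.05:0.05:1)';
empirical = arrayfun(@(a) mean(pvals < a), x);
disp([x, empirical]);    % first column nominal, second column observed
```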
I have simulated the practice of unplanned sample incrementation after computing a p value on the initially planned sample. But even if no interim statistical tests are performed, the same issues arise. For example, deciding whether to collect more data depending on the effect size seen in the initial data, or based on visual inspection of scatter plots, is also N-hacking.
In the real world, the prior probability of a true effect P(H1) and the effect size E are unknown to the investigator. But in simulations the effect size and prior probability are known. Testing a wide range of values, it was possible to draw general conclusions about the direction of the effect of N-hacking, entirely on frequentist grounds. In “underpowered” conditions (low N, stringent α, small effect size, low prior probability), whatever the PPV would have been using fixed-N, the PPV after Asymmetric N-increasing would be greater. Under other conditions, the PPV after Asymmetric N-increasing is lower than that of a fixed-N experiment, but there still exists some choice of α that would provide the same PPV as fixed-N with higher power, and some other choice that would provide the same power with higher PPV. How to find these values of α is not addressed, however.
These simulations only considered experiments in which a single hypothesis is tested on each sample. Testing multiple hypotheses on a single sample (such as in a gene chip array experiment) is a very different situation, because in that case incrementing N and retesting would lead to re-testing of all the hypotheses, regardless of their original p values. That situation is not considered here.
Numerical simulations and graphs are easy for experimentalists to understand, because they present the expectations of the hypothetical scenario in terms directly comparable to data. But I have not attempted an analytic treatment that would allow for a proof or specification of the conditions under which these results obtain.
These simulations asked: if a population of scientists followed a certain sampling policy, what fraction of their experiments would yield a “significant” difference when the null hypothesis was in fact true (FP0), and what fraction of their “significant” findings would be real effects (PPV)? These are population-level questions. When interpreting any single experiment, however, one should take into account the specific p values that were obtained (a “p-equals” rather than “p-less-than” approach) (21).
As others have noted, “chasing significance”, such as by N-hacking, may be incentivized by the currently standard practice of setting arbitrary cutoffs for “statistical significance” and reducing analog p values to binary hypothesis tests. It is not at all clear that experimental science is well served by this overall approach (23-25, 26). But since N-hacking biases the p value itself, the issues explored here would arise even if no decision threshold were used.
Many statisticians advocate supplementing reported p values with some other statistical measure such as the Odds Ratio (21, 27), Bayes Factor (18, 28), the False Positive Risk (1-PPV) (22), or a non-Bayesian bound on the Bayes Factor (29). Some of these measures do not make any assumptions about how data were collected, and in this respect are immune to concerns about N-hacking.
Appendix 2 A conservative bound?
If the simulated decision rules were implemented as strict policies, the simulated data show the following inequalities (dotted lines, Figure 4 c,f): These are loose bounds (in many conditions the false positive rate falls well below this value), but have the virtue of being easy to calculate. For example: an Asymmetric N-increasing policy with w = 0.4, Ninit = 10, Nincr = 10, Nmax = 50, would have an estimated FP0 < 0.0550 by rule of thumb, compared to the simulation result of FP0 = 0.0541 ± 0.0001.
Additional simulations for α = 0.05 or 0.10, Nincr = 1 (i.e. the worst case conditions) were extended to w = 19 for Ninit = 2 to 128 with Nmax = 256 and still did not exceed this empirical bound (not shown). The MATLAB code provided in (6) can be used to simulate the false positive rate for other parameter combinations.
In principle, these inequalities could be used to estimate a bound on the false positive rate or estimate a corrected p value after unplanned sample incrementation if the heuristic decision rule can be articulated. This estimate will be conservative if one assumes an asymmetric policy, a larger window w than one thinks one would ever increment, and a maximum sample size Nmax larger than one thinks one would ever collect. But this will still be only an estimate, unless the decision policy is fixed in advance.
Footnotes
Added a figure (now Figure 1). Corrected some inaccurate or unclear statements. Moved minor points to appendices.