Optimal sample size calculation for null hypothesis significance tests

Traditional study design tools for estimating appropriate sample sizes are not consistently used in ecology, and their use can lead to low statistical power to detect biologically relevant effects. We have developed a new approach to estimating optimal sample sizes that requires only three parameters: a maximum acceptable average of α and β, a critical effect size of minimum biological relevance, and an estimate of the relative costs of Type I vs. Type II errors. This approach can be used to show the general circumstances under which different combinations of critical effect sizes and maximum acceptable averages of α and β are attainable for different statistical tests. The optimal α sample size estimation approach can require fewer samples than traditional sample size estimation methods when the costs of Type I and II errors are assumed to be equal, but it recommends comparatively more samples as the costs of Type I and Type II errors become increasingly unequal. When sampling costs and absolute costs of Type I and II errors are known, optimal sample size estimation can be used to determine the smallest sample size at which the cost of an additional sample outweighs its associated reduction in errors. Optimal sample size estimation constitutes a more flexible and intuitive tool than traditional sample size estimation approaches, given the constraints and unknowns commonly faced by ecologists during study.

To directly compare sample size recommendations generated using the optimal α approach with those generated using the traditional sample size estimation technique, which requires specifying desired α and β levels, we chose a two-tailed, two-sample t-test over a range of critical effect sizes and relative costs of Type I vs. Type II error. We compared sample size recommendations between a standard approach targeting α = 0.05 and β = 0.2 (80% statistical power) and the optimal α approach targeting a maximum acceptable average of α and β of 0.125, the value associated with using α = 0.05 and attempting to achieve 80% statistical power.
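The comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it replaces the t distribution with a normal approximation, searches a fixed grid of candidate α values, and assumes that the optimal α is the one minimizing the cost-weighted error `cost_ratio * alpha + beta`, with the sample size accepted once the plain average of α and β at that optimum is at or below the target. The function names, the grid, and the `cost_ratio` parameter are all hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
# Candidate alpha levels 0.001, 0.002, ..., 0.999 (the grid includes 0.05).
ALPHA_GRID = [k / 1000 for k in range(1, 1000)]
# Two-tailed critical values, precomputed once per candidate alpha.
Z_CRIT = [STD_NORMAL.inv_cdf(1 - a / 2) for a in ALPHA_GRID]


def beta(z_crit, d, n_per_group):
    """Type II error rate of a two-tailed, two-sample test of standardized
    effect size d (normal approximation, equal group sizes)."""
    shift = d * (n_per_group / 2) ** 0.5
    return STD_NORMAL.cdf(z_crit - shift) - STD_NORMAL.cdf(-z_crit - shift)


def traditional_n(d, alpha=0.05, target_beta=0.2, n_max=10_000):
    """Smallest per-group n reaching the target beta at a fixed alpha."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    for n in range(2, n_max + 1):
        if beta(z_crit, d, n) <= target_beta:
            return n
    raise ValueError("no sample size up to n_max meets the target")


def optimal_alpha_n(d, max_avg_error=0.125, cost_ratio=1.0, n_max=10_000):
    """Smallest per-group n whose optimal alpha keeps the average of alpha
    and beta at or below max_avg_error. cost_ratio is the assumed cost of a
    Type I error relative to a Type II error. Returns (n, optimal alpha)."""
    for n in range(2, n_max + 1):
        betas = [beta(z, d, n) for z in Z_CRIT]
        # Assumption: optimal alpha minimizes the cost-weighted error.
        i = min(range(len(ALPHA_GRID)),
                key=lambda j: cost_ratio * ALPHA_GRID[j] + betas[j])
        if (ALPHA_GRID[i] + betas[i]) / 2 <= max_avg_error:
            return n, ALPHA_GRID[i]
    raise ValueError("no sample size up to n_max meets the target")


if __name__ == "__main__":
    d = 0.5  # illustrative critical effect size, not a value from the study
    print("traditional n per group:", traditional_n(d))
    for ratio in (1, 0.5, 2, 0.1, 10):
        n, alpha = optimal_alpha_n(d, cost_ratio=ratio)
        print(f"cost ratio {ratio}: n per group = {n}, optimal alpha = {alpha}")
```

Because α = 0.05 is one of the candidates on the grid, the equal-cost recommendation in this sketch can never exceed the traditional one, and any unequal cost ratio can only leave the recommendation unchanged or raise it.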

(A) Type I errors considered half as serious as Type II errors, (B) Type I errors considered twice as serious as Type II errors, (C) Type I errors considered 10 times less serious than Type II errors, and (D) Type I errors considered 10 times more serious than Type II errors.

Correlation and ANOVA tests

The influences of critical effect size and maximum acceptable average of α and β on sample size recommendations differ among test types (Fig. 6, Fig. 7). For simple linear correlation tests (Fig. 6), sample sizes smaller than 10 are rarely informative, as these small sample sizes are recommended only when the critical effect size is r ≥ 0.7 (R² ≈ 0.5), the maximum acceptable average of α and β is ≥ 0.2, or both. Sample sizes approaching 30 can be more informative, being recommended when the critical effect size is r = 0.5 (R² = 0.25), the maximum acceptable average of α and β is ≥ 0.125, or both. In ANOVA study designs (Fig. 7), increasing the number of levels of a factor decreases the number of replicates required within each level of the factor. However, this decrease in replicates per factor level comes at the expense of an even greater increase in the total number of replicates required across all factor levels, such that using more replicates within fewer factor levels is more efficient than using fewer replicates within more factor levels. The total number of samples in an ANOVA design required to achieve a maximum acceptable average of α and β of 0.05 for a critical effect size of f² = 1 is 16 with 2 groups (8 replicates × 2 groups), 24 with 4 groups (6 replicates × 4 groups), 30 with 6 groups (5 replicates × 6 groups), and 32 with 8 groups (4 replicates × 8 groups).
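The ANOVA trade-off above is plain arithmetic, and a short Python sketch makes it easy to check. The replicate counts are the ones reported in the text for f² = 1 and a maximum acceptable average of α and β of 0.05; the variable names are illustrative only.

```python
# Replicates required per factor level, keyed by number of groups, as
# reported in the text (f² = 1, maximum acceptable average of α and β = 0.05).
replicates_per_group = {2: 8, 4: 6, 6: 5, 8: 4}

# Total samples = replicates per group × number of groups.
totals = {groups: reps * groups for groups, reps in replicates_per_group.items()}

if __name__ == "__main__":
    for groups, reps in replicates_per_group.items():
        print(f"{groups} groups × {reps} replicates = {totals[groups]} samples")
```

Replicates per group fall from 8 to 4 as the number of groups rises from 2 to 8, while the total number of samples doubles from 16 to 32.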
Type I and II errors are equal, the optimal α sample size estimation approach can achieve the same maximum acceptable average of α and β at the same effect size using fewer samples than the traditional sample size estimation approach using α = 0.05 and 80% statistical power (Fig. 8).
As relative costs of Type I vs. Type II errors become more unequal, the sample sizes recommended by the optimal α approach increase to the point where they surpass those recommended by the traditional sample size estimation approach using α = 0.05 and 80% statistical power. Optimal α sample size recommendations are larger than those of the traditional sample size estimation approach for highly unequal relative costs of Type I vs. Type II errors; however, when relative costs of Type I and II errors are highly unequal, the practice of using α = 0.05 and 80% statistical power becomes increasingly difficult to justify.

(Table 1). Decreasing the Type I vs. Type II error cost from 1 to 0.25 resulted in small increases in sample size recommendations of 2 to 6 samples. A 10-fold decrease in the maximum acceptable average of α and β and a 2-fold decrease in critical effect size each resulted in a greater than 2-fold increase in the sample size recommendation.
(Table 2)

Differences among statistical test-types

The relationship between the maximum acceptable average of α and β, critical effect size, and optimal sample size is not consistent among statistical techniques. For correlation/regression tests, the sample size recommendation curves on the graph of critical effect size vs. maximum acceptable average of α and β switch from curving outward away from the origin to curving inward toward the origin at between 10 and 15 samples (Fig. 6). This means that below 10 to 15 samples, correlation/regression tests are particularly ineffective for testing null hypotheses, because they force a choice between a very large critical effect size (for which null hypothesis testing becomes increasingly unnecessary) and a very high maximum acceptable average of α and β, or both. For ANOVA, the sample size recommendation curves on the graph of critical effect size vs. maximum acceptable average of α and β have the same general shape as for a t-test, with the key difference being that the curves shift downward as the number of ANOVA factor levels increases, indicating that it takes fewer replicates per factor level to detect effects of a given size as the number of factor levels increases. However, the total number of samples required to detect a particular effect size will always be smaller with fewer factor levels.
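The weakness of small-sample correlation tests can be explored numerically. The sketch below rests on assumptions and does not reproduce the analysis behind Fig. 6: it uses the Fisher z-transformation as a normal approximation to the two-tailed test of a correlation, assumes equal costs of Type I and II errors, and searches a fixed grid of α values. The function name `min_average_error` is hypothetical.

```python
from math import atanh, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
# Candidate alpha levels 0.001, 0.002, ..., 0.999.
ALPHA_GRID = [k / 1000 for k in range(1, 1000)]


def min_average_error(r, n):
    """Smallest attainable average of alpha and beta for a two-tailed test of
    correlation r with n samples (Fisher z approximation, equal error costs)."""
    if n <= 3:
        raise ValueError("the Fisher z approximation needs n > 3")
    # atanh(r) has approximate standard error 1 / sqrt(n - 3).
    shift = atanh(r) * sqrt(n - 3)
    best = 1.0
    for alpha in ALPHA_GRID:
        z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
        beta = STD_NORMAL.cdf(z_crit - shift) - STD_NORMAL.cdf(-z_crit - shift)
        best = min(best, (alpha + beta) / 2)
    return best


if __name__ == "__main__":
    for r in (0.5, 0.7):
        for n in (5, 10, 15, 30):
            print(f"r = {r}, n = {n}: minimum average of alpha and beta = "
                  f"{min_average_error(r, n):.3f}")
```

Under this approximation, the attainable average error at small n stays high unless the critical effect size is large, which is the same qualitative pattern the text describes for samples below 10 to 15.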