Beyond Bonferroni Revisited: Concerns over inflated false positives in conservation genetics, genetics, and neuroscience

In 2006, Narum published a paper in Conservation Genetics that was motivated by the stringent nature of the Bonferroni approach to family-wise error (FWE) correction. That work suggested that the approach of Benjamini and Yekutieli (2001) provided adequate correction and was more biologically relevant. However, there are crucial differences between the original Benjamini and Yekutieli procedure and the one described by Narum. After carefully reviewing both papers, we believe that the Narum procedure both differs from the Benjamini and Yekutieli procedure and fails to adequately control the family-wise error. We provide an overview of approaches to FWE correction, as well as evidence for the faulty implementation of the Benjamini and Yekutieli procedure by Narum, using the equations from the respective papers, data from both papers, and the results of simulation.


Introduction
In 2006, Narum published a paper in Conservation Genetics motivated by the stringent nature of the Bonferroni approach to multiple testing correction, suggesting the False Discovery Rate (FDR) method proposed by Benjamini and Yekutieli (2001) as an alternative that is both powerful and more biologically relevant. His paper, titled "Beyond Bonferroni: Less conservative analyses for Conservation Genetics", has been cited over 500 times [https://link.springer.com/article/10.1007/s10592-005-9056-y]. The article has been cited not only in the field of conservation genetics but also, increasingly, in the fields of medicine and neuroscience. These studies apply the approach of Narum (2006), which is attributed to the Benjamini and Yekutieli (2001) (BY) procedure for multiple testing correction.
However, a careful review of the published BY approach and of what Narum describes as the BY method reveals crucial differences. Due to the omission of one term, Narum's implementation of BY is incorrect and cannot be guaranteed to control the FDR. We therefore believe that the Narum publication has created confusion about the BY procedure, and that its misuse is being propagated through an increasing number of studies. This paper has two goals. The first is to provide an overview of the Bonferroni method, the original Benjamini & Hochberg (2000) FDR (BH-FDR), and BY's method (BY-FDR). The second is to describe the faulty implementation of the BY-FDR approach described by Narum. We will demonstrate that using the multiple testing correction described by Narum results in an excessive number of false positives, especially when a large number of tests are performed.

Theory
We first review the different multiple testing approaches discussed by Narum (2006), using his notation as closely as possible. Consider a collection of k tests, each with a corresponding p-value $p_i$, $i = 1, \ldots, k$. A multiple testing procedure identifies a subset of the k tests as significant while controlling some measure of false positive risk that takes into account the number of tests performed. The Bonferroni method controls the family-wise error (FWE), the chance of one or more false positives, by using a fixed threshold of $\alpha_{\mathrm{Bonf}} = \alpha_{\mathrm{FWE}} / k$, where $\alpha_{\mathrm{FWE}}$ is the desired FWE level. All tests with $p_i \le \alpha_{\mathrm{Bonf}}$ can be declared significant while controlling the FWE. Benjamini & Hochberg (2000) introduced the False Discovery Rate (FDR) for multiple testing correction. In describing the FDR it is useful to first define the false discovery proportion (FDP): the FDP is the ratio of the number of false positive tests to the total number of significant tests, defined as 0 if no tests are significant. The FDR is the expected value of the FDP; put another way, the FDR is the expected proportion of false positives among positives. To find FDR-significant tests, denote the ordered p-values $p_{(1)} \le p_{(2)} \le \cdots \le p_{(k)}$. Then, for a desired $\alpha_{\mathrm{FDR}}$, let the index $i^*$ be found as $i^* = \max\{ i : p_{(i)} \le \frac{i}{k} \alpha_{\mathrm{FDR}} \}$; the tests with $p_i \le p_{(i^*)}$ can be declared significant while controlling the FDR at $\alpha_{\mathrm{FDR}}$.
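The two procedures just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from either paper: the function names `bonferroni_reject` and `bh_reject` and the five example p-values are hypothetical, chosen only to show the fixed threshold next to the step-up rule.

```python
import numpy as np

def bonferroni_reject(pvals, alpha_fwe=0.05):
    """Boolean mask of tests with p_i <= alpha_FWE / k (fixed threshold)."""
    p = np.asarray(pvals, dtype=float)
    return p <= alpha_fwe / p.size

def bh_reject(pvals, alpha_fdr=0.05):
    """Boolean mask of tests declared significant by the BH step-up rule."""
    p = np.asarray(pvals, dtype=float)
    k = p.size
    order = np.argsort(p)                    # indices that sort the p-values
    ranks = np.arange(1, k + 1)              # i = 1, ..., k
    below = p[order] <= ranks / k * alpha_fdr
    reject = np.zeros(k, dtype=bool)
    if below.any():
        i_star = np.nonzero(below)[0].max()  # zero-based position of i*
        # Step-up: every test ranked at or below i* is significant,
        # even if its own p-value missed its own threshold.
        reject[order[: i_star + 1]] = True
    return reject

# Hypothetical p-values, for illustration only.
example = [0.001, 0.012, 0.025, 0.2, 0.6]
print("Bonferroni:", bonferroni_reject(example))
print("BH-FDR:    ", bh_reject(example))
```

With these made-up values the Bonferroni threshold is 0.05/5 = 0.01 for every test, whereas the BH thresholds grow with rank (0.01, 0.02, 0.03, ...), which is the adaptive behaviour referred to in the text.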
The Benjamini & Hochberg FDR procedure (BH-FDR) assumes independence among the test statistics (Benjamini & Hochberg, 2000). However, BY found that weaker assumptions could be used, allowing a general form of positive dependence among the test statistics. BY also proposed another method for controlling the FDR that makes no assumptions about the dependence among the tests, provided that a more stringent criterion is used (Theorem 1.3, BY), with the index $i^*_{\mathrm{BY}}$ computed as $i^*_{\mathrm{BY}} = \max\{ i : p_{(i)} \le \frac{i}{k \sum_{j=1}^{k} 1/j} \alpha_{\mathrm{FDR}} \}$. With this approach, the tests with $p_i \le p_{(i^*_{\mathrm{BY}})}$ are marked significant and the FDR is controlled at $\alpha_{\mathrm{FDR}}$. For large k, the sum $\sum_{j=1}^{k} 1/j$ is approximately $\log(k) + \gamma$, where $\gamma \approx 0.5772$ is Euler's constant. This is the method we refer to as BY-FDR.
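A sketch of the BY-FDR step-up rule follows, assuming the form stated in BY's Theorem 1.3, in which each rank's threshold is divided by the harmonic sum 1 + 1/2 + ... + 1/k. The function name `by_reject` and the example p-values are hypothetical and serve only as illustration.

```python
import numpy as np

def by_reject(pvals, alpha_fdr=0.05):
    """Boolean mask of tests declared significant by the BY step-up rule."""
    p = np.asarray(pvals, dtype=float)
    k = p.size
    ranks = np.arange(1, k + 1)
    c_k = np.sum(1.0 / ranks)                # harmonic sum 1 + 1/2 + ... + 1/k
    order = np.argsort(p)
    # Rank-dependent thresholds: (i / (k * c_k)) * alpha_FDR
    thresholds = ranks / (k * c_k) * alpha_fdr
    below = p[order] <= thresholds
    reject = np.zeros(k, dtype=bool)
    if below.any():
        i_star = np.nonzero(below)[0].max()  # zero-based position of i*_BY
        reject[order[: i_star + 1]] = True
    return reject

# Hypothetical p-values, for illustration only.
example = [0.001, 0.012, 0.025, 0.2, 0.6]
print("BY-FDR:", by_reject(example))
```

The thresholds still increase with rank, so the procedure remains adaptive; the harmonic sum only rescales every threshold downward by the same factor.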
We can now make a quick comparison of the three methods on the basis of the smallest p-value $p_{(1)}$: Bonferroni has the fixed threshold $\alpha_{\mathrm{FWE}}/k$, while BH-FDR will compare $p_{(1)}$ to $\alpha_{\mathrm{FDR}}/k$ and BY-FDR will compare $p_{(1)}$ to approximately $\alpha_{\mathrm{FDR}}/(k \log(k))$. Of course, BH-FDR and BY-FDR are adaptive and compare increasing p-values to successively more lenient thresholds, but this comparison for $p_{(1)}$ points to how BY-FDR is much more stringent than BH-FDR. Now, in Narum (2006), the author incorrectly states that the BY-FDR threshold is fixed and equal to $\alpha_{\mathrm{FDR}} / \sum_{j=1}^{k} (1/j)$. This is a fundamental error, as a key feature of FDR methods is that they are adaptive. The error arose from neglecting that the equation above was merely one component of the BY procedure (to be substituted for q in B-Y Eq. (1) on p. 1167 (Benjamini and Yekutieli 2001)). The Narum procedure results in a fixed threshold for a specific k.
Because a fixed threshold is itself the average or per-comparison error rate (PCE), we can assess the impact of this error directly. Assuming the complete null, i.e. no signal for any test, k × PCE is the expected number of false positives. For Narum's threshold at the 0.05 level, k × PCE ≈ 1 for k = 105, while k × PCE ≈ 10 for k = 1590. This demonstrates that Narum's procedure is assured to produce an increasing number of false positives as k increases. In contrast, for Bonferroni k × PCE is exactly α FWE , i.e. always less than 1, and every valid level-α FWE or FDR procedure is guaranteed to produce no false positives with probability at least 1-α (again, in this complete null setting). While the Narum threshold does approach zero as k approaches infinity, it does so extremely slowly. For example, with 10 million tests performed, the Narum p-value threshold is 0.003, in contrast to the Bonferroni threshold of 0.000000005.
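The arithmetic behind these figures can be checked with a short sketch. It assumes Narum's fixed threshold is α divided by the harmonic sum 1 + 1/2 + ... + 1/k; the helper names `harmonic_sum`, `narum_threshold` and `bonferroni_threshold` are hypothetical.

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def harmonic_sum(k):
    """Exact harmonic sum 1 + 1/2 + ... + 1/k (cached, since k can be large)."""
    return math.fsum(1.0 / j for j in range(1, k + 1))

def narum_threshold(k, alpha=0.05):
    """Fixed threshold alpha / sum(1/j), as described for the Narum procedure."""
    return alpha / harmonic_sum(k)

def bonferroni_threshold(k, alpha=0.05):
    """Fixed Bonferroni threshold alpha / k."""
    return alpha / k

for k in (105, 1590, 10_000_000):
    t = narum_threshold(k)
    # Under the complete null, k * threshold is the expected number of false positives.
    print(f"k={k:>10,d}  Narum threshold={t:.6f}  "
          f"expected false positives={k * t:,.2f}  "
          f"Bonferroni threshold={bonferroni_threshold(k):.3e}")
```

The same multiplication applied to the Bonferroni threshold gives k × (α/k) = α for every k, which is why its expected number of false positives never grows.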
To compare the rate of significant p-values among Bonferroni, BH-FDR, BY-FDR, and Narum's interpretation of the BY approach, we conducted a simulation. We created 50,000 random realizations in which p-values were computed from draws from a standard normal distribution. We considered k ranging from 1 to 30 tests, where all tests were independent, and used nominal α FWE = α FDR = 0.05 for all methods. In this null setting, any "discovery" is a false discovery, and so the measured FDR and FWE will be the same. (While Bonferroni is often regarded as conservative, in this setting of small k and independent tests it is essentially exact.) The FDR/FWE of BY-FDR becomes increasingly conservative, while Narum's method has inflated false positives even for k = 2 tests and shows a near-linear increase with increasing k. In all realizations there was never more than 1 detection, and hence the PCE was identical to the FDR/FWE (not shown).
We also consider the specific set of 15 p-values used in Narum (2006), tabulating the p-value thresholds that were used for significance testing for each of the four methods. Table 1 shows the thresholds used for each of the 15 exemplar p-values, with significant tests marked in bold. It can be seen that BY-FDR and the Narum approach are not the same, with Narum finding four significant tests compared with two for BY-FDR.
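A simulation of this kind can be sketched as follows. This is not the authors' code, and several details are assumptions: p-values are taken to be two-sided and derived from independent standard normal draws, the function name `null_false_positive_rates` is hypothetical, and the demonstration uses 5,000 realizations rather than 50,000 to keep the run short.

```python
import math
import numpy as np

def null_false_positive_rates(k, n_real=5000, alpha=0.05, seed=0):
    """Fraction of null realizations with at least one rejection, per method."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_real, k))
    # Assumed two-sided p-values: P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    p = np.vectorize(math.erfc)(np.abs(z) / math.sqrt(2.0))
    p_sorted = np.sort(p, axis=1)
    ranks = np.arange(1, k + 1)
    c_k = np.sum(1.0 / ranks)                # harmonic sum 1 + ... + 1/k
    # A step-up procedure rejects something iff some ordered p-value
    # falls at or below its rank-specific threshold.
    return {
        "Bonferroni": np.mean(p_sorted[:, 0] <= alpha / k),
        "BH-FDR": np.mean((p_sorted <= ranks / k * alpha).any(axis=1)),
        "BY-FDR": np.mean((p_sorted <= ranks / (k * c_k) * alpha).any(axis=1)),
        "Narum": np.mean(p_sorted[:, 0] <= alpha / c_k),
    }

for k in (2, 10, 30):
    rates = null_false_positive_rates(k)
    print(k, {name: round(float(rate), 3) for name, rate in rates.items()})
```

Because every test is null, the fraction of realizations with any rejection estimates both the FWE and the FDR, so a single number per method and per k is enough to compare against the nominal 0.05.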

Discussion
Approaches for multiple testing correction have existed for over half a century. In the late 1950s, Olive Jean Dunn adapted the Italian mathematician Carlo Emilio Bonferroni's theory of inequalities for use in statistics (Dunn 1961). We believe that Narum used an equation from the BY paper (shown above) out of context.
A careful reading of Benjamini and Yekutieli (2001) shows that this expression is only one component of their procedure, to be substituted for q in their Eq. (1), and is not a stand-alone threshold. We do agree with Narum that the Bonferroni approach is often conservative for multiple testing correction, especially with dependent data. However, there has also been a growing concern that many studies fail to replicate (Ioannidis 2005; Open Science Collaboration 2015; Nichols et al. 2017). In the past, analyses were often performed without adequately controlling for the number of tests performed (Carp 2012), which resulted in numerous type I errors. We know of no justification for using the procedure described by Narum for multiple testing, and are unaware of any formal metric of false positives that it controls. Thus, we recommend that this approach not be used for multiple testing correction and that the work be corrected to properly implement the BY-FDR approach.