A farewell to Bonferroni: the problems of low statistical power and publication bias

Shinichi Nakagawa

doi:10.1093/beheco/arh107

2004

Abstract

Recently, Jennions and Møller (2003) carried out a meta-analysis on statistical power in the field of behavioral ecology and animal behavior, reviewing 10 leading journals including Behavioral Ecology. Their results showed dismayingly low average statistical power (note that a meta-analytic review of statistical power is different from post hoc power analysis as criticized in Hoenig and Heisey, 2001). The statistical power of a null hypothesis (Ho) significance test is the probability that the test will reject Ho when a research hypothesis (Ha) is true. Knowledge of effect size is particularly important for statistical power analysis (for statistical power analysis, see Cohen, 1988; Nakagawa and Foster, in press). There are many kinds of effect size measures available (e.g., Pearson's r, Cohen's d, Hedges's g), but most of these fall into one of two major types, namely the r family and the d family (Rosenthal, 1994). The r family shows the strength of relationship between two variables while the d family shows the size of difference between two variables. As a benchmark for research planning and evaluation, Cohen (1988) proposed 'conventional' values for small, medium, and large effects: r = .10, .30, and .50 and d = .20, .50, and .80, respectively (in the way that p values of .05, .01, and .001 are conventional points, although these conventional values of effect size have been criticized; e.g., Rosenthal et al., 2000). The meta-analysis on statistical power by Jennions and Møller (2003) revealed that, in the field of behavioral ecology and animal behavior, average statistical power was less than 20% to detect a small effect and less than 50% to detect a medium effect.
This means, for example, that when an experimental effect is of medium size (i.e., r = .30, d = .50), the average behavioral scientist performing a statistical test has a greater probability of making a Type II error (or β) (i.e., not rejecting Ho when Ho is false; note that statistical power equals 1 − β) than if they had flipped a coin.
Here, I highlight and discuss an implication of this low statistical power for one of the most widely used statistical procedures, Bonferroni correction (Cabin and Mitchell, 2000). Bonferroni corrections are employed to reduce Type I errors (i.e., rejecting Ho when Ho is true) when multiple tests or comparisons are conducted. Two kinds of Bonferroni procedures are commonly used. One is the standard Bonferroni procedure, where a modified significance criterion (α/k, where k is the number of statistical tests conducted on given data) is used. The other is the sequential Bonferroni procedure, which was introduced by Holm (1979) and popularized in the field of ecology and evolution by Rice (1989) (see these papers for the procedure). For example, in a recent volume of Behavioral Ecology (vol. 13, 2002), nearly one-fifth of papers (23 out of 117) included Bonferroni corrections. Twelve articles employed the standard procedure while 11 articles employed the sequential procedure (10 citing Rice, 1989, and one citing Holm, 1979). A serious problem associated with the standard Bonferroni procedure is a substantial reduction in the statistical power of rejecting an incorrect Ho in each test (e.g., Holm, 1979; Perneger, 1998; Rice, 1989). The sequential Bonferroni procedure also incurs a reduction in power, but to a lesser extent (which is the reason that the sequential procedure is used in preference by some researchers; Moran, 2003). Thus, both procedures exacerbate the existing problem of low power identified by Jennions and Møller (2003).
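The two procedures named above can be sketched in a few lines of Python. This is a minimal illustration, not code from the article: the function names `bonferroni` and `holm` and the five p values are hypothetical, and the sequential version follows the usual step-down description of Holm's method (smallest p value compared with α/k, the next with α/(k − 1), and so on, stopping at the first nonsignificant test).

```python
def bonferroni(p_values, alpha=0.05):
    """Standard Bonferroni: reject each Ho whose p value is <= alpha / k."""
    k = len(p_values)
    return [p <= alpha / k for p in p_values]


def holm(p_values, alpha=0.05):
    """Sequential (Holm) Bonferroni: step down through the sorted p values."""
    k = len(p_values)
    # Indices of the tests, ordered from smallest to largest p value.
    order = sorted(range(k), key=lambda i: p_values[i])
    reject = [False] * k
    for rank, i in enumerate(order):
        # The criterion relaxes from alpha/k to alpha/(k-1), ..., alpha/1.
        if p_values[i] <= alpha / (k - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p values are retained
    return reject


# Hypothetical p values from five tests on the same data set.
p_values = [0.001, 0.012, 0.020, 0.040, 0.300]
standard = bonferroni(p_values)
sequential = holm(p_values)
print(standard)
print(sequential)
```

With these made-up p values the sequential procedure rejects one more Ho than the standard procedure, which illustrates why it loses less power, although both are stricter than testing each p value against α = .05.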
For example, consider an experiment in which both an experimental group and a control group consist of 30 subjects. After an experimental period, we measure five different variables and conduct a series of t tests on each variable. Even prior to applying Bonferroni corrections, the statistical power of each test to detect a medium effect is 61% (α = .05), which is less than a recommended acceptable 80% level (Cohen, 1988). In the field of behavioral ecology and animal behavior, it is usually difficult to use large sample sizes (in many cases, n < 30) because of practical and ethical reasons (see Still, 1992). When standard Bonferroni corrections are applied, the statistical power of each t test drops to as low as 33% (to detect a medium effect at α/5 = .01). Although sequential Bonferroni corrections do not reduce the power of the tests to the same extent, on average (33–61% per t test), the probability of making a Type II error for some of the tests (β = 1 − power, so 39–66%) remains unacceptably high. Furthermore, statistical power would be even lower if we measured more than five variables or if we were interested in detecting a small effect.
Bonferroni procedures appear to raise another set of problems. There is no formal consensus for when Bonferroni procedures should be used, even among statisticians (Perneger, 1998). It seems, in some cases, that Bonferroni corrections are applied only when their results remain significant. Some researchers may think that their results are 'more significant' if the results pass the rigor of Bonferroni corrections, although this is logically incorrect (Cohen, 1990, 1994; Yoccoz, 1991). Many researchers are already reluctant to report nonsignificant results (Jennions and Møller, 2002a,b).
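The drop in power described in the worked example (two groups of 30, a medium effect of d = .50, five tests) can be roughly reproduced with a normal approximation. The sketch below rests on assumptions that are not stated in the text: the function name `approx_power` is hypothetical, the test is treated as one-sided (an assumption chosen because it lands near the quoted figures), and the normal distribution stands in for the noncentral t distribution, so the values will be close to, but not identical with, the percentages quoted above.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha):
    """Approximate power of a one-sided two-sample test of effect size d.

    Uses a normal approximation; an exact calculation would use the
    noncentral t distribution and give slightly different values.
    """
    z = NormalDist()
    # Expected value of the test statistic when the true effect is d.
    shift = d * sqrt(n_per_group / 2)
    # One-sided critical value (one-sidedness is an assumption here).
    critical = z.inv_cdf(1 - alpha)
    return 1 - z.cdf(critical - shift)


# Medium effect, 30 subjects per group, before and after dividing alpha by 5.
power_uncorrected = approx_power(0.5, 30, 0.05)
power_bonferroni = approx_power(0.5, 30, 0.05 / 5)
print(round(power_uncorrected, 2), round(power_bonferroni, 2))
```

Calling the same function with a smaller d or with α divided by more than five tests shows the further loss of power mentioned in the text.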
The wide use of Bonferroni procedures may be aggravating the tendency of researchers not to present nonsignificant results, because presenting more tests with nonsignificant results may make previously 'significant' results 'nonsignificant' under Bonferroni procedures. The more detailed the research (i.e., the more variables researchers measure), the lower their probability of finding significant results. Moran (2003) recently named this paradox a hyper-Red Queen phenomenon (see the paper for more discussion on problems with the sequential method).
Imagine that we conduct a study where we measure as many relevant variables as possible, 10 variables, for example. We find only two variables statistically significant. Then, what should we do? We could decide to write a paper highlighting these two variables (and not reporting the other eight at all) as if we had hypotheses about the two significant variables in the first place. Subsequently, our paper would be published. Alternatively, we could write a paper including all 10 variables. When the paper is reviewed, referees might tell us that there would have been no significant results if we had 'appropriately' employed Bonferroni corrections, so that our study would not be advisable for publication. However, the latter paper is

Behavioral Ecology Vol. 15 No. 6: 1044–1045 doi:10.1093/beheco/arh107 Advance Access publication on June 30, 2004

Authors	Shinichi Nakagawa
Journal	Behavioral Ecology
Year	2004
DOI	10.1093/beheco/arh107
URL	https://academic.oup.com/beheco/article-pdf/15/6/1044/17274115/arh107.pdf https://doi.org/10.1093/beheco/arh107