Lately, I have been a mood of scientific scepticism: I have the feeling that the worldwide academic system is more and more failing to produce useful research. Christophe Lalanne’s twitter feed lead me to an interesting article in a non-mainstream journal: A farewell to Bonferroni: the problems of low statistical power and publication bias, by Shinichi Nakagawa.
Each study performed has a probability of being wrong. Thus performing many studies will lead to some wrong conclusions by chance. This is known in statistics as the multiple comparisons problem. When a working hypothesis is not verified empirically in a study, this null finding is seldom reported, leading to what is called publication bias: discoveries are further studied; negative results are usually ignored (Y. Benjamini). Because only discoveries, called detections in statistical terms, are reported, published results contain more false detections than the individual experiments and very little false negatives. Arguably, the original investigators have corrected using the understanding that they gained the experiments performed and account in a post-hoc analysis for the fact that some of their working hypothesis could not have been correct. Such a correction can work only in a field where there is a good mechanistic understanding, or models, such as physics, but in my opinion not in life and social sciences.
Let me quote some relevant extracts of the article, as you may never have access to it thanks to the way scientific publishing works:
Recently, Jennions and Moller (2003) carried out a meta-analysis on statistical power in the field of behavioral ecology and animal behavior, reviewing 10 leading journals including Behavioral Ecology. Their results showed dismayingly low average statistical power (note that a meta-analytic review of statistical power is different from post hoc power analysis as criticized in Hoenig and Heisey, 2001). The statistical power of a null hypothesis (Ho) significance test is the probability that the test will reject Ho when a research hypothesis (Ha) is true.
The meta-analysis on statistical power by Jennions and Moller (2003) revealed that, in the field of behavioral ecology and animal behavior, statistical power of less than 20% to detect a small effect and power of less than 50% to detect a medium effect existed. This means, for example, that the average behavioral scientist performing a statistical test has a greater probability of making a Type II error (or beta) (i.e., not rejecting Ho when Ho is false; note that statistical power is equals to 1 - beta) than if they had flipped a coin, when an experiment effect is of medium size.
Imagine that we conduct a study where we measure as many relevant variables as possible, 10 variables, for example. We find only two variables statistically significant. Then, what should we do? We could decide to write a paper highlighting these two variables (and not reporting the other eight at all) as if we had hypotheses about the two significant variables in the first place. Subsequently, our paper would be published. Alternatively, we could write a paper including all 10 variables. When the paper is reviewed, referees might tell us that there were no significant results if we had “appropriately” employed Bonferroni corrections, so that our study would not be advisable for publication. However, the latter paper is scientifically more important than the former paper. For example, if one wants to conduct a meta-analysis to investigate an overall effect in a specific area of study, the latter paper is five times more informative than the former paper. In the long term, statistical significance of particular tests may be of trivial importance (if not always), although, in the short term, it makes papers publishable. Bonferroni procedures may, in part, be preventing the accumulation of knowledge in the field of behavioral ecology and animal behavior, thus hindering the progress of the field as science.
Some of the concerns raised here are partly a criticism of Bonferoni corrections, i.e. in technical terms correcting for family-wise error rate (FWER). It is actually the message that the author wants to convey in his paper. Proponents of controling for false discovery rate (FDR) argue that an investigator shouldn’t be penalized for asking more questions, and the fraction of errors in the answers should be controlled, rather than the absolute value. That said, FDR, while useful, does not answer the problems of publication bias.