Today I came across an article by Regina Nuzzo in Nature, “Statistical Errors: P-values, the ‘gold standard’ of statistical validity, are not as reliable as many scientists assume”.
The gist of the article is as follows: John Ioannidis suggested in 2005 that most published findings are false. That is not a comforting thought, but it is not ungrounded.
The P-value was introduced by Fisher as an informal way to judge whether evidence was worthy of a second look. Other statistical methods, for example the framework developed by Neyman and Pearson, pointedly left out the P-value completely.
Ideas from Fisher, Neyman and Pearson were mixed together by scientists creating manuals of statistical methods. Now, a low P-value is considered a stamp of approval that a result represents reality. But a P-value says nothing about how plausible the hypothesis was to begin with: the more unlikely your hypothesis, the greater the chance that an exciting finding is a false alarm, even if your P-value is minuscule.
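To make that last point concrete, here is a rough sketch of how the chance of a false alarm depends on how plausible the hypothesis was beforehand. This is a simplified calculation of my own, not the one used in Nuzzo's article: it treats "significant" as a yes/no outcome, and the power of 0.8 and the example priors are assumptions chosen purely for illustration.

```python
def false_alarm_probability(prior, power=0.8, alpha=0.05):
    """Chance that a 'significant' result is actually a false alarm.

    prior: assumed probability that the hypothesis is true before the study.
    power: assumed chance of detecting a real effect (0.8 is hypothetical).
    alpha: significance threshold (the conventional 0.05).
    """
    true_positives = power * prior            # real effects that reach significance
    false_positives = alpha * (1.0 - prior)   # null effects that reach significance
    return false_positives / (true_positives + false_positives)


# Hypothetical priors: a coin flip, a 1-in-10 shot, and a 1-in-100 long shot.
for prior in (0.5, 0.1, 0.01):
    chance = false_alarm_probability(prior)
    print(f"prior={prior:.2f}  false alarm chance={chance:.2f}")
```

Under these assumptions, the false alarm chance climbs from about 6% for the coin-flip hypothesis to about 86% for the long shot, even though the significance threshold never changes.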
A small P-value can also disguise how little a result matters in practice. Nuzzo cites a study that compared the marital satisfaction of couples who met online to couples who met more traditionally. Although the P-value was very small (p < 0.002), the effect was also very small: the divorce rate changed from 7.67% to 5.96%, and the study’s happiness metric showed a change from 5.48 to 5.64 on a 7-point scale.
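A quick sketch shows why a tiny P-value and a tiny effect can coexist: with the two divorce rates held fixed, the P-value depends almost entirely on how many couples are in the study. The group sizes below are hypothetical (they are not the study's actual sample sizes), and the two-proportion z-test is my choice of test, not necessarily the one the study used.

```python
import math


def two_proportion_p_value(p1, p2, n1, n2):
    """Two-sided P-value from a pooled two-proportion z-test."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    # erfc(|z| / sqrt(2)) is the two-sided tail area of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


# Same divorce rates as quoted above, hypothetical numbers of couples per group.
for couples_per_group in (500, 5000, 50000):
    p = two_proportion_p_value(0.0767, 0.0596, couples_per_group, couples_per_group)
    print(f"n per group={couples_per_group:>6}  p={p:.2g}")
```

The difference in divorce rates is identical in every row; only the sample size changes, and the P-value shrinks with it.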
Another issue is P-hacking, where authors practice bad science by trying multiple analyses until one of them yields a “significant” P-value.
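Here is a small sketch of why this works so well for producing false positives. It assumes there is no real effect at all and that each analysis is independent, which is a simplification: real P-hacking usually reruns correlated analyses on the same data, so the exact numbers would differ. The function names are my own.

```python
import random


def chance_of_false_positive(num_analyses, alpha=0.05):
    """Probability that at least one of several independent null tests is 'significant'."""
    return 1.0 - (1.0 - alpha) ** num_analyses


def simulate_p_hacking(num_analyses, trials=20000, alpha=0.05, seed=0):
    """Fraction of simulated studies where the best P-value falls below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Under the null hypothesis, a valid P-value is uniform on [0, 1].
        p_values = [rng.random() for _ in range(num_analyses)]
        if min(p_values) < alpha:
            hits += 1
    return hits / trials


for k in (1, 5, 10, 20):
    print(f"analyses={k:>2}  formula={chance_of_false_positive(k):.2f}  "
          f"simulated={simulate_p_hacking(k):.2f}")
```

With a single analysis the false positive chance is the advertised 5%, but with ten independent tries it is already about 40%.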
There are a few ways to get around these problems. For example, it helps to report methods carefully: how the sample size was determined, which data were excluded (if any), and which data transformations were applied. This is a fairly common practice.
Another idea is two-stage analysis, which is essentially performing cross-validation in your experimental design. Researchers following this method would perform a few exploratory studies on small samples in order to come up with hypotheses, and publish a report stating their intentions (perhaps on their own website, or in a database like the Open Science Framework). Then they would replicate the study themselves, and publish both results together.
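The two-stage idea can be sketched as a simulation. Everything here is an assumption made for illustration: 20 outcome measures with no real effect, small exploratory samples, a larger fresh sample for the replication, and a simple z-test with known unit variance. The point is only that the replication stage is not contaminated by the search that preceded it.

```python
import math
import random


def z_test_p_value(sample):
    """Two-sided P-value for 'mean is zero', assuming known unit variance."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))


def draw_null_sample(rng, n):
    """Data with no real effect: standard normal noise."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def two_stage_study(rng, num_measures=20, explore_n=20, confirm_n=100):
    # Stage 1: explore every measure on a small sample and keep the best P-value.
    best_explore_p = min(
        z_test_p_value(draw_null_sample(rng, explore_n)) for _ in range(num_measures)
    )
    # Stage 2: test only the chosen measure, on fresh data.
    confirm_p = z_test_p_value(draw_null_sample(rng, confirm_n))
    return best_explore_p, confirm_p


def false_positive_rates(trials=2000, alpha=0.05, seed=1):
    rng = random.Random(seed)
    explore_hits = confirm_hits = 0
    for _ in range(trials):
        explore_p, confirm_p = two_stage_study(rng)
        explore_hits += explore_p < alpha
        confirm_hits += confirm_p < alpha
    return explore_hits / trials, confirm_hits / trials


explore_rate, confirm_rate = false_positive_rates()
print(f"'significant' after exploring:  {explore_rate:.2f}")
print(f"'significant' after replicating: {confirm_rate:.2f}")
```

In this setup the exploratory stage finds something "significant" most of the time even though nothing is there, while the replication on fresh data should stay close to the nominal 5%.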
The article reminds me of a few arguments I’ve heard about the merits of Bayesian statistics versus frequentist statistics. It seems obvious to me that in order to do quality science, prior knowledge is necessary to interpret results.
The ideas in this article are not new. Here is an older, more detailed article: The P-value fallacy