Why 'Statistical Significance' Is Often Insignificant

(Bloomberg View) -- The knives are out for the p-value. This statistical quantity is the Holy Grail for empirical researchers across the world -- if your study finds the right p-value, you can get published in a credible journal, and possibly get a good university tenure-track job and research funding. Now a growing chorus of voices wants to de-emphasize or even ban this magic number. But the crusade against p-values is likely to be a distraction from the real problems afflicting scientific inquiry.

What is a p-value? It’s a bit subtle. Suppose that as a researcher, I’m looking for evidence of something interesting -- the effect of a new drug on blood pressure, or the impact of good teachers on student outcomes. Suppose the effect isn’t really there. Even so, there’s still some possibility that random chance will make the data look as if the effect I’m looking for is there. The p-value is the probability that chance alone would produce a result at least as striking as the one I found, if the effect weren’t really there. So a low p-value means that my result would be hard to explain as a fluke, which is taken as a sign that I found a clear signal of something and that my results probably aren’t just a mirage.
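One way to make the idea concrete is a permutation test: shuffle the group labels many times, as if the drug did nothing, and count how often chance alone produces a gap at least as large as the one observed. The sketch below is illustrative only. The function name, the blood-pressure numbers, and the choice of a permutation test are assumptions made for this example, not anything taken from a particular study.

```python
import random
import statistics


def permutation_p_value(treated, control, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in group means.

    Hypothetical helper: it asks how often random relabeling of the
    subjects yields a mean gap at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(treated) - statistics.fmean(control))
    pooled = list(treated) + list(control)
    n_treated = len(treated)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Shuffling simulates a world in which the labels mean nothing.
        rng.shuffle(pooled)
        gap = abs(
            statistics.fmean(pooled[:n_treated])
            - statistics.fmean(pooled[n_treated:])
        )
        # Small tolerance guards against floating-point ties.
        if gap >= observed - 1e-12:
            at_least_as_extreme += 1

    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Made-up changes in blood pressure (mmHg) for illustration only.
drug_group = [-12, -9, -15, -7, -11, -10, -14, -8]
placebo_group = [-2, 1, -4, 0, -3, 2, -1, -5]
print(permutation_p_value(drug_group, placebo_group))
```

With groups this cleanly separated, very few shuffles match the observed gap, so the estimated p-value comes out small. Feed in two identical groups and every shuffle matches, so the estimate is 1.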

That’s the theory, anyway. In reality, the p-values that researchers report in their scientific papers are a relatively weak way of filtering out false positives. First, a researcher will typically conduct a whole battery of tests -- and yet there’s a good chance that random noise will generate at least one eye-catching p-value even if all of the researcher’s hypotheses are wrong. Most researchers don’t correct for this. Even worse, researchers can just keep testing hypotheses until one of them comes out looking interesting, an activity known as p-hacking. As statistician Andrew Gelman has shown, avoiding certain tests after looking at the data can also be a form of p-hacking -- one that’s almost impossible to detect.
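The arithmetic behind the battery-of-tests problem is short. The sketch below assumes the tests are independent and that every hypothesis is wrong, which is a simplification. The function names are made up for this example, and the Bonferroni adjustment shown is one common correction, not the only one.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n_tests independent tests
    dips below alpha purely by chance, when no effect is real."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Stricter per-test cutoff that holds the overall chance of any
    false positive to roughly alpha (the Bonferroni correction)."""
    return alpha / n_tests


for n_tests in (1, 5, 20, 100):
    print(
        n_tests,
        round(chance_of_false_positive(n_tests), 3),
        bonferroni_threshold(n_tests),
    )
```

Under these assumptions, a researcher who runs 20 tests has roughly a 64 percent chance of seeing at least one "significant" result even when nothing is going on.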

Also, p-values receive too much attention in the first place. Just because an effect is detectable doesn’t mean it’s important. If researchers find that red wine is associated with better cardiovascular health, but the benefit they find is tiny, the discovery may get a lot more attention -- and scientific acclaim -- than it deserves. Finally, focusing on p-values ignores a huge array of things other than random noise that could invalidate a study.
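To see how a trivial effect can still clear the significance bar, consider a rough calculation. The sketch below assumes two equal-sized groups, a known standard deviation of 1, and a normal approximation. The function name and the effect size of 0.02 standard deviations are hypothetical choices made for illustration.

```python
import math


def two_sided_p_value(effect_in_sd, n_per_group):
    """Approximate two-sided p-value for a difference in means between
    two equal-sized groups, assuming unit standard deviation and a
    normal sampling distribution (a simplification)."""
    standard_error = math.sqrt(2.0 / n_per_group)
    z = effect_in_sd / standard_error
    # erfc(|z| / sqrt(2)) is the two-tailed area under the normal curve.
    return math.erfc(abs(z) / math.sqrt(2.0))


tiny_effect = 0.02  # hypothetical: two hundredths of a standard deviation
for n_per_group in (100, 10_000, 50_000, 1_000_000):
    print(n_per_group, two_sided_p_value(tiny_effect, n_per_group))
```

The effect is the same in every row. Only the sample size changes, yet the p-value falls from unremarkable to far below 5 percent, which is why a low p-value by itself says little about whether a finding matters.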

Some have blamed the reliance on p-values for the replication crises now afflicting many scientific fields. In psychology, in medicine, and in some fields of economics, large and systematic searches are discovering that many findings in the literature are spurious. John Ioannidis, professor of medicine and health research at Stanford University, goes so far as to say that “most published research findings are false,” including those in economics. The tendency of research journals to publish anything with a p-value lower than 5 percent -- the arbitrary cutoff referred to as “statistical significance” -- is widely suspected to be a culprit.

For these and other reasons, more and more scientists are urging big changes. In 2016 the American Statistical Association released a widely read statement criticizing the reliance on p-values and urging less emphasis on them. The journal Basic and Applied Social Psychology banned p-values outright. Some economists have suggested moving from a cutoff p-value of 5 percent to 0.5 percent.

But banning p-values won’t do much to fix the scientific enterprise. In the absence of this measure, journal editors and reviewers will turn to some other method of determining whether a statistical result is interesting enough to publish. There are other criteria they could use, but they all suffer from the same basic problems.

Some argue that the idea of a single quantity that tells you whether a result is interesting or not is folly -- journal editors and reviewers can only decide if something is interesting by looking at many different quantities, and exercising judgment. This is completely correct. But as things stand, editors and reviewers already have the ability to do this -- they can choose to look not just at p-values, but at the size of effects, or their explanatory power. If they choose not to do so, it’s because editors and reviewers simply don’t have much of an incentive to evaluate a paper’s quality thoroughly before they publish it. Changing statistical reporting conventions won’t fix that basic incentive problem -- it’s peer review, not p-values, that is the weak filter.

Why do editors and reviewers not have an incentive to check findings more carefully? Because universities need professors. The academic system uses research publications as its measure of scholarly quality, even though real and interesting scientific findings are few and far between. So the academic journal system publishes lots of weak and questionable results, because this is the only way we have devised to screen people for academic jobs and tenure.

The truth is, p-values, and even the replication crisis, probably aren’t doing much harm to the scientific enterprise itself. Important, interesting findings will still be identified eventually, even if it takes several rounds of replication to confirm that they’re real.

The real danger is that when each study represents only a very weak signal of scientific truth, science gets less and less productive. Ever more researchers and ever more studies are needed to confirm each result. This process might be one reason new ideas seem to be getting more expensive to find.

If we want to fix science, p-values are the least of our problems. We need to change the incentive for researchers to prove themselves by publishing questionable studies that just end up wasting a lot of time and effort.

This column does not necessarily reflect the opinion of the editorial board or Bloomberg LP and its owners.

Noah Smith is a Bloomberg View columnist. He was an assistant professor of finance at Stony Brook University, and he blogs at Noahpinion.

©2017 Bloomberg L.P.