Why 'Statistical Significance' Is Often Insignificant

(Bloomberg View) -- The knives are out for the p-value. This statistical quantity is the Holy Grail for empirical researchers across the world -- if your study finds the right p-value, you can get published in a credible journal, and possibly get a good university tenure-track job and research funding. Now a growing chorus of voices wants to de-emphasize or even ban this magic number. But the crusade against p-values is likely to be a distraction from the real problems afflicting scientific inquiry.

What is a p-value? It’s a bit subtle. Suppose that as a researcher, I’m looking for evidence of something interesting -- the effect of a new drug on blood pressure, or the impact of good teachers on student outcomes. Suppose the effect isn’t really there. Even so, there’s still some possibility that random chance will make the data look as if the effect I’m looking for is there. The p-value is the probability that I got this sort of false positive. So a low p-value means that I found a clear signal of something and that my results probably aren’t just a mirage.
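
To make that abstract definition concrete, here is a minimal simulation sketch (Python, assuming NumPy and SciPy are available; the example is illustrative and not part of the original column). It runs many hypothetical studies in which the true effect is zero and counts how often a standard t-test still reports p < 0.05:

```python
# Illustrative sketch: with NO real effect, how often does a t-test
# still report p < 0.05? (Assumes NumPy and SciPy are installed.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_subjects = 10_000, 50
false_positives = 0

for _ in range(n_studies):
    # Both groups are drawn from the same distribution: the true effect is zero.
    control = rng.normal(size=n_subjects)
    treated = rng.normal(size=n_subjects)
    _, p = stats.ttest_ind(control, treated)
    if p < 0.05:
        false_positives += 1

# Prints roughly 0.05: about 5% of no-effect studies look "significant".
print(f"False-positive rate: {false_positives / n_studies:.3f}")
```

That 5 percent false-positive rate is the best case -- it is exactly what the conventional 0.05 threshold promises when a single test is run honestly.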

That’s the theory, anyway. In reality, the p-values that researchers report in their scientific papers are a relatively weak way of filtering out false positives. First, a researcher will typically conduct a whole battery of tests -- and there’s a good chance that random noise will generate at least one eye-catching p-value even if all of the researcher’s hypotheses are wrong. Most researchers don’t correct for this. Even worse, researchers can just keep testing hypotheses until one of them comes out looking interesting, an activity known as p-hacking. As statistician Andrew Gelman has shown, avoiding certain tests after looking at the data can also be a form of p-hacking -- one that’s almost impossible to detect.
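
The multiple-testing arithmetic is easy to check with a sketch (again illustrative Python, not from the column): if every one of 20 hypotheses a researcher tests is false, the chance that at least one test clears the 0.05 bar by luck alone is 1 - 0.95^20, or about 64 percent.

```python
# Illustrative sketch of multiple testing: 20 false hypotheses per study.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, n_tests, n_subjects = 2_000, 20, 50
lucky_studies = 0

for _ in range(n_studies):
    p_values = []
    for _ in range(n_tests):
        control = rng.normal(size=n_subjects)
        treated = rng.normal(size=n_subjects)  # no real effect in any test
        p_values.append(stats.ttest_ind(control, treated).pvalue)
    if min(p_values) < 0.05:  # report only the best-looking result
        lucky_studies += 1

# Prints roughly 0.64, matching the 1 - 0.95**20 back-of-envelope figure.
print(f"Share of studies with at least one p < 0.05: {lucky_studies / n_studies:.2f}")
```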

Also, p-values receive too much attention in the first place. Just because an effect is detectable doesn’t mean it’s important. If researchers find that red wine is associated with better cardiovascular health, but the benefit they find is tiny, the discovery may get a lot more attention -- and scientific acclaim -- than it deserves. Finally, focusing on p-values ignores a huge array of things other than random noise that could invalidate a study.
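
The red-wine point can be made concrete with a quick illustrative simulation (hypothetical numbers with no connection to any actual study; Python with NumPy and SciPy assumed): given a large enough sample, an effect of just 1 percent of a standard deviation yields a vanishingly small p-value.

```python
# Illustrative sketch: a tiny effect plus a huge sample yields a tiny p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 1_000_000  # one million observations per group (hypothetical)
control = rng.normal(loc=0.00, scale=1.0, size=n)
treated = rng.normal(loc=0.01, scale=1.0, size=n)  # effect: 1% of a std dev

_, p = stats.ttest_ind(control, treated)
# p comes out astronomically small, yet the effect is practically negligible.
print(f"p = {p:.1e}, mean difference = {treated.mean() - control.mean():.4f}")
```

Detectable, in other words, is not the same as important.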

Some have blamed the reliance on p-values for the replication crises now afflicting many scientific fields. In psychology, in medicine, and in some fields of economics, large and systematic searches are discovering that many findings in the literature are spurious. John Ioannidis, professor of medicine and health research at Stanford University, goes so far as to say that “most published research findings are false,” including those in economics. The tendency of research journals to publish anything with p-values lower than 5 percent -- the arbitrary cutoff referred to as “statistical significance” -- is widely suspected as a culprit.

For these and other reasons, more and more scientists are urging big changes. In 2016 the American Statistical Association released a widely read statement criticizing the reliance on p-values and urging less emphasis on them. The journal Basic and Applied Social Psychology banned p-values outright. Some economists have suggested moving the cutoff p-value from 5 percent to 0.5 percent.

But banning p-values won’t do much to fix the scientific enterprise. In the absence of this measure, journal editors and reviewers will turn to some other method of determining whether a statistical result is interesting enough to publish. There are other criteria they could use, but they all suffer from the same basic problems.

Some argue that the idea of a single quantity that tells you whether a result is interesting or not is folly -- journal editors and reviewers can only decide if something is interesting by looking at many different quantities, and exercising judgment. This is completely correct. But as things stand, editors and reviewers already have the ability to do this -- they can choose to look not just at p-values, but at the size of effects, or their explanatory power. If they choose not to do so, it’s because editors and reviewers simply don’t have much of an incentive to evaluate a paper’s quality thoroughly before they publish it. Changing statistical reporting conventions won’t fix that basic incentive problem -- it’s peer review, not p-values, that is the weak filter.

Why do editors and reviewers not have an incentive to check findings more carefully? Because universities need professors. The academic system uses research publications as its measure of scholarly quality, even though real and interesting scientific findings are few and far between. So the academic journal system publishes lots of weak and questionable results, because this is the only way we have devised to screen people for academic jobs and tenure.

The truth is, p-values, and even the replication crisis itself, probably aren’t doing much harm to the scientific enterprise. Important, interesting findings will still be identified eventually, even if it takes several rounds of replication to confirm that they’re real.

The real danger is that when each study represents only a very weak signal of scientific truth, science gets less and less productive. Ever more researchers and ever more studies are needed to confirm each result. This process might be one reason new ideas seem to be getting more expensive to find.

If we want to fix science, p-values are the least of our problems. We need to change the incentives that push researchers to prove themselves by publishing questionable studies that just end up wasting a lot of time and effort.

This column does not necessarily reflect the opinion of the editorial board or Bloomberg LP and its owners.

Noah Smith is a Bloomberg View columnist. He was an assistant professor of finance at Stony Brook University, and he blogs at Noahpinion.

To contact the author of this story: Noah Smith at nsmith150@bloomberg.net.

For more columns from Bloomberg View, visit https://www.bloomberg.com/view.

©2017 Bloomberg L.P.
