So You Think Peer Review Works. Wanna Bet?

(Bloomberg Opinion) -- When they’re asked to make bets, social scientists have a surprising knack for predicting which of their colleagues’ papers are full of baloney. This is in contrast to the notoriously bad job social scientists have apparently been doing evaluating papers through traditional peer review.

Repeated efforts to test the health of the field keep finding that a high proportion of peer-reviewed, published findings fail to hold up when other researchers try to replicate them. One of the most worrisome was a systematic look at 100 psychology studies, in which more than half failed to replicate.

It would appear, then, that researchers in the social sciences know what passes the “sniff test” and what doesn’t. But the peer review process is not capitalizing on their ability. The betting approach, explored in a study published Tuesday in Nature Human Behaviour, started with its own attempt to replicate a series of high-profile findings. This time, 21 results were examined, all of which had been published in the two highest-impact journals, Science and Nature. Using larger sample sizes, researchers were able to replicate 13, though for some of those the effect size was only about half of that originally claimed. For the other eight, replication efforts showed no effect.

The researchers then presented social scientists with a prediction market: a chance to buy shares that would yield a return depending on whether a given result was replicated or not. With that system, using groups of 18 to 80 peers, a majority correctly predicted replication for all 13 studies that did replicate. They correctly predicted failure for five of the eight that didn’t, and gave about 50-50 odds to the other three.
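For readers who want to check the arithmetic, the tallies reported here can be added up in a few lines. This is only a rough sketch using the counts quoted in this column; the variable names are made up for illustration, and it does not reproduce the study's own analysis or its market prices.

```python
# Counts as quoted in the column; names are illustrative, not from the study.
replicated = 13          # results that held up on replication
not_replicated = 8       # results that showed no effect
total = replicated + not_replicated

called_replicated = 13   # replications the market majority expected
called_failed = 5        # failures the market majority expected
toss_ups = 3             # failures the market priced at about 50-50

replication_rate = replicated / total
clear_calls = called_replicated + called_failed
clear_call_share = clear_calls / total

print(f"Replication rate: {replication_rate:.0%}")
print(f"Clear correct calls: {clear_calls} of {total} ({clear_call_share:.0%})")
print(f"Toss-ups: {toss_ups} of {total}")
```

On these numbers, roughly 62 percent of the results replicated, and the market made a clear, correct call on 18 of the 21, with the remaining three treated as coin flips rather than called wrongly.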

Some of the papers that failed replication altogether were, on the face of things, proposing preposterous claims. One concluded that reading literature for three minutes would make you more empathetic. Another, that holding a heavier clipboard made people predisposed to want to hire a job candidate. One claimed that washing your hands will make you feel less guilty about past moral transgressions. (All were published in Science.) These were passed off as reliable based on “statistical significance tests,” now known to be flawed or easily misused.

An accompanying opinion piece lamented that we can’t reliably conclude that reported findings are true. That’s not quite the point, though. Scientific findings aren’t supposed to be “true” in any final, dogmatic way. But there is a line between good and bad science. High-quality science is enlightening and honest; it may take a stumbling step closer to truth. Bad science draws conclusions based on wishful thinking and error (and much more rarely, cheating), and sometimes leads people to believe things that are, in hindsight, kind of silly.

But some fields are much quicker than others to correct the record. Experts’ bunkum detectors went on full alert when researchers claimed that particles were moving faster than the speed of light, when NASA announced Martian bacteria in a meteorite, and following a vague but spectacular claim about arsenic-based life. All were quickly double-checked and debunked.

The trouble with social science, and some areas of medicine, is that even the most bizarre and high-profile claims have simply remained in the literature, unchecked. In the past, the press could act as a second filter, as it did with that Mars meteorite claim back in the 1990s. But in this century, there’s a perceived demand for instant news in the form of zippy, shareable nuggets. When the claim of arsenic-based life cropped up in 2010, wild public interest had crested by the time the more skeptical accounts rolled in.

Many of the irreproducible social science papers reinforce a common theme: that human behavior is buffeted around by seemingly irrelevant stimuli, and that this happens in some sort of systematic, measurable way. Because such claims seem so weird, and also confirm a popular assumption that human minds are weird, they tend to generate big publicity. The studies that did worst in the prediction market got glowing international coverage.

Before this latest replication effort, there were different schools of thought about the relative reliability of the highest-impact journals, said Caltech economics professor Colin Camerer, who led the work. On the one hand, these journals might use higher-quality peer review. On the other hand, Science and Nature are both known to favor cute, surprising findings, though in the past such results have often turned out to be wrong.

The answer was that the top journals made a somewhat poor showing in social science, with a replication rate of about 60 percent, but even that was better than what came out of more general replication studies. The prediction market part of the study, however, showed that people can apply critical thinking to their peers’ work. With traditional peer review, papers in these top journals go to three people chosen for relevant expertise. They evaluate the studies for originality, experimental design and other factors.

The prediction market harnessed a kind of crowd wisdom, and focused entirely on the question of whether a finding would hold up to replication. Of course, social scientists in particular might have wised up a bit after earlier analyses of studies led to what’s become known as a “replication crisis.” But the prediction market success might have some value in the hard sciences too. Claims may get a better vetting when they’re evaluated by a bigger crowd, when there’s a focus on predicting replication, and when people are asked to put their money where their expertise is.

This column does not necessarily reflect the opinion of the editorial board or Bloomberg LP and its owners.

Faye Flam is a Bloomberg Opinion columnist. She has written for the Economist, the New York Times, the Washington Post, Psychology Today, Science and other publications. She has a degree in geophysics from the California Institute of Technology.

©2018 Bloomberg L.P.