Science, Trust And Psychology In Crisis

13.7: Cosmos And Culture. High-profile failures to replicate classic psychology experiments have made the news. Common research practices are under attack. Commentator Tania Lombrozo suggests a way forward.

When I attended my first scientific conference at the tender age of 20, one of my mentors surprised me with the following bit of advice. Transcribed directly from memory:

"You should be sure to attend the talk by so-and-so. You can always trust his results."

This casual remark made a deep impression on me. What did trust have to do with anything? This was supposed to be science! Based on evidence! It shouldn't have mattered who performed the experiment, who delivered the talk or whose name was on the ensuing publication.

[Illustration: a dog tied up to a pole topped with a sign featuring the letter "P."]

As my training in experimental psychology advanced, I encountered the same idea in various forms. Some findings were taken more seriously than others, usually based on which lab produced them. And it wasn't simply a matter of prestige, like the quality of the journal in which a paper was published, or how famous the authors were. It also wasn't about friendship and it was rarely about ideology. It was a more basic form of trust in the quality and soundness of the research.

This notion of trust didn't stem from fears of fraud or deception. When a result was approached with some skepticism, it wasn't that data fabrication was ever suspected, or that anyone assumed nefarious intent on the part of the scientists involved. So it took some personal experience conducting research and going through the publication process before I had a good sense for what was going on.

And here's what I learned: There's a gap between what you get in a polished scientific presentation or publication and actual scientific practice — the minute details of what happens in the preparation, execution, analysis and reporting of every study. And that gap can be traversed with more or less diligence and care.

The gap between practice and publication is one reason psychology is embroiled in what some are calling a "replication crisis" — a lack of confidence in the reality of many published psychological results.

It doesn't help that there have been some high-profile cases of alleged scientific misconduct within psychology (like those of Marc Hauser and Diederik Stapel), as well as some failures to replicate the results of classic experiments. But the more troubling issue for the field is a set of pervasive yet problematic practices that can support an overabundance of false positives: statistically significant results that make it into the scientific literature but that don't reflect real psychological phenomena.

How could this happen?

The short answer is that such "findings" sneak in through the gap between practice and publication — especially in the analysis and selective reporting of data.

One of the most common practices to undergo recent scrutiny is so-called "p-hacking," which an article at Nature news includes in a list of statistical muddles encouraged by psychologists' reliance on "p-values," the standard statistic used to judge how likely a particular pattern of data would be if chance alone were at work:

"Perhaps the worst fallacy is the kind of self-deception for which psychologist Uri Simonsohn of the University of Pennsylvania and his colleagues have popularized the term P-hacking; it is also known as data-dredging, snooping, fishing, significance-chasing and double-dipping. 'P-hacking,' says Simonsohn, 'is trying multiple things until you get the desired result' — even unconsciously. It may be the first statistical term to rate a definition in the online Urban Dictionary, where the usage examples are telling: 'That finding seems to have been obtained through p-hacking, the authors dropped one of the conditions so that the overall p-value would be less than .05,' and 'She is a p-hacker, she always monitors data while it is being collected.'"

To illustrate how p-hacking can occur, suppose you conduct a study to test the hypothesis that people who regularly listen to This American Life are better at considering multiple points of view than those who don't regularly listen to the program. The best practice is to make most decisions about the study in advance: how many people will participate, whether and why you'll exclude some participants, how you'll define a "regular" listener, and so on. If these decisions are instead made after you've looked at some or all of the data, it's tempting to make the decisions in a way that favors what you expect to find or that supports some other sexy and statistically significant result.

For example, suppose it turns out that weekly This American Life listeners — but not biweekly listeners — do significantly better than others on measures of perspective-taking. That could influence your decision about how to define a "regular" listener. And the influence might be a subtle one. It's not that you set out to define terms in your favor but rather that you look at some data and think: "Of course a regular listener is a weekly listener!" If it turns out that your effect is present for men, but not for women, you might come up with post-hoc reasons why splitting up the data by gender made sense all along.

The more ways you look at the data, the greater the potential to find something with a p-value that meets the criterion of being "less than .05," the magic threshold that governs what psychologists typically consider statistically significant. So without correcting for these choices in some way, you'll inflate the odds of generating a false positive. And because papers rarely include a list of all analyses that didn't pan out in addition to those that did, peer reviewers can only guess at what happened between practice and presentation. That is, they're left to fill in the gap with their own charitable or uncharitable assumptions.
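The arithmetic behind this inflation can be made concrete with a minimal simulation sketch. Everything below is invented for illustration: each simulated study is run in a world where there is no real effect, once with a single planned comparison and once taking the best of five post-hoc carvings of the data (standing in for "weekly listeners only," "men only" and so on). The z-test assumes unit-variance populations purely to keep the demo self-contained.

```python
import math
import random

def z_test_p(a, b):
    """Two-sided z-test p-value for a difference in means, assuming
    (for simplicity of the demo) unit-variance populations."""
    n1, n2 = len(a), len(b)
    se = math.sqrt(1.0 / n1 + 1.0 / n2)
    z = (sum(a) / n1 - sum(b) / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))  # = 2 * (1 - Phi(|z|))

def simulate(n_studies=5000, n=40, seed=1):
    random.seed(seed)
    single, hacked = 0, 0
    for _ in range(n_studies):
        # Null world: "listeners" and "non-listeners" perform identically.
        listeners = [random.gauss(0, 1) for _ in range(n)]
        others = [random.gauss(0, 1) for _ in range(n)]
        # Honest analysis: one comparison, decided in advance.
        if z_test_p(listeners, others) < 0.05:
            single += 1
        # P-hacked analysis: try several post-hoc carvings of the same
        # data and count the study a "success" if any one is significant.
        candidates = [
            z_test_p(listeners, others),              # everyone
            z_test_p(listeners[: n // 2], others),    # "weekly" listeners only
            z_test_p(listeners[n // 2 :], others),    # "biweekly" listeners only
            z_test_p(listeners[::2], others[::2]),    # one subgroup split
            z_test_p(listeners[1::2], others[1::2]),  # the other subgroup
        ]
        if min(candidates) < 0.05:
            hacked += 1
    return single / n_studies, hacked / n_studies

honest_rate, hacked_rate = simulate()
print(f"false-positive rate, one planned test:   {honest_rate:.3f}")
print(f"false-positive rate, best of five tests: {hacked_rate:.3f}")
```

Even though every comparison here is a comparison between identical populations, shopping among five carvings of the same data pushes the false-positive rate well above the nominal 5 percent.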

So there's a lot more to assessing the quality of a finding than the p-value reported in a paper. And that's why reputation and trust — "peer review" in a broad sense — can play a role in scientific judgments.

Problems like p-hacking are one of the main reasons some psychologists are calling for the "pre-registration" of studies, with hypotheses and analysis plans recorded before data collection begins. Doing so reduces the kind of flexibility that enables p-hacking, and it also narrows the gap between practice and publication.

But pre-registration also has critics who worry that it puts "science in chains." Some, like Uta Frith and Chris Frith, fear "the creeping in of increasingly rigid rules and regulations":

"Do we want the type of regulations that are in place for ethics applications for psychological experiments? Regulations can seriously delay scientific projects, and yet cannot prevent other cases of bad practice."

So here's another approach: Let's accept the role of trust, and advance the role of training.

Consider the flipside of p-hacking — what I'll call "p-diligence."

P-diligent researchers adopt practices — such as additional analyses or even experiments — designed to evaluate the robustness of their results, whether or not these practices make it into print. They might, for example, analyze their data under different exclusion criteria — not to choose the criterion that makes some effect most dramatic, but to make sure that the claims in the paper don't depend on that potentially arbitrary decision. They might analyze the data using two statistical methods — not to choose the single one that yields a significant result, but to make sure that both do. And they might build in checks for various types of human error, analyzing uninteresting aspects of the data to make sure there's nothing weird going on, like a bug in their analysis code.

If these additional data or analyses reveal anything problematic, p-diligent researchers will temper their claims appropriately, or pursue further investigation as needed. And they'll engage in these practices with an eye toward avoiding potential pitfalls, such as confirmation bias and the seductions of p-hacking, that could lead to systematic errors. In other words, they'll "do their p-diligence" to make sure that they — and others — should invest in their claims.
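As a sketch of what one such p-diligence check might look like in code (the function names, cutoff values and synthetic data below are all invented for illustration), a researcher could re-run the same permutation test under several outlier-exclusion rules and report the results side by side, rather than quietly picking the rule that flatters the effect:

```python
import random

def perm_p(a, b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        x, y = pooled[: len(a)], pooled[len(a):]
        if abs(sum(x) / len(x) - sum(y) / len(y)) >= observed:
            hits += 1
    return hits / n_perm

def exclude(sample, z_cut):
    """Drop points more than z_cut sample standard deviations from the
    mean; z_cut=None keeps everything."""
    if z_cut is None:
        return list(sample)
    m = sum(sample) / len(sample)
    sd = (sum((v - m) ** 2 for v in sample) / (len(sample) - 1)) ** 0.5
    return [v for v in sample if abs(v - m) <= z_cut * sd]

def robustness_report(a, b, cutoffs=(None, 3.0, 2.5)):
    """P-diligence sketch: the same test under each exclusion rule."""
    return {cut: perm_p(exclude(a, cut), exclude(b, cut)) for cut in cutoffs}

# Demo on synthetic data with a real (simulated) group difference.
rng = random.Random(42)
treatment = [rng.gauss(0.8, 1.0) for _ in range(30)]
control = [rng.gauss(0.0, 1.0) for _ in range(30)]
for cut, p in robustness_report(treatment, control).items():
    print(f"exclusion rule {cut!r}: p = {p:.3f}")
```

If all three p-values tell the same story, the exclusion rule isn't doing the work; if the conclusion flips with the cutoff, that fragility is exactly what belongs in the paper.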

P-hacking and p-diligence have something in common: Both involve practices that aren't fully reported in publication. As a consequence, they widen the gap. But let's face it: While the gap can (and sometimes should) be narrowed, it cannot be closed.

Some underspecification is inevitable. There's a reason it takes years of training to become a good scientist and why people learn to do science through what's essentially an apprenticeship system, not just by memorizing a list of best practices.

Some underspecification is even desirable. You don't want to read a 200-page tome detailing how a scientist debugged the code used to run her experiment or an appendix with the dozens of histograms she stared at. Much better to trust that she knows what she's doing and that her research team has good systems in place to catch errors.

The upshot is that it might not be realistic — or altogether desirable — to close the gap. And so we can't eliminate the role of trust. What we can do is train scientists to traverse it with greater care — with greater p-diligence — and to create a scientific infrastructure that allows them to do so.

We can practice what Uta Frith and Chris Frith, in a post last week at The Guardian, call "slow science." And if there's a role for regulation and registration, it might not be for the cases in which we must trust our colleagues but for those in which we don't trust ourselves.

A lot of good things have come out of psychology's replication crisis. Psychology should take a critical look at common practices and adopt more conservative and sophisticated approaches to statistical analysis. Norms for scientific presentation and publication should become more accommodating of how research actually unfolds, including how and when various decisions are made. Data and experimental stimuli should be made available to other researchers. And efforts at replication should be supported, which means they have to be valued by journals, by hiring and tenure committees, and by funding agencies.

But if I had the opportunity to check in with my 20-year-old self, just starting to appreciate the subtle roles of trust in science, I would tell her not to worry.

Science sometimes gets things wrong. Scientists often get things wrong. But what makes science so powerful is how it responds to new evidence and how scientists learn from their mistakes. Psychology's current crises are growing pains, no doubt unpleasant, but likely to result in a better, more robust science. And in that future psychological science there will still be a role for trust, but that trust will be better placed.

You can keep up with more of what Tania Lombrozo is thinking on Twitter: @TaniaLombrozo