Suppose you're a 45-year-old woman living in the U.S. You have no history of breast cancer, nor worrisome symptoms. Should you have a mammogram?
If you follow the American Cancer Society's recommendation, the answer is "yes": You should begin routine mammography screening for breast cancer at age 45. But if you follow the U.S. Preventive Services Task Force recommendation, the answer is "no": You should probably wait another 5 years.
Which recommendation is right?
Answering this question goes well beyond the evidence. It's not just that the evidence is imperfect (though it always is). Rather, it's that turning evidence into policy requires more than empirical facts; it also depends on values. In this case, recommendations could differ depending on how much value is placed on avoiding "false alarms" — cases in which a mammogram leads to subsequent tests or treatments that are ultimately unnecessary — versus the value of preventing "misses," or cases in which a cancer that could have been caught by routine mammography slips by.
Christie Aschwanden discusses this example in a nice post at FiveThirtyEight. The headline summarizes the key lesson: "Science Won't Settle The Mammogram Debate."
But the lesson is a much more general one: Science can (and should) inform most policy decisions, but science, on its own, won't settle policy. The need to trade off different kinds of errors is one pervasive reason why. Science can quantify the relative risks, but it can't ultimately answer the question of how much we value avoiding a false alarm relative to preventing a miss.
Consider an example from another domain: eyewitness identification.
You unwittingly observe a crime, and you get a fleeting glimpse of the perpetrator. The police later show you a photograph of a suspect: Is it the person you saw?
There are two different ways you might go wrong. First, you could wrongly say "yes" when the person in the photograph is not the one you saw. If you make this kind of error — a false alarm — you risk contributing to the conviction of an innocent person. But if you wrongly say "no" when the photograph does show the person you saw — a miss — a dangerous criminal could go free.
How confident do you need to be in your judgment before you're willing to say "yes"? The answer depends not only on the quality of your memory, but also on how you weight the risk of a false alarm against the hazards of a miss.
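This confidence-threshold tradeoff can be sketched in signal-detection terms. In the illustrative Python snippet below, the distributions, the memory-strength value, and the criterion values are all invented assumptions for illustration, not figures from the article; the point is only that moving the "yes" threshold trades one error for the other.

```python
from statistics import NormalDist

norm = NormalDist()  # standard normal: familiarity of a stranger's face

def identification_errors(criterion, memory_strength=1.0):
    """Equal-variance signal-detection sketch: a stranger's familiarity is
    ~N(0, 1); the actual perpetrator's is ~N(memory_strength, 1).  The
    witness says "yes" whenever felt familiarity exceeds the criterion."""
    false_alarm = 1 - norm.cdf(criterion)                 # stranger, said "yes"
    miss = NormalDist(memory_strength, 1).cdf(criterion)  # perpetrator, said "no"
    return false_alarm, miss

# A cautious (high) criterion protects the innocent but frees more criminals;
# a lax (low) criterion does the reverse.  Memory quality is held fixed.
fa_strict, miss_strict = identification_errors(criterion=1.5)
fa_lax, miss_lax = identification_errors(criterion=0.0)
```

Nothing about the witness's memory changes between the two calls — only the value placed on avoiding each kind of error.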
Decisions about how to weight these errors of eyewitness identification arise not only for the individuals making identifications, but at the policy level as well. For instance, there's currently some debate about whether it's best to present eyewitnesses with a lineup, in which several individuals are visible simultaneously, or to have them make yes/no decisions about individuals sequentially. Although eyewitnesses' total accuracy is likely similar across these presentation modes, the kinds of errors they're more likely to make may differ. If that's right, policy decisions about which presentation style to adopt will implicitly value one type of error-avoidance over the other.
More mundane examples abound.
The weather forecast on Monday morning predicts a 20 percent chance of rain. Do you take an umbrella? Avery, who hates to carry an umbrella unnecessarily and who doesn't really mind getting wet, doesn't bother. Blake, who loves accessories and dry clothes, opts for an umbrella and a raincoat. Avery and Blake don't disagree about the probability of rain; instead, they weight the costs of different errors differently.
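Avery and Blake's disagreement can be framed as an expected-cost calculation over the same 20 percent forecast. The cost numbers below are invented purely for illustration — the article gives no such figures:

```python
P_RAIN = 0.20  # both agree on the forecast

def expected_cost(carry_umbrella, cost_carry, cost_wet, p_rain=P_RAIN):
    """Expected annoyance of a choice: carrying is a sure, certain cost;
    getting wet only happens if it rains and you have no umbrella."""
    if carry_umbrella:
        return cost_carry
    return p_rain * cost_wet

# Avery: hates carrying (cost 5), barely minds getting wet (cost 2)
avery_carry = expected_cost(True, cost_carry=5, cost_wet=2)    # 5.0
avery_skip = expected_cost(False, cost_carry=5, cost_wet=2)    # 0.2 * 2 = 0.4
# Blake: doesn't mind carrying (cost 1), hates wet clothes (cost 10)
blake_carry = expected_cost(True, cost_carry=1, cost_wet=10)   # 1.0
blake_skip = expected_cost(False, cost_carry=1, cost_wet=10)   # 0.2 * 10 = 2.0
```

With these (made-up) costs, Avery minimizes expected annoyance by skipping the umbrella and Blake by carrying one — identical probabilities, opposite rational choices.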
Carter and Dakota decide to bake chocolate chip cookies. When the timer goes off after a few minutes, they peer at the cookies uncertainly. Should they pull them out now and risk having a batch of under-baked cookies? Or wait another five minutes and, instead, risk a batch of over-baked cookies? Carter, who hates cookies to be doughy, opts for waiting. Dakota, who hates cookies that are dry, wants to take them out now. They don't disagree about the relative risks; they disagree about which risk — under- or over-baking — is more grievous.
A final example comes from science itself.
When scientists test hypotheses using statistical tests, they typically decide on some value beyond which a test is considered statistically "significant." For instance, the norm in psychology is to consider a statistical test significant when it yields a p-value below 0.05, which means that if chance alone were at work, the probability of seeing results at least as extreme would be less than 5 percent. Psychology could become more conservative, however, by lowering this critical threshold, making it less likely that the field reports false alarms: "significant" results that are just statistical flukes. But doing so would also increase the probability of misses: failures to recognize real effects. Scientific "policies" — just like policies in other domains — effectively value some kinds of error-avoidance over others.
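A rough numerical sketch of this tradeoff, for a one-sided z-test with a true effect assumed (arbitrarily, for illustration) to be 2.0 standard errors:

```python
from statistics import NormalDist

norm = NormalDist()  # standard normal distribution

def error_rates(alpha, effect_size):
    """One-sided z-test: the false-alarm rate equals alpha by construction;
    the miss rate is the chance a real effect fails to cross the cutoff."""
    cutoff = norm.inv_cdf(1 - alpha)       # critical z-value for significance
    miss = norm.cdf(cutoff - effect_size)  # true effect lands below the cutoff
    return alpha, miss

# The conventional threshold vs. a stricter one, same true effect of 2.0:
fa_05, miss_05 = error_rates(alpha=0.05, effect_size=2.0)
fa_005, miss_005 = error_rates(alpha=0.005, effect_size=2.0)
```

Lowering the threshold from 0.05 to 0.005 cuts false alarms tenfold, but with everything else held fixed, the miss rate roughly doubles — the stricter cutoff is not free.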
Recognizing that science can't settle policy doesn't open the floodgates to skepticism or radical relativism. It doesn't mean we should give up on evidence-based policy, or that "anything goes." Instead, it invites us to recognize our values and subject them to scrutiny.
It's not enough to have solid evidence; we also need solid values. And that requires careful science, but also careful thought.
Tania Lombrozo is a psychology professor at the University of California, Berkeley. She writes about psychology, cognitive science and philosophy, with occasional forays into parenting and veganism. You can keep up with more of what she is thinking on Twitter: @TaniaLombrozo