Could a Rating System Help Weigh Claims Made in Popular Science Books?

Standing in a powerful pose increases your testosterone levels. Ten thousand hours of practice leads to mastery and high achievement. Eating out of large bowls encourages overeating. These are just a few examples of big ideas that have formed the basis of popular science books, only to be overturned by further research or a closer reading of the evidence.

“Pop psychology is sort of built on this idea of the one true thing,” says Amanda Cook, executive editor at Crown who has worked on many science books. “Good scientists treat the truth as provisional. They know that science is dynamic and the scientific method is going to lead them to new truths or a refinement of truth, but readers want the one true thing, and in pop psych that means the one true thing that will change their lives.”

It’s a tension that Stanford University psychologist Jamil Zaki attempts to address in his recent book, “The War for Kindness: Building Empathy in a Fractured World.” The book is written in the breezy, accessible style typical of pop science bestsellers, but Zaki concludes it with a twist: an appendix that rates the robustness of the claims he makes. The numerical rating system is his attempt to acknowledge that some ideas have more evidence to back them than others, and that some of them might turn out to be wrong. Zaki hopes his system might provide a model for other authors who want to avoid trading in hype.

Psychology is in the midst of a reckoning, as numerous high-profile findings in the field have failed to replicate, meaning that subsequent researchers who repeated the experiments could not reproduce the original results, Zaki notes. “We psychologists have used this as an opportunity to strengthen our methods, be more transparent about our research process, and clarify exactly what we do and don’t know,” he writes. “In that spirit, I decided that readers should have tools for further evaluating the evidence presented in this book.”

So Zaki hired a Stanford colleague, psychology doctoral student Kari Leibowitz, to conduct an independent review of the evidence behind key claims. She went through each chapter and identified the main claims, and then did what she calls “a miniature literature review” to evaluate the current state of the evidence. She rated each claim on a scale of one to five (from weakest to strongest evidence) and wrote up a rationale for that rating before sending it to Zaki for discussion.

“I didn’t want to influence her scoring,” Zaki says. In a few instances, he pointed her to studies that she had overlooked, or offered other lines of evidence she hadn’t considered, but more often a low rating provoked him to either remove the claim from the book or put more cautious language around it. “If he thought the claim wasn’t strong enough, he’d go back to make that clearer in the text,” Leibowitz says.

Leibowitz sought to evaluate the claims in an unbiased manner, but she faced tricky decisions at every turn. It wasn’t feasible to rate every single claim in the book, and so she and Zaki had to choose which ones to highlight. Although they’d laid out some overarching standards for classifying evidence, doing so also required multiple judgment calls. “In general, to be rated a five there had to be dozens of studies on a given claim, often evidenced by many review papers and/or meta-analyses,” Leibowitz says.

Something rated a four had very consistent results, but none or very few meta-analyses to back it up. A three rating meant there were only a handful of studies to support the claim or there was disagreement about it in the literature. For example, she says that there was a lot of evidence to support the claim that “people who undergo intense suffering often become more prosocial as a result.” But there was also a lot of evidence supporting the opposite, that violence begets violence and suffering can make people cruel or abusive, so this claim got a three.

She weighted failed replications the same way she weighted successful ones, “as individual pieces of evidence.” If there were a lot of failed replications, the score would drop, but if there were dozens of studies in support of a given claim, one or two failed replications would usually only move a claim from a five to a four, she says.

Leibowitz rated 51 claims in the book and spent more than 100 hours doing it. “We did the best possible job of giving this relative overview for each of these claims and giving a rationale for what constitutes really strong evidence in our minds,” she says. On the book’s website, readers can download the spreadsheet of source material Leibowitz used to evaluate the claims.

As well-meaning as this approach is, there’s a limit to how objective it can be. The process was filled with subjective calls, from which claims to check to how much weight to give each study. Many of the claims were very broad, such as “empathetic individuals excel professionally” and “mindfulness increases caregiver empathy,” which could be interpreted in different ways depending on how these ideas were defined.

Simine Vazire, a psychologist at the University of California, Davis, and co-founder of the Society for the Improvement of Psychological Science, says she worries that Zaki’s rating system appears to take published studies at face value, “which seems dubious in light of what we know about questionable research and publication practices.” In Vazire’s view, “it basically equates a peer-reviewed publication with a dose of evidence, which kind of reifies the idea that peer review is a good indicator of what’s solid evidence. The whole point of the replicability crisis is that this isn’t the signal we thought it was.”

There’s also the possibility that a rating system could be gamed. “My instinct is to say that they’re on the right track with something like this, but there are so many ways for anything to be misused,” says Slate journalist Shannon Palus, who’s done fact-checking for magazines like Discover, Popular Science, and Quanta. “It’s easy to overstate the quality of the evidence.”

Palus worries that this kind of claim rating can become a “performative sifting through the evidence” intended to give certain claims credibility, rather than to find out whether they’ve earned it. It’s a tactic she’s seen employed by advocacy organizations like the Environmental Working Group, whose food rating system is aimed at helping “consumers make healthier, greener food choices,” and companies like Care/of, which sells vitamins and supplements online with ratings assuring consumers about their effectiveness.

Her concerns are shared by Andrew Gelman, a statistician at Columbia University who has been a vocal critic of overstated pop science. “It sounds like here their purpose really is to assess the evidence, that’s good,” he says. The key question, he says, is whether authors using this kind of system are coming at the work with a critical eye, or just looking for a stamp of approval to say that “everything’s OK.”

Rating evidence requires nuance, Gelman adds. “A published paper makes a lot of claims,” he says, explaining that “often there will be one part that’s reasonable and other parts that aren’t.”

Readers seem to appreciate the ratings. One Goodreads reviewer posted: “The fact that the author devoted nine pages to rating the claims he’d made and explaining his rationale for including claims for which evidence is not robust filled my heart with joy.” Another wrote, “rarely do I read a book that provides this type of breakdown of his claims. I wish all books did this to be honest.”

But how much attention the average reader will pay to the ratings is anyone’s guess. The rating system has limited value if no one uses it to update their own beliefs, and it’s hard to know how many readers will really examine the appendix.

Zaki and Leibowitz hope other authors will take up some kind of evidence rating system. “My vision and my dream for this is that this will be just the start and other people will take this idea and run with it and improve on it and that this will become the standard for this kind of book,” Leibowitz says.

Cook, who edited “The War for Kindness,” appreciated that the claim-checking process shaped what Zaki put in the text. She says she would be open to having her other authors do something like this, but that it would have to be their own impulse. “A sort of half-hearted version of this wouldn’t be very valuable.”

Most of Cook’s authors now hire fact-checkers. “That was absolutely not the case even five years ago,” she says, but the truth “seems more urgent” in this “post-fact world.”

In today’s media environment, errors can become trending hashtags in a matter of minutes. And if you get caught making an error that undermines your book’s big idea, “it can destroy your reputation,” Cook says. As an example, she points to what happened to author Naomi Wolf recently when, during a live interview, a BBC radio host pointed out that in her new book, “Outrages: Sex, Censorship, and the Criminalization of Love,” she had misunderstood the meaning of a term in archival legal documents that was crucial to her thesis. Wolf’s publisher canceled the U.S. release of the book.

Publishers don’t normally pay for fact-checking, so most authors have to pay out of their own pockets. Add to that the cost of doing a claim check, and the total bill could easily reach five figures, which would be beyond the means of most authors.

Ultimately, the most important outcome of Zaki and Leibowitz’s claim rating exercise may be that it forced Zaki to give extra consideration to the strength of his claims. It’s a worthy step that also points to what may be the most limiting aspect of his methodology. The people who are most worried about making overhyped claims are probably the ones who are least guilty of making them, says Jane C. Hu, a Seattle-based journalist and freelance fact-checker who has fact-checked numerous science books.

“If you want to cash in your credentials to write a book that you’re going to make a bunch of specious claims in,” she says, “you’re probably not the same kind of person who is going to go through the painful process of hiring a fact-checker to have them go through it.”

Christie Aschwanden is an award-winning science journalist. She’s the author of “Good to Go: What the Athlete in All of Us Can Learn from the Strange Science of Recovery” (Norton) and co-host of the podcast “Emerging Form.” Find her on Twitter at @CragCrest.
