I recently wrote a post outlining how the popular book Glow Kids undermined its credibility with fraudulent data. I also examined an overstatement by New York Times best-selling author Robert Cialdini in his most recent book.
Pop-psychology needs to be approached with a high degree of skepticism. Poor statistical techniques, biased framing, and hyperbolic misinterpretation are endemic to psychology's popular science press. In this post, I've collected a few other famous examples of popular science gone wrong.
Why We Sleep
One of my MIT mentors suggested I read Dr. Matthew Walker's 'Why We Sleep' as part of my research. It's a New York Times Bestseller; Bill Gates had this to say:
Why We Sleep is an important and fascinating book…Walker taught me a lot about this basic activity that every person on Earth needs. I suspect his book will do the same for you.
With this book, Matthew Walker launched himself into the limelight; he's appeared on 60 Minutes, Nova, BBC, NPR, and has glowing reviews across all the major news outlets. He is single-handedly shaping the discourse on sleep, and shaping the behavior of thousands of readers as well.
Why We Sleep was then picked apart very thoroughly by Alexey Guzey. His post focuses on fact-checking just the first chapter, which he calls 'riddled with scientific and factual errors.' The review is thorough, and I highly recommend reading it (Andrew Gelman recommends it too). He points to outright academic misconduct in the way data is presented, which UC Berkeley has ignored.
While Walker was forced to issue some corrections, the book is still out there. Moreover, a Google search for 'Why We Sleep' surfaces no criticism on the first page of results– only testimonials to the power of the work.
Thinking, Fast and Slow
Another book– one I actually loved– is Daniel Kahneman's Thinking, Fast and Slow. Kahneman has done a great deal of excellent scholarship in my opinion; unfortunately, he was forced to disavow the entire fourth chapter of his book, on social priming, after some of its core findings failed to replicate– this from the same Kahneman who had published an open letter (reported in Nature) warning of a looming 'train wreck' for the field.
This was largely in response to an analysis by Ulrich Schimmack, who runs one of the best blogs on the replication crisis. Just last year he published a further critique of Thinking, Fast and Slow (2020), in which he analyzed each chapter with his 'R-Index' (a score that assesses the likelihood of replication based on the power of each cited study, though it is highly variable when used to analyze just a small number of studies). The results are pretty bad. Of the 13 chapters analyzed, the majority (seven) fall below a 50% likelihood of replication (on average, for the studies in that chapter); the other six vary widely, mostly hovering in the fifties and sixties. Those odds are not great. Schimmack summarizes:
[Kahneman's] thoughts are based on a scientific literature with shaky foundations. Like everybody else in 2011, Kahneman trusted individual studies to be robust and replicable because they presented a statistically significant result. In hindsight it is clear that this is not the case. Narrative literature reviews of individual studies reflect scientists’ intuitions (Fast Thinking, System 1) as much or more than empirical findings. Readers of “Thinking: Fast and Slow” should read the book as a subjective account by an eminent psychologists, rather than an objective summary of scientific evidence. Moreover, ten years have passed and if Kahneman wrote a second edition, it would be very different from the first one. Chapters 3 and 4 would probably just be scrubbed from the book.
It is worth reading Kahneman's direct response, in which he accepts the critique, in the comments of Schimmack's 2017 post 'Reconstruction of a Train Wreck: How Priming Research Went off the Rails.' This is, of course, slightly ironic, as Kahneman calls out underpowered research specifically in his book as part of a discussion of 'the law of small numbers'; Kahneman acted admirably, though, in accepting and supporting the critique.
Before You Know It
'Before You Know It: The Unconscious Reasons We Do What We Do' was written by John Bargh, the father of the notion of 'social priming', which has failed to replicate. (Two of his most famous findings– that holding a warm beverage makes you act more warmly, and that reading words associated with aging makes you walk more slowly– are now debunked.) This book has also been covered by Ulrich Schimmack in an excruciatingly detailed, wonderful post. He estimates the replicability of the studies cited in each chapter based on their power:
The more important question is how many studies would produce a statistically significant result again if all 400 studies were replicated exactly. The estimated success rate in Figure 1 is less than half (41%). Although there is some uncertainty around this estimate, the 95% confidence interval just reaches 50%, suggesting that the true value is below 50%. There is no clear criterion for inadequate replicability, but Tversky and Kahneman (1971) suggested a minimum of 50%. Professors are also used to give students who scored below 50% on a test an F. So, I decided to use the grading scheme at my university as a grading scheme for replicability scores. So, the overall score for the replicability of studies cited by Bargh to support the ideas in his book is F.
The best-performing chapters are Chapter 10, where 62% of the studies should replicate if done exactly, and Chapter 6, where this is true of 57% of the studies. All other chapters have estimated replicability of <50% (a low of 13% appears in Chapter 3).
Keep in mind this is based purely on the statistical power of each study– we're estimating how likely it is that a 'significant' result would recur if the same study were repeated exactly. This is different from whether an effect is real and meaningful. Other studies can be done with much larger sample sizes (what we typically mean by replication in social science), and many of the concepts may well be disproven already as a result. Conversely, marginal, meaningless effects can replicate 'significantly' if the study has a large enough sample size to capture them. Schimmack's critique is really about methodology– the trust we can put in this specific set of papers, agnostic of other information.
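To make the power–replicability link concrete, here's a minimal sketch (using made-up effect sizes and sample sizes, and a simple two-sided one-sample z-test) of both points: the probability that an exact repetition comes back 'significant' is just the original study's power, and a marginal effect paired with a huge sample replicates almost every time:

```python
from math import erf, sqrt

def normal_cdf(x: float) -> float:
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def replication_probability(effect_size: float, n: int) -> float:
    """Power of a two-sided one-sample z-test at alpha = .05:
    the chance an exact repetition of the study is 'significant' again."""
    z = effect_size * sqrt(n)  # expected test statistic
    return normal_cdf(z - 1.96) + normal_cdf(-z - 1.96)

# A modest effect with a modest sample: replicates barely half the time.
typical = replication_probability(effect_size=0.30, n=50)       # ~0.56

# A marginal, practically meaningless effect with a huge sample:
# it replicates 'significantly' almost every time.
marginal = replication_probability(effect_size=0.05, n=10_000)  # ~0.999
```

The specific numbers here are illustrative, not drawn from any of the books' cited studies; the point is only that 'would replicate' and 'matters' are different questions.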
Glow Kids

I've written a more extensive essay on this particular book; Kardaras mistakenly cites some purely fabricated research within it (the study's author is the subject of a criminal complaint for fraud). It's unfortunate that this 'research' was featured in Glow Kids– a book whose premise I really appreciate.
Behave

Sapolsky was prominently featured on a Radiolab episode questioning free will; in it, he cited a famously flawed study on judges. The study also features prominently in his book, Behave. It claimed that hunger drives favorable parole rulings down from roughly 65% just after a break to nearly 0% just before one. Sapolsky interprets:
What's interesting about that? Number one, the biology makes perfect sense. What are you doing there when you are a judge trying to judge somebody from a completely different world from you to reach a point of deciding. There's mitigating fact. You're trying to take their perspective. You're trying to think about the indirect ways that let -- you're using your frontal cortex. And when you're hungry and your frontal cortex isn't working as well, it's easier to make a snap emotional judgment: this person's rotten. The second amazing thing which exactly addresses this issue is, you get that judge two seconds after they made that decision, you sit him down at that point and say "Hey, so why did you make that decision?" And they're gonna quote, I don't know, Immanuel Kant or Alan Dershowitz at you. They're going to post-hoc come up with an explanation that has all the pseudo-trappings of free will and volition, and in reality it's just rationalization. It's totally biological.
Of course, it turns out the cases were actually ordered by severity of offense, and the idea that hunger drives a judge's decisions has no basis in reality. Daniël Lakens points out the absurdity of this finding– "if hunger had an effect on our mental resources of this magnitude," he writes, "our society would fall into minor chaos every day at 11:45".
Positivity

Positivity is perhaps the most famous book of the positive psychology movement, written by Barbara Fredrickson. She has faced significant criticism for her views on positivity– her famous suggestion that a golden 2.9013-to-1 ratio of positive to negative emotions separates those who flourish from those who languish was publicly debunked, forcing the retraction of a chapter of this book. Despite that, Fredrickson continues to defend the premise that there is a tipping-point ratio of positive to negative emotions that tips people between the two outcomes (and continues to be criticized for it).
Humanistic psychologists have been skeptical of the oversimplifications of positive psychology, and Fredrickson's conception of fulfillment is one major example. More positive feelings don't seem to directly lead to fulfillment. Many deep thinkers would actually suggest that voluntary, efficacious self-sacrifice is quite important instead.
It is quite unlikely that all humans, across all stages of life and all contexts, will obey a single numerical rule. The claim also directly contradicts the well-established notion of hedonic adaptation– that we don't spiral when confronted with a temporary run of bad or good emotions. People are actually quite robustly adaptive.
These kinds of simplifications– offered without empirical data to support them– push positive psychology further from its intended purpose, and do more harm than good for people in search of meaning in their lives.
Stumbling on Happiness
I really enjoyed Dan Gilbert's Stumbling on Happiness; Gilbert is a talented and well-read author, and I think his work stands out as an interesting synthesis of many ideas. Unfortunately, Gilbert falls into many of the same traps as the others; his reporting of underlying research seems fraught with mistakes and difficult to trust. Gilbert incorrectly cites the debunked social priming research, and particularly mischaracterizes research on agency and alexithymia in ways that are pretty detrimental to his argumentation. I've written more extensively about the issues I have with some of his characterizations of the underlying research here if you're interested to learn more.
Counterclockwise

Harvard's Ellen Langer features in Dan Gilbert's work as proof that agency matters for longevity, though her research doesn't seem to offer empirical support for that conclusion (as discussed in my post on Stumbling on Happiness). Some of Langer's work has been called out in quite a bit of detail by James Coyne and Andrew Gelman. Langer's book also suffers from several passing references to debunked social priming research.
Instead of rehashing the mistakes shared with other examples, we'll take a look at the thrust of the book, which focuses on its namesake– the Counterclockwise Study.
In 1979 Harvard's Ellen Langer took sixteen men in their 70s and 80s, and had them go on a retreat where the environment and media were designed to exactly replicate life 20 years earlier. Half of the men were told to live as if it was the current day– they discussed events from two decades prior as if they were unfolding and could not reference anything in their lives after the events. The other half were told to reminisce with each other about the earlier era they were re-experiencing.
She reports that both groups had hearing, memory, height, weight, gait, posture, and grip strength improvements; they both also 'looked younger' as judged by blinded raters. The experimental group "showed greater improvement on joint flexibility, finger length (their arthritis diminished and they were able to straighten their fingers more), and manual dexterity. On intelligence tests, 63 percent of the experimental group improved their scores, compared to only 44 percent of the control group." This study led to a four hour BBC mini-series in 2010 called The Young Ones, where they repeated the same experiment with six older celebrities.
Unfortunately, it's hard to trust these results outside of what intuition confirms. The testing suite appears extensive, the sample size is terribly small, and the participants are certainly aware of the desired results (and appreciative of the researchers for their experience). Given the number of tests, the high variability of small samples, and the demand characteristics, it seems very likely at least a handful of results would come back 'significant' regardless of real underlying causality.
Despite that, the only quantitative difference– 63% vs. 44% of participants improving on intelligence tests– merely sounds impressive. This is a manipulative framing of the underlying data; it's better described as 5 of 8 in one group vs. 4 of 9 in the other. (I assume; earlier Langer states that 8 people were in each group, but 44% of 8 gives 3.5 people. One of the numbers must be incorrect– another red flag.) A difference of one person between groups of this size is not meaningful, and over the course of the (at least) twelve different tests administered, several will show a difference this large.
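Two quick back-of-envelope checks support this (taking the 5-of-8 vs. 4-of-9 split at face value, and assuming roughly twelve independent tests at p < .05, which is an idealization): a Fisher's exact test on the group difference, and the chance that at least one of twelve tests comes back 'significant' by luck alone:

```python
from math import comb

# Fisher's exact test: 5 of 8 improved (experimental) vs. 4 of 9 (control).
n_exp, k_exp = 8, 5
n_ctl, k_ctl = 9, 4
improved = k_exp + k_ctl  # column total: 9 improvers overall

def weight(k: int) -> int:
    # hypergeometric weight for k improvers landing in the experimental group
    return comb(n_exp, k) * comb(n_ctl, improved - k)

observed = weight(k_exp)
# two-sided p-value: probability of all tables at least as extreme as observed
p_value = sum(w for k in range(min(n_exp, improved) + 1)
              if (w := weight(k)) <= observed) / comb(n_exp + n_ctl, improved)
# p_value ≈ 0.64: a one-person difference carries no evidential weight

# Chance of at least one false positive across twelve independent tests:
any_false_positive = 1 - 0.95 ** 12  # ≈ 0.46
```

With nearly even odds of a spurious 'hit' somewhere in the battery, a handful of scattered significant results is roughly what chance alone predicts.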
There are no peer-reviewed publications associated with this study, either; for all the effort, it is described only briefly in her book, bracketed with very few interpretive statements. It seems like a cautious resignation to the minimal empirical value of the study's quantitative data. However, on page 167, she makes one overstatement:
The most dramatic example of language acting as placebo can be found in the counterclockwise study. The study used language to prime the participants, asking the elderly men at the retreat to speak about the past in the present tense. With language placing the experimental groups’ minds in a healthier place, their bodies followed suit.
To be clear, that isn't to say that a meaningful difference between groups doesn't exist, only that the statistics and anecdotes derived from such a severely underpowered study couldn't possibly have captured it.
The question remains– does acting like your younger self make a difference compared to reminiscing? Luckily, we can expect some rigor to end this story. Langer pre-registered a large scale replication to be carried out in 2020 in collaboration with an Italian team. Unfortunately, 'gathering a hundred 80 year olds in Italy' is the fastest way to get your research shut down during a pandemic– hopefully we'll see some results once things return to normal.
Perhaps the biggest questions are the ones left unasked. It seems quite obvious that bringing a group of dependent, isolated 70-80 year olds together on a week long nostalgia trip will revitalize them. Even with a small sample we can make that inference. The New York Times described the BBC participants as "apparently rejuvenated... [they] walked taller and indeed seemed to look younger. They had been pulled out of mothballs and made to feel important again."
How much of that is because they have structured social time with peers? How much because they are breaking their routine and experiencing something novel together? How much because the environment and media are all visceral, nostalgic primes that reawaken latent, more vital and youthful self-concepts? How much because they are the subjects of an important and meaningful scientific study? How much did it matter at all?
Aging is both a psychological and biological shift. It's certainly true that the psychological component is powerful, but it seems unlikely that the benefits here come from a re-conception of oneself as explicitly younger or a change of verbiage from past to present tense. In reality, most of the benefit likely comes from a renewed, socially-reinforced sense of purpose, agency, and novelty.
Pre-Suasion

I've discussed some of Cialdini's misinterpretations before, and I have a detailed description of some of the mistakes in this particular book here. Conceptually, Cialdini broadly argues in this text that very subtle (sometimes subconscious) interventions have a large effect on our decision making. This idea sits at the core of the replication crisis in social psychology. While there are some good insights in the book, they stand alongside several examples of weak empirical data.
One of the most popular examples of the 'power of defaults' in behavioral economics comes from Johnson and Goldstein's 2003 Science paper 'Do Defaults Save Lives?', which is frequently used to suggest that, when overwhelmed by complex decisions, we simply choose the default. This is a misinterpretation of the data– it turns out that many of the places studied have presumed consent, or don't take the consent seriously. Meta-analyses have not shown the large effects typically attributed to the phenomenon.
Thaler gives a full and reasonable treatment to the topic of opt-in and opt-out in his book, clearly aware of the problems I listed above. He settles on a policy suggestion of 'mandated choice' for organ donation, not at all relying on a 'default effect'.
Despite the careful thinking, he states that "[t]he findings of Johnson and Goldstein’s paper showed how powerful default options can be”, suggesting that these misleading results prove that the default choice effect is still powerful and important, just not binding for this specific case.
While this is a relatively minor problem, it's a misattribution worthy of note. It perpetuates a conception of human decision making that I believe is flawed– that very subtle changes drive important and complex decisions. The default effect may be real, but not when meaningful outcomes are on the line. This concept applies only when a decision is low-stakes and preferences are weak.
Generalizing the Evidence
These are some of the most famous examples of popular science books, written by very reputable and trustworthy academics (two of whom have Nobel Prizes and whom I admire greatly). In general, their mistakes are evidence of a lack of rigor in analyzing their sources. (To be fair, sources are hard to trust given the state of the replication crisis in the academic literature.) Mistakes range from relatively minor and sporadic (for Thaler) to quite major (entire books for Kahneman and Walker).
Unfortunately, any error– whether sporadic or frequent– undermines a text's trustworthiness. We're forced to fact check works of popular science instead of taking them at face value.
It seems the available evidence points heavily towards a presumption of guilt in popular science writing, at least as far as social psychology is concerned. Until you've vetted an author, these books are better conceptualized as pointers to a collection of potentially interesting primary sources. Pulling signal from the noise requires statistical fluency, patience, and rigor.