# The Battle for Psychology

Epistemic status for the first part of this post:

Epistemic status for the second part:

## Part 1 – The opposite of power is bullshit

I recently met a young lady, and we were discussing statistical power. No, I don’t see why that’s weird. Anyway, I mentioned that an underpowered experiment is useless no matter if it gets a result or not. For example, the infamous power posing experiment had a power of merely 13%, so it can’t tell us whether power posing actually does something or not.

The young lady replied: For the power pose study, if the power was .13 (and beta was .87), then isn’t it impressive that they caught the effect despite the chance being only 13%?

This question reflects a very basic confusion about what statistical power is. I wish I could explain what it is with a link, but I had trouble finding a clear explanation of that anywhere online, including in my previous post. Wikipedia isn’t very helpful, and neither are the first 5 Google results. So, let me take a step back and try to explain as best I can what power is and what it does.

Statistical power is defined as the chance of correctly identifying a true positive result (if it exists) in an experiment. It is a function of the effect size, the measurement error, and the desired rate of false positives.

The effect size is what you’re actually after, but you can estimate it ahead of time and it has to be part of the hypothesis you’re trying to prove. You don’t care about power If your study is exploratory, but then you’re not setting to confirm any hypotheses either. The bigger the effect size, the higher the statistical power.

Measurement error is a function of your measurement accuracy (which you can’t really change) and the sample (which you can). The bigger the sample, the smaller the error, the higher the power.

The rate of false positives is how often you’ll think you have found an effect when none exists. This number is up to you, and it’s traded off against power: if you decrease the chance of confusing a false effect for a true one, you increase the chance of missing out on a true effect. For reasons of historical conspiracy, this is usually set at 5%.

The same historical conspiracy also decided that “identifying a true positive” means “rejecting the null hypothesis”. This is dumb (why do you care about some arbitrary null hypothesis and not the actual hypothesis you’re testing?) but we’ll leave it for now since (unfortunately) that’s how most research is done today.

Here’s how the actual power calculation is done in the power posing example. The hypothesized effect of power posing is a decrease in cortisol, and I found that this decrease can’t be much larger than 1 µg/dl. That’s the effect size. The measurement error of the change in cortisol (with a sample size of 20 people in each group) is 1.2 µg/dl.

To get a p-value of 0.05, the measured result needs to be at least 1.96 times the standard error. In our case the error is 1.2, so we need to measure an effect of at least 1.96 x 1.2 = 2.35 to reject the null. However, the real effect size is equal to about 1, and to get from 1 to 2.35 we need to also get really lucky. We need 1 µg/dl of real effect, and another 1.35 µg/dl of helpful noise.

This is pretty unlikely to happen, we can calculate the exact probability by just plugging the numbers into the normal distribution function in Excel:

1 – NORM.DIST(1.2*1.96,1,1.2,TRUE) = 13%.

When you run an experiment with 13% power, even if you found an effect that happens to exist you mostly measured noise. More importantly, getting a positive result in a low-powered experiment provides scant evidence that the effect is real. Let me explain why.

Let me explain why.

In an ideal world, setting a p-value threshold at .05 means that you’ll get a false positive (a positive result where no effect actually exists) 5% of the time. In the real world, that’s wildly optimistic.

First of all, even if you do the analysis flawlessly, the chance of a false positive given p<0.05 is around 30%, just due to the fact that true effects are rare. As we say in the Bayesian conspiracy: even if you’re not interested in base rates, base rates are interested in you.

But even without invoking base rates, analyses are never perfect. I demonstrated last time that if you measure enough variables, the false positive rate you can get is basically 100%. Obviously, either “weak men = socialism” or “strong men = socialism” is a false positive. I strongly suspect that both are.

Ok, let’s say that you guard against multiplicity and only test a single variable. What can go wrong? A thousand things can. Measurement mistakes, inference mistakes, coding mistakes, confounders, mismatches between the control and effect group, subjects dropping out, various choices about grouping the data, transforming the data, comparing the data, even cleaning the data.

Let’s take “cleaning the data” as the most innocuous sounding example. What cleaning the data usually means is throwing out data points that have some problem, like a subject that obviously misread the questions on a survey. Obviously. What’s the worst that can happen if you throw out a data point or two? I wrote a quick code simulation to find out.

I simulate two completely independent variables, so we should expect to get a p-value of 5% exactly 5% of the time. However, if I get to exclude just a single data point from a sample size of 30, the false positive rate goes to 14%. If I can throw out two “outliers”, it goes to 25%. You can use the code to play with different sample sizes, thresholds, and the number of outliers you’re allowed to throw out.

Bottom line: if you’re aiming for a 5% false positives, 30% is a very optimistic estimate of the actual false positive rate you’re going to get.

In the perfect world, your power is 80% and the false positive rate is 5%. This dream scenario looks like this:

The dark blue region are true positives (80% of the time), and the light blue sliver (5% of the remaining 20% are 1%) is what I call a “noise positive”: you got a positive result that matches a real effect by coincidence. Basically, your experiment missed the true effect but you p-hacked your way into it.

[Edit 6/25: “noise positive” is a bad name, I should have called it “hacked positive”]

Of course, you don’t know if the effect exists or not ahead of time, you have to infer it from the result of the experiment.

If all you know is that you got a result, you know you’re somewhere in the colored regions. If you combine them (the horizontal bar), you see that “getting a result” is mostly made up of dark blue, which means “true results for an effect that exists”. The ratio of blue to red in the combined bar is a visual representation of the Bayesian likelihood ratio, which is a measure of how much support the data gives to the hypothesis that the measured effect exists versus the hypothesis that it doesn’t.

So that was the dream world. In the power posing world, the power is 13% and the false positive rate is (optimistically) 30%. Here’s how that looks like:

In this world, you can’t infer that the effect exists from the data: the red area is almost as big as the blue one. All you know is that most of what you measured is measurement error, aka noise.

I think some confusion stems from how statistical power is taught. For a given experimental setup, there is a trade-off of false positives (also called type I error) versus false negatives (type II, and the thing that statistical power directly refers to). But having lower statistical power makes both things worse and higher power makes both things better.

You increase power by having a large sample size combined with repeated and precise measurements. You get low power by having a sample size of 20 college students and measuring a wildly fluctuating hormone from a single spit tube. With a power of 13%, both the false positive rate and the false negative rate are going to be terrible no matter how you trade them off against each other.

## Part 2 – Is psychology a science?

Part 1 was a careful explanation of a subject I know very well. Part 2 is a controversial rant about a subject I don’t. Take it for what it is, and please prove me wrong: I’ll be deeply grateful if you do.

What I didn’t mention about the confused young lady is that she’s a graduate student in psychology. From a top university. And she came to give a presentation about this excellent paper from 1990 about how everyone in psychology is doing bad statistics and ignoring statistical power.

You should read the Jacob Cohen paper, by the way. He’s everything I aspire to be when writing about research statistics: clear arguments combined with hilarious snark.

The atmosphere that characterizes statistics as applied in the social and biomedical sciences is that of a secular religion, apparently of Judeo—Christian derivation, as it employs as its most powerful icon a six-pointed cross [the asterisk * denotes p<0.05], often presented multiply for enhanced authority [** denotes p<0.01]. I confess that I am an agnostic.

I know that the psychology studies that show up on my Facebook feed or on Andrew Gelman’s blog are preselected for bad methodology. I thought that while bad statistics proliferated in psychology for decades (despite warnings by visionaries like Cohen), this ended in 2015 when the replicability crisis showed that barely 36% of finding in top psych journals replicate. Some entrenched psychologists who built their careers on p-hacked false findings may gripe about “methodological terrorists“, but surely the young generation will wise up to the need for proper methodology, the least of which is doing a power calculation.

And yet, here I was talking to a psychology student who came to give a presentation about statistical power and had the following, utterly stupefying, exchange:

Me: So, how did you calculate statistical power in your own research?

Her: What do you mean ‘how we calculated it’?

Me: The statistical power of your experiment, how did you arrive at it?

Her: Oh, our research was inspired by a previous paper. They said their power was 80%, so we said the same.

This person came to present a paper detailing the ways in which psychologists fuck up their methodology, and yet was utterly uninterested in actually doing the methodology. When I gave her the quick explanation of how power works I wrote down in part 1 of this post, she seemed indifferent, as if all this talk of statistics and inference was completely irrelevant to her work as a psychology researcher.

This is mind boggling to me. I don’t have any good explanation for what actually went on in her head. But I have an explanation, and if it’s true it’s terrifying.

As centuries of alchemy, demonology and luminiferious aetherology have shown, the fact that smart people study a subject does not imply that the subject actually exists. I’m beginning to suspect that a whole lot of psychologist are studying effects that are purely imaginary.

Embodied cognition. Ego-depletion. Stereotype threat. Priming. Grit and growth mindset as malleable traits. Correlating political stances with 3 stages of ovulation, 5 personality traits, 6 body types, and 34 flavors of ice cream.

And we didn’t even mention fucking parapsychology.

By the way, when was the last time you saw a psychology article publicized that wasn’t in those areas? Yes, of course this young lady was pursuing her research in one them.

Perhaps these fields aren’t totally null, but their effects are too small or unstable to be studied given current constraints (e.g. sample sizes less than 10,000). Even if half those fields contain interesting and real phenomena, that leaves a lot of fields that don’t.

If you’re studying a nonexistent effect, there’s no point whatsoever in improving your methodology. It’s only going to hurt you! If you’re studying a nonexistent field, you write “our power is 80%” on the grant application because that’s how you get funded, not because you know what it means. Your brain will protect itself from actually finding out what a power calculation actually entails because it senses bad things lying in wait if you go there.

There are ways to study subjects with no physical reality, but the scientific method isn’t one of those ways. Example: you have the science of glaciology which involves drilling into glaciers, and you have the critical theory of feminist glaciology which involves doing something completely different:

Structures of power and domination also stimulated the first large-scale ice core drilling projects – these archetypal masculinist projects to literally penetrate glaciers and extract for measurement and exploitation the ice in Greenland and Antarctica.

Glaciologists and feminist glaciologists are not colleagues. They don’t sit in the same university departments, speak at the same conferences, or publish in the same journals. Nobody thinks that they’re doing the same job.

But people think that bad psychologists and good psychologists are doing the same job. And this drives the good ones away.

Imagine a psychology grad student with good intentions and a rudimentary understanding of statistics wandering into one of the above-mentioned research areas. She will quickly grow discouraged by getting null results and disillusioned by her unscrupulous colleagues churning out bullshit papers like “Why petting a donkey makes you vote Democrat, but only if your phone number ends in 7, p=0.049”.

This student isn’t going to stick around in academia very long. She’s going to take a quick data science boot camp and go work for Facebook or Amazon. There, she will finally get to work with smart colleagues doing rigorous psychological research using sophisticated statistical methodology.

But instead of probing the secrets of the mind for the benefit of humanity, she’ll be researching which font size makes people 0.2% more likely to retweet a meme.

And the academic discipline of psychology will drift one step further away from science and one step closer to the ugly void of postmodernism.

I’m not writing this out of disdain for psychology. I’m writing this out of love, and fear. I have been reading books on psychology and volunteering as a subject for experiments since I was a teenager. I considered majoring in psychology in undergrad, until the army told me I’m going to study physics instead. I occasionally drink with psychologists and tell them about p-hacking.

I’m scared that there’s a battle going on for the heart of psychology right now, between those who want psychology to be a science and those for whom that would be inconvenient.

And I’m scared that the good guys aren’t winning.

## 8 thoughts on “The Battle for Psychology”

1. The more I read here the happier I am that I don’t expect to be needing complicated statistics in my work anytime soon 🙂

This problem you described (I wanted to write this on the socialist strongmen post) is usually overcome in other fields by competitive peer-review – competitors who have enough incentive to go deep enough into your data, and explain that your data doesn’t say what you say it says. This doesn’t happen in psychology? Maybe journal editors who don’t consider such a smackdown as a reason to turn down a paper? Maybe out of fear of missing out on a sensationlist story and losing readership.

Also, out of a feeling of historic collegiality, I’d like to dispute your addition of alchemists to that list, having learned a lot about them in preparing my lecture for last Olamot. They were a pretty awesome bunch.

Like

1. Does statistical power still seem so complicated? It means I still didn’t do a good enough job of explaining it.

Here’s the problem: you can arrange sciences according to the level on which they model reality. This goes something like physics -> chemistry -> biology -> psychology -> sociology. And the better someone is at math, the more likely they are to self-select into the more “fundamental” sciences.

But in some sense, it should really be the opposite: modeling reality on the level of minds or societies is so complex that you need to be very good at math to make progress without fooling yourself. And yet it seems like people go into psychology because they don’t want to do math, not in spite of it.

Like

2. Nathan Nguyen says:

Out of curiosity, what are the main problems with the five factors of personality? I’m unfamiliar with the area. Can you point to what you think is the best criticism of the model?

Like

1. Wikipedia covers the main critiques rather well, but I don’t actually have a problem with it per se. It is what it is: personality is made up of countless components, so you do a principal component analysis on those and then pick a number of factors to use. Briggs-Myers picked 4, OCEAN picked 5, you could use 30 if you want. I guess 5 is as good as any other number.

The source of bullshit is the recent trend of regressing political affiliation on everything under the sun, from ovulation to fat arms. You can get any p-value you want if you try enough variables. Whatever you find will imply that some political group deserves higher status than another, which is a cheap way to get your article publicized and hyped in newspapers and also a great way to destroy the bonds that hold civilization together.

Like

3. I didn’t understand the statistical power section until I clicked the 1.96 link. Perhaps add the sentence that it’s the 97.5% point on a normal distribution?

Just to clarify, in order to get a statistical power, you need to estimate the effect size beforehand, right?

Re: part 2, let alone psychology, my freshman genetics class last year had an insanity-inducing lecture on stats which involved using a chart to look up some alien kind of p-value, which denoted accepting the trial (not null) hypothesis when it was between 0.05 and 0.95
Plus some chi-square tests, where we pulled degrees of freedom out of nowhere. It wasn’t on the exam, so I decided not to pollute my brain.

But as you say in the comment, this stuff is pretty hard!
I’m reading MacKay’s Information Theory, Inference, and Learning Algorithms (which is wonderful, though I’m sure you’ve come across it), and doing hypothesis testing properly IMO requires a significant amount of mathematical maturity.
I don’t know how we get around that.

Like

4. Stefano says:

Is it correct to calculate the standard deviation error in the cortisol measurement including the first term? In the original paper (CC&Y) it’s not clear how they analyzed the cortisol results: did they averaged the cortisol level at time 1 and the level at time 2? Or did they just averaged the difference of the samples?
Because in that second case, you wouldn’t need to account for the variation per person, you just need the average variation (your 2.5) and the inter-assay CV (which you put around 2.8, definitely not something you can left unspecified).

Like