Multiplicitous

Protect yourself from p-hacking with precision, whether you’re doing drugs or gambling.

P-Vices

Of all the terrible vices known to man, I guess NFL gambling and alternative medicine aren’t very terrible. Making medicine that doesn’t work (if it worked, it wouldn’t be alternative) is also a tough way to make money. But if you’re able to squeeze a p-value of 0.05 for acupuncture out of a trial that clearly shows that acupuncture has zero effect, you can make money and get a PhD in medicine!

It’s also hard to make money off of gambling on the NFL. However, you can make money by selling NFL gambling advice. For example, before the Eagles played as 6-point underdogs on the turf in Seattle after a 208-yard rushing game, gambling guru Vince Akins declared:

The Eagles are 10-0 against the spread since Dec 18, 2005 as an underdog and on turf after they’ve had more than 150 yards rushing last game.

10-0! Betting against the spread is a 50-50 proposition, so 10-0 has a p-value of 1/(2^10) = 1/1024 ≈ 0.001. That’s enough statistical significance not just to bet your house on the Eagles, but also to get a PhD in social psychology.

The easiest way to generate the p-value of your heart’s desire is to test multiple hypotheses and only report the one with the best p-value. When it happens accidentally to honest and well-meaning researchers, this is a serious enough problem to invalidate whole fields of research. But unscrupulous swindlers do it on purpose and get away with it, because their audience suffers from two cognitive biases:

  1. Conjunction fallacy.
  2. Sucking at statistics.

No more! In this new installment of defense against the dark arts we will learn to quickly analyze multiplicity, notice conjunctions, and bet against the Eagles.


Hacking in Depth

[This part gets stats-heavy enough to earn this post the Math Class tag. If you want to skip the textbooky bits and get down to gambling tips, scroll down to “Reading the Fish”]

The easiest way to generate multiple hypotheses out of a single data set (that didn’t show what you wanted it to show) is to break the data into subgroups. You can break the population into many groups at once (Green Jelly Bean method), or in consecutive stages (Elderly Hispanic Woman method).

The latter method works like this: let’s say that you have a group of people (for example, Tajiks) who suffer from a medical condition (for example, descolada). Normally, exactly one half of sick people recover. You invented a miracle drug that takes that number all the way up to… 50%. That’s not good enough even for the British Journal of General Practice.

But then you notice that of the men who took the drug 49% recovered, and of the women, 51% did. And if you only look at women above age 60, by pure chance, that number is 55%. And maybe 13 of these older Tajik women, because Tajikistan didn’t build a wall, happened to be Hispanic. And of those, by accident of random distribution, 10 happened to recover. Hey, 10 out of 13 is a 77% success rate, and more importantly it gives a p-value of… 0.046! Eureka, your medical career is saved with the publication of “Miracle Drug Cures Descolada in Elderly Hispanic Women” and you get a book deal.
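If you want to check that 0.046 yourself, it’s just a one-sided binomial tail probability. Here’s a minimal sketch in Python, assuming the 50% null recovery rate from the descolada story:

from math import comb

def binom_tail(successes, n, p=0.5):
    # P(X >= successes) for X ~ Binomial(n, p): the one-sided p-value
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(successes, n + 1))

print(binom_tail(10, 13))   # ~0.046 for 10 of 13 elderly Hispanic women
print(binom_tail(61, 90))   # ~0.0005 for the 61-of-90 example a few paragraphs below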

Hopefully my readers’ noses are sharp enough not to fall for 10/13 elderly Hispanic women. Your first guide should be the following simple rule:

Rule of small sample sizes – If the sample size is too small, any result is almost certainly just mining noise.

Corollary – If the sample size of a study is outnumbered 100:1 by the number of people who died because of that study,  it’s probably not a great study.

But what if the drug did nothing detectable for most of the population, but cured 61 of 90 Hispanic women of all ages? That’s more than two thirds, the sample size of 90 isn’t tiny, and it comes out to a p-value of 0.0005. Is that good enough?

Let’s do the math.

First, some theory. P-values are generally a terrible tool. Testing with a p-value threshold of 0.05 should mean that you accept a false result by accident only 5% of the time, yet even in theory using a 5% p-value makes a fool of you over 30% of the time.  P-values do one cool thing, however: they transform any distribution into a uniform distribution. For example, most samples from a normal distribution will lie close to the mean, but their p-values will be spread evenly (uniformly) across the range between 0 and 1.

[Figure: p-values of samples from a normal distribution spread uniformly between 0 and 1]

Uniform distributions are easy to deal with. For example, if we take N samples from a uniform distribution and arrange them by order, they will fall on average on 1/(N+1), 2/(N+1), …, N/(N+1). If you test four hypotheses (e.g. that four kinds of jelly beans cause acne) and their p-values fall roughly on 1/(4+1) = 0.2, 0.4, 0.6, 0.8 you know that they’re all indistinguishable from the null hypothesis as a group.
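Both properties – p-values of true-null tests being uniform, and the sorted p-values landing near k/(N+1) – are easy to see in a quick simulation. A sketch in Python; the one-sample t-test here is just an arbitrary stand-in for whatever test you like:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
trials, N = 10_000, 4

# N true-null hypotheses per trial: samples of 20 from N(0,1), tested against mean 0
samples = rng.normal(0, 1, size=(trials, N, 20))
pvals = stats.ttest_1samp(samples, popmean=0, axis=2).pvalue

print(np.sort(pvals, axis=1).mean(axis=0))  # close to [0.2, 0.4, 0.6, 0.8] = k/(N+1)
print((pvals < 0.05).mean())                # close to 0.05: null p-values are uniform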

Usually you would only see the p-value of the best hypothesis reported, i.e. “Wonder drug cures descolada in elderly Hispanic women with p=0.0005”. The first step towards sanity is to apply the Bonferroni rule:

Bonferroni Rule – A p-value of α for a single hypothesis is worth about as much as a p-value of α/N for the best of N hypotheses.

The Bonferroni correction is usually given as an upper bound, namely that if you use an α/N p-value threshold for N hypotheses you will accept a null hypothesis as true no more often than if you use an α threshold for a single hypothesis. It actually works well as an approximation too, allowing us to replace no more often with about as often. I haven’t seen this math spelled out in the first 5 google hits, so I’ll have to do it myself.

h1,…,hN are N p-values for N independent tests of null hypotheses. h1,…,hN are all uniformly distributed between 0 and 1.

The chance that one of the N p-values falls below α/N = P(min(h1,…,hN) < α/N) = 1 – (1 – α/N)^N ≈ 1 – e^(–α) ≈ 1 – (1 – α) = α = the chance that a single p-value falls below α. The last bit of math there depends on the linear approximation e^x ≈ 1 + x when x is close to 0.
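A quick numerical check of that approximation, simulating the best of N uniform null p-values (nothing assumed here beyond the uniformity described above):

import numpy as np

rng = np.random.default_rng(0)
alpha, trials = 0.05, 100_000

for N in (2, 5, 20, 60):
    h = rng.uniform(size=(trials, N))               # N independent null p-values
    simulated = (h.min(axis=1) < alpha / N).mean()  # chance the best one clears alpha/N
    exact = 1 - (1 - alpha / N) ** N
    print(N, round(simulated, 4), round(exact, 4))  # both hover just under alpha = 0.05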

The Bonferroni Rule applies directly when the tests are independent, but that is not the case with the Elderly Hispanic Woman method. The “cure” rate of white men is correlated positively with the rate for all white people (a broader category) and with young white men (a narrower subgroup). Is the rule still good for EHW hacking? I programmed my own simulated unscrupulous researcher to find out; here’s the code on GitHub.

My simulation included Tajiks of three age groups (young, adult, old), two genders (the Tajiks are lagging behind on respecting genderqueers) and four races (but they’re great with racial diversity). Each of the 2*3*4=24 subgroups has 500 Tajiks in it, for a total population of 12,000. Descolada has a 50% mortality rate, so 6,000 / 12,000 cured is the average “null” result we would expect if the drug didn’t work at all. For the entire population, we would get p=0.05 if only 90 extra people were cured (6,090/12,000) for a success rate of 50.75%. A meager success rate of 51.5% takes the p-value all the way down to 0.0005.
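That 90-extra-cures figure (and the 180 behind the 51.5%) comes straight from the normal approximation to the binomial; a back-of-the-envelope sketch of the arithmetic (one-sided test on 12,000 patients):

from math import sqrt
from statistics import NormalDist

n = 12_000
sd = sqrt(n * 0.5 * 0.5)   # ~55 cures: standard deviation of the total under the null

def extra_cures_needed(p):
    # cures above the expected 6,000 needed for a one-sided p-value of p
    return NormalDist().inv_cdf(1 - p) * sd

print(extra_cures_needed(0.05))     # ~90  -> 6,090 / 12,000 = 50.75%
print(extra_cures_needed(0.0005))   # ~180 -> 6,180 / 12,000 = 51.5%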

Rule of large sample sizes – With a large sample size, statistical significance doesn’t equal actual significance. With a large enough sample you can get tiny p-values with minuscule effect sizes.

Corollary – If p-values are useless for small sample sizes, and they’re useless for large sample sizes, maybe WE SHOULD STOP USING FUCKING P-VALUES FOR HYPOTHESIS TESTING. Just compare the measured effect size to the predicted effect size, and use Bayes’ rule to update the likelihood of your prediction being correct.

P-values aren’t very useful in getting close to the truth, but they’re everywhere, they’re easy to work with and they’re moderately useful for getting away from bullshit. Since the latter is our goal in this essay we’ll stick with looking at p-values for now.

Back to Tajikistan. I simulated the entire population 1,000 times for each of three drugs: a useless one with a 50% success rate (null drug), a statistically significant one with 50.75% (good drug) and a doubly significant drug with 51.5% (awesome drug). Yes, our standards for awesomeness in medicine aren’t very high. I looked at the p-value of each possible sub-group of the population to pick the best one, that’s the p-hacking part.

Below is a sample of the output:

13 hispanic 1    0.122530416511473
14 female hispanic 2    0.180797304026783
15 young hispanic 2    0.25172233581543
16 young female hispanic 3    0.171875
17 white 1    0.0462304905364621
18 female white 2    0.572232224047184
19 young white 2    0.25172233581543
20 young female white 3   0.9453125
21 adult 1    0.368777154492162
22 female adult 2    0.785204746078306
23 asian 1    0.953769509463538
24 female asian 2    0.819202695973217

The second integer is the number of categories applied to the sub-group (so “asian” = 1, “asian adult female” = 3). It’s the “depth” of the hacking. In our case there are 60 groups to choose from: 1 with depth 0 (the entire population), 9 with depth 1, 26 with depth 2, 24 with depth 3. Since we’re testing the 60 groups as 60 separate hypotheses, by the Bonferroni Rule the 0.05 p-value should be replaced with 0.05 / 60 categories = 0.00083.
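The subgroup count is easy to reproduce by enumerating every combination of the three traits (an empty slot means that trait isn’t used to narrow the group). A sketch in Python – I’m guessing the fourth race label, only the counts matter:

from itertools import product

ages = [None, "young", "adult", "old"]
genders = [None, "male", "female"]
races = [None, "white", "black", "hispanic", "asian"]

groups = [g for g in product(ages, genders, races) if any(g)]   # drop the all-None combination
depths = [sum(trait is not None for trait in g) for g in groups]

print(len(groups) + 1)                       # 60, counting the whole population as depth 0
print([depths.count(d) for d in (1, 2, 3)])  # [9, 26, 24]
print(0.05 / 60)                             # ~0.00083, the Bonferroni-corrected threshold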

In each of the 1,000 simulations, I picked the group with the smallest p-value and plotted it along with the “hacking depth” that achieved it. The vertical lines are at p=0.05 and p=0.00083. The horizontal axis shows the hacked p-value on a log scale and the vertical axis shows how many of the 1,000 simulations landed below it:

[Figure: best hacked p-values for the null, good, and awesome drugs, colored by hacking depth]

For the null drug, p-hacking achieves a “publishable” p-value of 0.05 far more often than the nominal 5% of the time.

If your goal is to do actual science (as opposed to getting published in Science), you want to be comparing the evidence for competing hypotheses, not just looking at whether the null hypothesis is rejected. The null hypothesis is that we have a 50% null drug, and the competing hypotheses are the good and awesome drugs at 50.75% and 51.5% success rates, respectively.

Without p-hacking, the null drug will hit below p=0.05 5% of the time (duh), the good drug will get there 50% of the time, and the awesome drug 95% of the time. To a Bayesian, getting a p-value below 0.05 is a very strong signal that we have a useful drug on our hands: 50%:5% = 10:1 likelihood ratio that it’s the good drug and 95%:5% = 19:1 that it’s the awesome drug. If ahead of time we thought that each of the cases is equally likely (a 1:1:1 ratio), our odds would now be 1:10:19. This means that the probability that the drug is the null one went from 1/3 to 1/(1+10+19) = 1/30. The null drug is 10 times less likely.

If you’re utterly confused by the preceding paragraph, you can read up on Bayes’ rule and likelihood ratio on Arbital, or just trust me that without p-hacking, getting a p-value below 0.05 is a strong signal that the drug is useful.

With p-hacking, however, the good and the awesome drugs don’t do so well. We’re now looking at how often each drug falls below the Bonferroni Rule line of p=0.00083. Instead of 50% and 95% of the time, the good and awesome drugs get there 23% and 72% of the time. If we started from 1:1:1 odds, the new odds are roughly 1:5:15, and the probability that the drug is the null one is 1/21 instead of 1/30. The null drug is only 7 times less likely.
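The odds arithmetic in the last two paragraphs is just multiplying the 1:1:1 prior by how often each drug clears the relevant threshold and normalizing. A minimal sketch, using 5% as the null drug’s hit rate in both cases (roughly what the Bonferroni correction is designed to guarantee):

def posterior(prior_odds, hit_rates):
    # multiply prior odds by each drug's chance of clearing the threshold, then normalize
    odds = [p * r for p, r in zip(prior_odds, hit_rates)]
    return [o / sum(odds) for o in odds]

# no hacking, threshold p = 0.05:        null 5%, good 50%, awesome 95%
print(posterior([1, 1, 1], [0.05, 0.50, 0.95]))   # null drug probability = 1/30
# hacking, corrected threshold 0.00083:  null ~5%, good 23%, awesome 72%
print(posterior([1, 1, 1], [0.05, 0.23, 0.72]))   # null drug probability ~1/20, i.e. roughly the 1/21 above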

Rule of hacking skepticism – Even though a frequentist is happy with a corrected p-value, a Bayesian knows better. P-hacking helps a bad (null) drug more than it does a good one (which is significant without hacking). Thus, hitting even a corrected p-value threshold is weaker evidence against the null hypothesis.

You can see it in the chart above: for every given p-value (vertical line) the better drugs have more green points in its vicinity (indicating less depth of hacking) and the bad drug has more red because it has to dig down to a narrow subgroup to luck into significance.

Just for fun, I ran another simulation in which instead of holding the success probability for each patient constant, I fixed the total proportion of cures for each drug. So in the rightmost line (null drug) exactly 6,000 of the 12,000 were cured, for the good drug exactly 6,090 and for the awesome drug 6,180.

[Figure: best hacked p-values when the total number of cures is fixed for each drug]

We can see more separation in this case – since the awesome drug is at p=0.0005 for the entire group, hacking cannot make it any worse (that’s where the tall green line is). Because for each drug the total cures are fixed, if one subgroup has more successes the others by necessity have fewer. This mitigates the effects of p-hacking somewhat, but the null drug still gets to very low p-values some of the time.


Reading the Fish

So what does this all mean? Let’s use the rules we came up with to create a quick manual for interpreting fishy statistics.

  1. Check the power – If the result is based on a tiny sample size (especially with a noisy measure), disregard it and send an angry email to the author.
  2. Count the categories – If the result presented is for a subgroup of the total population tested (i.e. only green jelly beans, only señoritas) you should count N – the total number of subgroups that could have been reported. Jelly beans come in 50 flavors, gender/age/race combine in 60 subgroups, etc.
  3. Apply the correction – Divide the original threshold p-value by the N you calculated above. If the reported p-value falls below that corrected threshold, the result is statistically significant.
  4. Stay skeptical – Remember that a p-hacked result isn’t as good of a signal even with correction, and that statistical significance doesn’t imply actual significance. Even an infinitesimal p-value doesn’t imply with certainty that the result is meaningful, per the Rule of Psi.

Rule of Psi – A study of parapsychological ability to predict the future produced a p-value of 0.00000000012. That number is only meaningful if you have absolute confidence that the study was perfect, otherwise you need to consider your confidence outside the result itself. If you think that for example there’s an ε chance that the result is completely fake, that ε is roughly the floor on p-values you should consider.

For example, if I think that at least 1 in 1,000 psychology studies have a fatal experimental flaw or are just completely fabricated, I would give any p-value below 1/1,000 about as much weight as 1/1,000. So there’s a 1.2*10^-10 chance that the parapsychology meta-analysis mentioned above was perfect and got a false positive result by chance, but at least 1 in 1,000 chance that one of the studies in it was bogus enough to make the result invalid.

Let’s apply our manual to the Eagles game:

The Eagles are 10-0 against the spread since Dec 18, 2005 as an underdog and on turf after they’ve had more than 150 yards rushing last game.

First of all, if someone tells you about a 10-0 streak you can assume that the actual streak is 10-1. If the Eagles had won the 11th game going back, the author would have surely said that the streak was 11-0!

The sample size of 11 is really small, but on the other hand in this specific case the error of the measure is 0 – we know perfectly well if the Eagles won or lost against the spread. This doesn’t happen in real life research, but when the error is 0 the experimental power  is perfect and a small sample size doesn’t bother us.

What bothers us is the insane number of possible variables the author could have mentioned. Instead of [Eagles / underdog / turf / after 150 yards rushing], the game could be described as [Seattle / at home / after road win / against team below average in passing] or [NFC East team / traveling west / against a team with a winning record] or [Team coming off home win / afternoon time slot / clear weather / opponent ranked top-5 in defense]. It’s hard to even count the possibilities, but we can try putting them in separate bins and multiplying:

  1. Descriptors of home team – division, geography, record, result of previous game, passing/rushing/turnovers in previous game, any of the many season stats and rankings – at least 20
  2. Descriptors of road team – at least 20
  3. Game circumstances – weather, time slot, week of season, field condition, spread, travel, matchup history etc. – at least 10.

Even if you pick just 1 descriptor in each category, this allows you to “test” more than 20*20*10 = 4,000 hypotheses. What does it mean for the Eagles? A 10-1 streak has a p-value of 0.0107, about 1/93. But we had 4,000 potential hypotheses! 1/93 is terrible compared to the p=1/4,000 we would have expected to see by chance alone.

Of course, this means that the gambling guru didn’t even bother testing all the options, he did just enough fishing to get a publishable number and published it. But with so much potential hacking, 10-1 is as much evidence against the Eagles as it is in their favor. The Eagles, a 6 point underdog, got their asses kicked by 11 points in Seattle.

[Image: Advertisement poster for Trinity College]

You can apply the category-counting method whenever the data you’re seeing seems a bit too selected. The ad above tells you that Trinity is highly ranked in “faculties combining research and instruction”. This narrow phrasing should immediately make you think of the dozens of other specialized categories in which Trinity College isn’t ranked anywhere near the top. A college ranked #1 overall  is a great college. A college ranked #1 in an overly specific category within an overly specific region is great at fishing.


To No And

It’s bad enough when people don’t notice that they’re being bamboozled by hacking, but it’s even worse when digging deep to dredge up a narrow (and meaningless) result sounds more persuasive than a general one. Here’s an absolutely delightful example, from Bill Simmons’ NFL gambling podcast:

Joe House: I’m taking Denver. There’s one angle that I like. I like an angle.

Bill: I’m waiting.

Joe House: Defending Super Bowl champions, like the Denver Broncos, 24-2 against the spread since 1981 if they are on the road after a loss and matched up against a team that the previous week won both straight up and against the spread and the Super Bowl champion is not getting 7 or more points. This all applies to the Denver Broncos here, a wonderful nugget from my friend Big Al McMordie.

Bill: *sigh* Oh my God.

I’ve lost count of how many categories it took House to get to a 24-2 (clearly 24-3 in reality) statistic. What’s impressive is how House sounds more excited with each “and” he adds to the description of the matchup. To me, each new category decreases the likelihood that the result is meaningful by multiplying the number of prior possibilities. To House, it seems like Denver fitting in such an overly specific description is a coincidence that further reinforces the result.

This is called the conjunction fallacy. I’ll let Eliezer explain:

The conjunction fallacy is when humans rate the probability P(A&B) higher than the probability P(B), even though it is a theorem that P(A&B) <= P(B).  For example, in one experiment in 1981, 68% of the subjects ranked it more likely that "Reagan will provide federal support for unwed mothers and cut federal support to local governments" than that "Reagan will provide federal support for unwed mothers." […]

Which is to say:  Adding detail can make a scenario SOUND MORE PLAUSIBLE, even though the event necessarily BECOMES LESS PROBABLE. […]

In the 1982 experiment where professional forecasters assigned systematically higher probabilities to “Russia invades Poland, followed by suspension of diplomatic relations between USA and USSR” versus “Suspension of diplomatic relations between USA and USSR”, each experimental group was only presented with one proposition. […]

What could the forecasters have done to avoid the conjunction fallacy, without seeing the direct comparison, or even knowing that anyone was going to test them on the conjunction fallacy?  It seems to me, that they would need to notice the word “and”.  They would need to be wary of it – not just wary, but leap back from it.  Even without knowing that researchers were afterward going to test them on the conjunction fallacy particularly.  They would need to notice the conjunction of two entire details, and be shocked by the audacity of anyone asking them to endorse such an insanely complicated prediction.

Is someone selling you a drug that works only when the patient is old and a woman and Hispanic? A football team that is an underdog and on turf and good at rushing? One “and” is a warning sign, two “ands” is a billboard spelling BULLSHIT in flashing red lights. How about 11 “ands”?

11 “ands” is a level of bullshit that can only be found in one stinky stall of the washrooms of science, the gift that keeps on giving, the old faithful: power posing. After power posing decisively failed to replicate in an experiment with 5 times the original sample size, the authors of the original study listed 11 ways in which the experimental setup of the replication differed from the original (Table 2 here). These differences include: time in poses (6 minutes in replication vs. 2 minutes in the original), country (Switzerland vs. US), filler task (word vs. faces) and 8 more. The authors claim that any one of the differences could account for the failure of replication.

What’s wrong with this argument? Let’s consider what it would mean for the argument to be true. If it’s true that any of the 11 changes to the original setup could destroy the power posing effect, it means that the power posing effect only exists in that very specific setup. I.e. power posing only works when the pose is held for 2 minutes and only for Americans and only after a verbal task and 8 more ands. If power posing requires so many conjunctions, it was less probable to start with than the chance of Amy Cuddy admitting that power posing isn’t real.

The first rule of improv comedy is “Yes, and…” The first rule of statistical skepticism is “And…,no.”

Rule of And…, no – When someone says “and”, you say “no”.

Year 1 Redux – Poseur

Amy Cuddy won’t let power posing go, so neither can I.

I try to maintain equanimity regarding most bitter conflicts raging in the world, but I do get quite worked up regarding proper statistical methodology in psychology research. Hey, we all need our hobbies. When I wrote a post about it, I tried to focus on constructive advice on how to do science better (calculation of experimental power), but I couldn’t resist taking some shots at scientists who neglected to do that.

In particular, I criticized Dana Carney, Amy Cuddy and Andy Yap for publishing the infamous power pose paper, a useless experiment that had 13% statistical power. That is, the experiment had a 13% chance to detect the effect had one existed. If it turns out that the effect doesn’t exist, the experiment was 100% worthless.

The paper is called “Power posing: brief nonverbal displays affect neuroendocrine levels and risk tolerance” so it actually looked at three effects: two neuroendocrinal (cortisol and testosterone) and a behavioral risk tolerance effect. Even a blind person may hit an occasional bird when shooting three arrows, but CC&Y were not in luck: none of the three effects turned out to exist. That wasn’t unexpected: holding a strange pose for a minute will not affect most things in your life.

Last month, it seemed that this silly controversy had been decisively resolved in favor of truth and reason when lead author Dana Carney posted this on her academic website (emphasis in original):

Since early 2015 the evidence has been mounting suggesting there is unlikely any embodied effect of nonverbal expansiveness (vs. contractiveness) – i.e., “power poses” – on internal or psychological outcomes.

As evidence has come in over these past 2+ years, my views have updated to reflect the evidence. As such, I do not believe that “power pose” effects are real.

Any work done in my lab on the embodied effects of power poses was conducted long ago (while still at Columbia University from 2008-2011) – well before my views updated. And so while it may seem I continue to study the phenomenon, those papers (emerging in 2014 and 2015) were already published or were on the cusp of publication as the evidence against power poses began to convince me that power poses weren’t real. My lab is conducting no research on the embodied effects of power poses.

To drive the point home, Carney lists 10 methodological errors and 3 confounders of the original study, declares that power posing is a dead-end for research, and moves on.

The third author, Andy Yap, switched academic fields and continents to study organizational behavior at an elite French business school.  He probably tells people that the power posing thing was this other Andy Yap in psychology. Or the other other Andy Yap, the one showing off his killer pecs on Instagram.

Science advances one funeral at a time. – Max Planck

more funerals => science advances => better weapons => more funerals – Steven Kaas

Popular science advances one mass funeral at a time. – Me

In between funerals, science advances when scientists say “oops” and stalls when they don’t.


Arguing in favor of Cuddy saying “oops” are: the invalid design of the original experiment, 6 years of contradicting data, and the acknowledgment of both by the study’s lead author.

Arguing against Cuddy saying “oops”: her book ($28 on Amazon), and her speaking fees (a lot more than $28).

Which side are you betting on?

I was surprised by a recent statement that the power pose effect is “not real” and I want to set the record straight on where the science stands.

That’s exactly where the science stands: the power pose effect is not real. Not “not real”, just not real.

There are scores of studies examining feedback effects of adopting expansive posture (colloquially known as “power posing”) on various outcomes.

“Various outcomes” sounds like a lot of other people are shooting arrows in the air. Who knows, maybe some will hit. If I was a researcher, the first thing I would test is the effect of prolonged power posing on back pain.

The key finding, the one that I would call “the power posing effect,” is simple: adopting expansive postures causes people to feel more powerful. The other outcomes (behavior, physiology, etc.) are secondary to the key effect.

Bullshit. That’s a lie of Trump-level brazenness, it’s contradicted by the very title of the paper Cuddy talks about, the one with the “neuroendocrinal” and the “risk taking”. According to Carney, the primary variable of interest was risk-taking behavior, followed by testosterone and cortisol. “Feeling powerful” is a side effect, listed at the end, with no accompanying chart, almost as an afterthought.

There’s a reason “feeling powerful” was an afterthought: even if it exists, it has little scientific value. It’s a self-reported measure that doesn’t necessarily manifest itself in any behavioral changes, such as actually being more powerful. If it did, we would study those behaviors directly. Being entirely subjective, “feeling powerful” is highly prone to experimenter bias. Experimenter bias is the wonderful effect that lets a scientist detect supernatural ESP, if and only if the scientist himself believed in ESP to start with. Experimenter bias is something that Carney and Cuddy themselves admitted was an issue in the original design.

However, in this case the self-reported feeling wasn’t actually the result of accidental experimenter bias. It was the result of purposeful experimenter manipulation:

The self-report DV was p-hacked in that many different power questions were asked and those chosen were the ones that “worked.”

Carney should be commended for being so forthright; it takes courage to admit such a thing about your own work with no sugar-coating. She must have stood in one hell of a power pose before writing this.

Back to Cuddy:

I also cannot contest the first author’s recollections of how the data were collected and analyzed, as she led both.

Sniping at Carney doesn’t make Cuddy right, it just shows the contrast between a scientist and a charlatan.

By today’s improved methodological standards, the studies in that paper — which was peer-reviewed — were “underpowered,” meaning that they should have included more participants.

That’s not what “underpowered” means. In the power posing case, it means “useless”. It’s not that the original experiment discovered some truth  and better methodology would discover more. The original experiment abused pure random noise until it got a (miscalculated) p-value of 5%, and that was enough for the “peer reviewers”. Some of these same peer reviewers and psychology journal editors are calling people who insist on using correct statistical analysis “methodological terrorists”  and sabotaging their academic careers.

Open science must be inclusive.

In one word: no. In two: fuck no. Science doesn’t need to be inclusive, or egalitarian, or warm and fuzzy. It needs to be correct. And in order to be correct, science must reject theories that have proven to be bullshit, like phlogiston and elan vital and power posing.

Finally, I am concerned that the tenor of discussions like the one that has been unfolding on power posing, and the tendency to discount an entire area of research on the basis of necessary corrections or differences between scientists’ assessments, may have a chilling effect on science.

The reason people discount power posing research is that the first 50 Google hits on “power posing” are about Amy Cuddy’s article. If she shut up about that one pathetic experiment and let the “scores of scientists” she mentions do their work, we might actually discover some truth about embodied cognition. Perhaps all the research in this area is underpowered given how much noise there is and how weak the effects are. Perhaps the only way to study the field is to run experiments with 40 coauthors and 4,000 subjects, like we do in medicine. Before we learn anything new about the field, we should discard the things we know are wrong.

“Power Posing: Brief Nonverbal Displays Affect Neuroendocrine Levels and Risk Tolerance” is the worst thing that could have happened to embodied cognition research. It contributed negative knowledge to the field. Cuddy’s refusal to let go of it for selfish reasons makes life so much harder for the psychologists trying to move the science forward.

[Image: the principal from Billy Madison]

Quality of Inequality

Measuring economic inequality and its consequences is hard, so why bother when you can just pick numbers that fit your preferred narrative?

Income inequality is exploding in the US. Or at least, talk of income inequality is. Newspapers are full of it. Presidential candidates build their platforms around it. A 700-page book about economic inequality sells at quantities usually reserved for vampire romance. The Pope tweets about it (about inequality that is, not about vampire romance, although he cares about that as well).

Has inequality really gotten so… unequal? If it has, what are the actual negative consequences? If the consequences are terrible, what can we do about them? If we can’t do anything, should we find someone to blame and yell at them? In the next two posts I’ll get to these issues of inequality, but first we need to address the issue of quality: 99% of what you read about inequality is crap.

As discussions of economic inequality became mainstream, economists have warned that inequality is incredibly hard to measure, let alone predict. In response, everyone agreed to tone their rhetoric down until we can all agree on a reasonable approach.

Just kidding! Everyone kept shouting whichever version supports their favorite political position. Both sides of the political aisle are happy to present terrible numbers and ignore any and all inconvenient facts, but the higher volume (in bulk and in decibels) of inequality discussion is coming from the economic left. If you want to get clicks, a good bet is to write a think piece about how inequality is super-terrible-worse-than-sparkling-vampires, with a chart or two sprinkled in. I’m going to mainly list the flaws in these types of articles, since I want to focus on inequality itself for now and not on the general left-right economic debate. Many of the “rising tide lifts all boats” arguments coming from the right are equally vacuous, but won’t be addressed here.

Here are several ways in which a writer can make inequality appear as horrifying (or benign) as they think their readers will like to hear:

compare wealth instead of income

What does it mean to be poor? Does it mean low wages, low income after taxes and transfers are accounted for, or simply low expenditures? You may intuitively think that being poor means having low wealth, but that doesn’t work if you measure wealth in a straightforward manner. Unlike income, many people’s wealth is negative even if their lives are affluent and secure. Think of a young lawyer at a prestigious firm with student loans from a top law school. Her wealth (loans included) could be -$150,000, which puts her well in the bottom percentile by household wealth. However, the young lawyer making six figures isn’t really suffering from economic hardship.

Adding in negative wealth numbers allows you to come up with arbitrarily stupid statistics like that the top 20% own 511 times as much wealth as the bottom 40%. At around the 37th percentile,  the cumulative sum of wealth is a tiny number close to zero. My own humble personal wealth is a trillion times higher than the bottom 37% of US households, and a negative trillion times higher than the bottom 36.99%.

I’m not a violent person and don’t have many unshakeable principles, but I will punch anyone quoting statistics that compare aggregates of positive and negative numbers.

While aggregating wealth is stupid, it seems too weird not to count it when comparing the rich and the poor. I think there’s a very intuitive way to combine wealth, income and economic power into a single useful measure. It’s what the finance world uses to value everything from loans to companies: the net present value of future earnings.

If the bottom 40% really owned nothing, the 1% could, for example, buy all the houses in the US and the 40% would end up homeless. The reason people are willing to sell a house to somebody who has no wealth is that they can count on the homeowner’s income paying for the mortgage. Every American citizen has inherent wealth in their potential to earn income, or at the very least in their potential to collect a welfare check. Of course, earning potential isn’t distributed equally and depends on things like education, social class, and actual wealth already owned. That potential wealth can be calculated, discounted at a proper rate and added to current wealth to paint a much more sensible picture of inequality. I’ve never seen anyone do this.
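For what it’s worth, the mechanics are trivial once you pick an income stream and a discount rate – the hard part is estimating those honestly. A toy sketch with made-up numbers (the $40,000 salary, 3% growth, 5% discount rate and 40 working years are all illustrative assumptions, not data):

def earnings_npv(annual_income, years, growth=0.03, discount=0.05):
    # discounted sum of a growing income stream: "human capital" the way finance values it
    return sum(annual_income * (1 + growth) ** t / (1 + discount) ** t for t in range(years))

# a hypothetical worker with zero measured wealth and a $40,000 salary
print(round(earnings_npv(40_000, 40)))   # ~1,100,000: over a million dollars of present value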

Either I’m a creative genius (quite possible) or economists have shied away from assessing the NPV of people because it’s impossible to do with any accuracy. Of course, this isn’t something that intimidates this specific blogger. We’ll put a number on Americans’ NPV in the next post of this series, but first there is more bullshit to guard against.

pick convenient end points

If you measure income growth from a peak year that came just before a recession (like 1979 or 2007) to a year just after a recession ends, the numbers look terrible for workers because the (forward looking) stock market rebounds much faster than wages. On the other hand, if you measure from 1982 to 2007 the growth looks amazing. The business cycles in the last few decades have been so volatile that you can see any trend you want in the data with a judicious choice of end points.

“forget” to adjust for household size

Almost all inequality statistics use a household as the basic measuring unit, but conveniently ignore the fact that households contain varying numbers of people.

Let’s imagine a completely made up example of a guy living for a year with his ex-girlfriend because neither of ~~us~~ them could afford a one bedroom in NYC by ~~ourselves~~ themselves. Then ~~I~~ the guy got a raise and ~~my ex~~ the girl switched to a job that doubled her salary and moved into a studio apartment and ~~dated a weird Ukrainian dude~~ lived her life as she saw fit. Both of us are now making more money (and have more sensible living arrangements), and yet the statistics would show that the average household income decreased because there are two households instead of one.

[Image: Charlie and the golden ticket, adjusted to a household of seven]

People in every age group and education level are less likely to be married and have fewer kids than 40 years ago. High school educated people in their thirties have exactly one fewer child per household than in 1970, which leads to a lot more disposable income per person that isn’t going to show up in household income data. There’s an argument that households are smaller because people are poorer, but that doesn’t bear out in the data in any way. My ex isn’t living by herself because she’s too poor to marry, she’s living by herself because she’s rich enough to afford it and is an independent woman whose choices I respect.

Still, 4 people living together can live much cheaper than 4 individuals. There are various household size adjustments; the simplest is to divide income by the square root of the number of people. So, a married couple with two kids is assumed to have twice the expenses of ~~my ex~~ a single person.
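In code the adjustment is one line; a minimal sketch with made-up incomes:

from math import sqrt

def equivalized_income(household_income, household_size):
    # square-root equivalence scale: a household of 4 is assumed to need 2x the income of a single person
    return household_income / sqrt(household_size)

print(equivalized_income(120_000, 4))   # 60,000 per-person-equivalent for a couple with two kids
print(equivalized_income(60_000, 1))    # 60,000 for a single person: the same standard of living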

forget to adjust for a million other things

Let’s start with the most obvious adjustment: inflation. If you double your income but everything you buy is twice as expensive, you’re no better off. So what’s the problem? Well, by some measures inflation has been much higher for rich Americans than for poor ones. That’s because poor Americans buy Chinese goods and shop at WalMart which between them keep prices really low. That’s also why every politician blames foreign trade and large corporations for inequality. Go figure.

[Image: It’s very important to adjust boobs for inflation]

Rich people spend much more on services (accountants, vegan chef nannies) and luxury goods, which are both “produced” in high-wage countries and aren’t for sale at WalMart. They also spend a lot on positional goods, and so see much higher inflation. You could also argue that inflation of Ferraris is a result of rich people becoming richer and not a hardship. It’s a choose-your-own-deflator adventure, and the choice is mostly driven by the agenda of the person showing you the data.

Another factor is work hours: households in the top 20% by wage work twice as many hours (3,969 per year) as the bottom two quintiles (1,719 and 2,663). Are the rich people simply hard working folk that deserve the fruits of their labor? Maybe they’re just lucky to have cushy jobs? Or perhaps the top are all married working couples and the bottom is single moms with five kids who can’t even afford to work? Pick your narrative and adjust away!

ignore immigration

The US houses 35 million immigrants from the developing world, with another half million joining them every year. That’s more poor people coming every 3-4 years than the entire deluge of Syrian refugees accosting Europe. If all the American poor are getting richer and the poorest percentiles are replaced each year by broke immigrants (who are still much richer than they were in their home countries), this wholly positive trend wouldn’t be reflected in the overall income distribution statistics.

compare the USA only to European welfare states

And ignore the billion people lifted out of extreme poverty worldwide in just the last two decades. Globalization tends to decrease inequality worldwide, but increase it within each country. If Apple weren’t allowed to sell their iPhones abroad, its American engineers would be earning less and so would the Chinese workers who assemble the phones.

With all that said, here’s how the US actually stacks up to Europe (USA is the top line in navy blue):

[Chart: disposable income distributions, US vs. European welfare states]

Despite wildly varying social expenditures, income-wise the US looks remarkably like a Finland or a Norway except with a bunch of millionaires added on top of everyone else (it’s true that this chart doesn’t account for quality of public services). Is having a bunch of millionaires in your country a good or a bad thing? Whatever it is, it’s the single issue everyone in the US seems obsessed with.

fixate on the top

A curious feature of many inequality crusaders is their often single-minded focus on the Top One Percent, henceforth abbreviated TOP. (Rhymes with GOP. Snap!) Are most Americans’ lives affected by whether Warren Buffet owns two yachts or twenty? Or even by Google deciding to give its top developers a raise? There may not be a big house, a spot at a decent college and a rewarding job for every single American, but there are enough for more than 1% of the people. There are very few things that normal people compete for against the TOP.

One area where people maybe do compete with the TOP is buying congressmembers, but it’s actually pretty hard to buy political outcomes even with lots of money. I think it would be much scarier if the TOP’s money was evenly split among 50% of the population. In that case, the same amount of money would come with a vastly larger voting constituency, to the point where the rich 50% could utterly dominate everything in the country to the detriment of the bottom half. Does the concentration of money in few hands help these hands coordinate effective class warfare? It seems far fetched. I wouldn’t know either way, I’m just one of the stupid sheeple manipulated by the elite.

What if the rich isolate themselves or fly away to a pristine space station like Elysium? Well, if the rest of us can’t manage to run a normal economy with the rich people gone then I guess we deserve to live in squalor.

There are perhaps good reasons to worry specifically about the very rich getting richer, but the attention paid to the TOP seems out of whack with the severity of the problem. I was about to give a list of articles in popular outlets discussing inequality mainly through the lens of the TOP, but then realized I didn’t see any that hadn’t. Did you know that the top 1% account for 57% of the words written about economic inequality? So unfair!

When someone says they’re talking about the middle class getting poorer, they often still talk about the rich getting richer. Here’s The Atlantic in a section titled “Goodbye Middle Class”:

A recent report by the Center for American Progress shows that in 1979, a majority of American households (59.5 percent) had earnings that qualified them as middle class (defined as working-age households with incomes between 0.5 and 1.5 times the median national income). In 2012, the share of middle class families had fallen to 45.1 percent, indicating that American households have become more concentrated at the top and bottom of the earnings ladder.

Why is that a problem? For one thing, mobility: More of the middle class is migrating to the lower class due to stagnant incomes and the increasing cost of living—which means more Americans are struggling to make ends meet. That’s not just bad for families; it’s bad for the economy.

And they present this chart:

[Chart: The Atlantic’s chart of the shrinking middle class]

I couldn’t replicate these numbers exactly, but the US census data shows the same trend. Using the “0.5-1.5 of median wage” definition, the middle class has indeed decreased from 53% in 1967 to 51% in 1979 to 45% in 2012.

In ’67, the 24-87th percentiles earned middle class incomes. In ’79 it was 24-85. In 2012 it was 24-79. The percent of households earning below half of the median wage has not changed one iota in half a century. The median wage itself could be dropping, which would be an actual problem, but the “decline” in the middle class is entirely caused by middle class people achieving escape velocity and becoming rich. Imagine the horror!

Even objective-seeming indexes of inequality mostly measure how rich the very rich are, since there’s not a lot of room on a national scale for the poor to be super poor. The most widely used measure of inequality is the GINI coefficient, which uses a simple formula to look at a distribution of income or wealth (a lower coefficient is more equal). Below I’ve simulated two economies: in the first one the income grows linearly, so the bottom 10% make $10, the second decile makes $20, etc. In the second case, I doubled the income of the top decile but also increased the wages of the entire bottom 40% and compressed the entire distribution. Basically, the second economy has no poor people: only a huge middle class and a few rich people who got richer. And yet, GINI got worse.
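Here’s the kind of comparison I mean, with made-up decile incomes that mimic the two simulated economies (illustrative numbers chosen to make the same qualitative point, not the exact ones behind the chart below):

def gini(incomes):
    # standard formula for the Gini coefficient of a list of incomes
    x = sorted(incomes)
    n = len(x)
    return 2 * sum(i * xi for i, xi in enumerate(x, 1)) / (n * sum(x)) - (n + 1) / n

linear  = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]   # economy 1: income grows linearly by decile
no_poor = [15, 25, 35, 45, 55, 65, 75, 85, 95, 200]   # economy 2: richer bottom 40%, top decile doubled

print(gini(linear))    # 0.30
print(gini(no_poor))   # ~0.36, "worse" inequality even though nobody got poorer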

[Chart: GINI coefficients of the two simulated economies]

With all that said, there is one really good reason to focus on the TOP in the US: they have a lot of money, and everyone else could use that money. Whether concentration of wealth impacts everyone negatively or not, and whether the TOP came by their riches by fair means or sinister, there’s a big opportunity and a temptation to redistribute that money around. Robin Hooding hits on two important topics: ethics and incentives. I’ll discuss both in part 3 of this series.


 

A shining example

Let me end that long list of cautionary tales with a heartwarming story of a country that embraced innovative policies to overcome terrible inequality. This large and diverse nation emerged from a century of frequent warfare facing serious economic issues. Most of its population lived in poverty while the TOP earned over 22% of the entire gross income. In their hour of inequality, the nation’s voters elected an anti-war party that also promised a complete social and economic overhaul.

Unlike most politicians, that party actually delivered. In the four decades that it was in charge, the income share of the TOP plummeted by more than half (to 9.9%) and so did the poverty rates. The income per capita, now much more equally distributed according to GINI, skyrocketed by a factor of 10. The country transformed from a poor backwater to a dynamic emerging market.

Where did that economic miracle occur? South Africa, 1948-1994. Either the secret to a thriving, equitable economy is complete racial segregation, or there’s something messed up with the way most people measure equality.

 Who wants some orange-tinted populist nationalism?


In part 2 of this series I’ll start by asking “what’s the actual harm of inequality?” in a utilitarian framework and focus on the data and measurements that are most relevant to that question. There will be a ton of charts.

In part 3, I’ll assess the utilitarian case for and against various ways to solve the bad things about inequality. I’ll also present some creative ideas that aren’t being discussed a lot because of the goofy way inequality is usually debated.

The Power of Power Skepticism

A quick estimate of experimental power can discern good science from bullshit. Too bad that some scientists forget to do it themselves.

I’m back from vacation! Here are some of the things we did abroad:

  1. Ate too much octopus. Or is it “too many octopods”?
  2. Bribed a cop.
  3. Watched quail chicks hatch.
  4. Got locked in an apartment and escaped by climbing from roof to roof.
  5. Poked a sea turtle.
  6. Endured getting stung by mosquitoes, jellyfish and fire coral.
  7. Swam in the sea. Swam in a cave. Swam in a river. Swam in a cave-river flowing into the sea.
  8. Snuck into a defunct monkey sanctuary whose owner was killed by a runaway camel and which is now haunted by mules and iguanas.

I returned from the wilderness tanner, leaner, and ready to expound about proper epistemology in science. This is going to get really technical, I need to get my math-cred back after writing 5,000 words on feminism, Nice Guys and living with my ex-girlfriend.


Effect-skeptic, not Ovulation-skeptic

This post is about gauging the power of experiments in science. I originally presented this  to a group of psychology grad students, as advice on how to avoid publishing overconfident research that will fail to replicate (I’ve never actually published a scientific article myself, but neither was I ever accused of excess humility). This subject is just as relevant to people who read about science: quickly estimating the power of an experiment you read about can give you a strong hint whether it’s a groundbreaking discovery or a p-hacked turd sandwich. Like other tools in the Putanumonit arsenal of bullshit-detectors, this is tagged defense against the dark arts.

I was inspired by Prof. Uri Simonsohn, data vigilante and tireless crusader against the dark arts of fake data, misleading analysis and crappy science. Simonsohn recently appeared on Julia Galef’s Rationally Speaking podcast and had a very interesting take on science skepticism:

Some people would say you really have to bring in your priors about a phenomenon before accepting it. I think the risk with that is that you end up being too skeptical of the most interesting work, and so you end up, in a way, creating an incentive to doing obvious and boring research.

I have a bit of a twist on that. I think we should bring in the priors and our general understanding and skepticism towards developing the methodology, almost blind to the question or the hypothesis that’s being used.

Let’s say you tell me you ran an experiment about how preferences for political candidates shift. Then I should bring to the table how easy it is to shift political preference in general, how noisy those measures are and so on, and not put too much weight on how crazy I think it is that you tell me you’re changing everything by showing an apple below awareness. My intuition on how big the impact of apples below awareness are in people is not a very scientific prior. It’s a gut feeling.

I don’t know that the distinction is clear, but when it’s my prior about the specific intervention you’re claiming, there I try not to trust my intuition. And the other one is, what do I know about the reliability of the measures, how easy it is to move the independent variable? There I do, because in the latter case it’s based on data and the other one is just my gut feeling.

Here’s how I understand it: when you read a study that says “A causes B as measured by the Tool of Measuring B (ToMB)” you usually know more about B (which is some variable of general interest) and the accuracy of ToMB than you know about A (which is usually something the study’s authors are experts on). Your skepticism should be based on these two questions:

  • How easy it is to move variable B and by how much?
  • Is the sample size large enough to measure the effect with ToMB?

You should not be asking yourself:

  • How likely was A to cause the effect?

Because you don’t know. If everyone had a good idea of what A does, scientists wouldn’t be researching it.

As Simonsohn alludes to, political-choice research is notorious for preposterous effect size claims. Exhibit A is the unfortunate “The fluctuating female vote: politics, religion, and the ovulatory cycle” which claims that 17% of married women shifted from Obama to Romney during ovulation. This should smell like bullshit not because of the proposed cause (ovulation) but because of the proposed effect (voting switch): shifting even 1% of voters to the opposite party is insanely hard. Reason doesn’t do it. Self-interest doesn’t do it. I suspect that we would notice if 10 million women changed their political affiliation twice a month.

[Image: “5 ways to tell you’re ovulating” article]

Finding an effect that large doesn’t mean that the experiment overstated a small effect, it means that the experiment is complete garbage. +17% is basically as far away from +1% as it is from -1% (effect in the opposite direction). An effect that’s 20 times larger than any comparable intervention doesn’t tell you anything except to ignore the study.

Per Simonsohn, the ovulation part should not be a reason to be doubtful of the particular study. After all, the two lead authors on “fluctuating” (both women) certainly know more about ovulation than I do. In fact, the only thing I know of that correlates highly with ovulation is the publishing of questionable research papers.


Stress Testing

For scientists, the time to assess the possible effect sizes is before the experiment is conducted, not afterwards. If the ovulation researchers predicted an effect of +0.5% before doing the study, the +17% result would’ve alerted them that something went wrong and spared them from embarrassment when the study was published and when it was (predictably) contradicted by replications.

Estimating the effect size can also alert the researchers that their experiment isn’t strong enough to detect the effect even if it’s real, that they need to increase the sample size or design more accurate measurements. This could’ve prevented another famous fiasco in psychology research: the power pose.

[Figure: power poses from Carney, Cuddy and Yap 2010]

“Power posing” is the subject of the second most popular TED talk of all time. I got the link to it from my ex-girlfriend who eagerly told me that watching it would change my life. I watched it (so you shouldn’t). I read the original paper by Carney, Cuddy and Yap (CC&Y). I offered to bet my ex-girlfriend $100 that it wouldn’t replicate. Spoiler alert: it didn’t, and I am currently dating someone with superior scientific skepticism skills.

Let’s see how predictable the powerlessness of power posing was ahead of time. CC&Y claim that holding an expansive “power” pose not only increases self-reported “feeling of power” but also lowers levels of cortisol – a hormone that is released in the body in response to stress and affects functioning.

We want to find out how cortisol fluctuates throughout the day (which would affect measurement error) and how it responds to interventions (to estimate effect size). A quick scholar-Googling leads us to a 1970 paper on the circadian cortisol pattern in normal people. It looks like this:

[Figure: daily cortisol fluctuation, Weitzmann et al. 1970]

Cortisol levels vary daily over the range between 5 and 25 µg/dl (shown on the chart as 0-20). Daily mean cortisol levels vary by about 4 µg/dl person-to-person (that’s the standard deviation) and measurements 20 minutes apart for the same person vary by about 2.5  µg/dl. There’s also an instrument error (how accurately taking a saliva sample measures actual cortisol levels) which is too annoying to google. Since CC&Y measure the difference in cortisol levels between two groups of 21 people 17 minutes apart, their standard error of measurement should be around:

\sqrt{\frac{4^2 + 2.5^2 + something^2}{21}} \approx 1.2

Ideally, to measure the effect of anything on cortisol that effect should be at least 3 times the measurement error, or around 3.6. A source for possible effect size estimates is this work on caffeine, stress and cortisol by Lovallo 2008.

[Figure: caffeine, stress, and cortisol, Lovallo 2008]

The caffeine paper uses a slightly different measurement of cortisol but we can superimpose the 1-5 range in the chart on the normal daily 5-25 µg/dl level. Best I could tell, drinking a ton of coffee affects cortisol levels by 4 µg/dl. More interestingly, the subjects in Lovallo’s study underwent a “stress challenge” at 10 am specifically designed to raise their cortisol. Which it did, by around 2 µg/dl. I may be willing to accept that 1 minute of posing has half the effect of a 30 minute stress challenge on stress hormones, but no more. That means that by having only 42 participants, CC&Y are looking for a 1 µg/dl effect in a measurement with 1.2 µg/dl error. These numbers mean that the experiment has 13% power (I’m perhaps too generous) to detect the effect even at the weak p=0.05 level. The experiment has a 20% chance to find an effect with the opposite sign (that power poses raise cortisol instead of reducing it) even if CC&Y’s hypothesis is true.
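You can rebuild the 13% (and the 20% wrong-sign chance) from nothing but the assumed 1 µg/dl effect and the ~1.2 µg/dl standard error; a sketch of that back-of-the-envelope power calculation:

from statistics import NormalDist

z = NormalDist()

def two_sided_power(effect, se, alpha=0.05):
    # chance of a significant result in the right direction, given a true effect of this size
    z_crit = z.inv_cdf(1 - alpha / 2)       # ~1.96
    return 1 - z.cdf(z_crit - effect / se)

effect, se = 1.0, 1.2                       # assumed posing effect and measurement error, in µg/dl
print(two_sided_power(effect, se))          # ~0.13: the 13% power figure
print(z.cdf(-effect / se))                  # ~0.20: chance the measured effect points the wrong way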

Unless something is inherently broken with the measurement methodology, the straightforward solution to increase experimental power is to increase the sample size. How much does it cost to pay an undergrad to spend 20 minutes in the lab and spit in a tube? A cynic would say that the small sample size was designed to stumble upon weird results. I don’t know if I’m that cynical, but the replication recruited 200 subjects and found that the effect on cortisol is as follows:

[Figure: cortisol results from the Ranehill et al. replication]

When an underpowered experiment finds a significant effect, it’s rarely because the scientists got lucky. Usually, it’s because the experiment and analysis were twisted enough to measure some bias or noise as a significant effect. It’s worse than useless.


Too Late for Humility

There’s nothing wrong with running an underpowered experiment as exploratory research – a way of discovering fruitful avenues of research rather than establishing concrete truths. Unfortunately, my friends who actually publish papers in psychology tell me that every grant request has “power = 80%” written in it somewhere, otherwise the research doesn’t get funded at all. A scientist could expend the effort of calculating the real experimental power (13% is a wild guess, but it’s almost certainly below 20% in this case), even if that number is to be kept in a locked drawer. If she does, she’ll be skeptical enough not to trust results that are too good to be true.

Here’s a beautiful story of two scientists who got a result that seemed a tad too good for the quality of their experimental set-up. They stayed skeptical, ran a replication with pre-planned analysis (the effect promptly disappeared) and spun the entire ordeal into a publication on incentive structures for good science practices! Here’s the underappreciated key part of their story:

We conducted a direct replication while we prepared the manuscript. 

Humility and skepticism have a chance to save your soul (and academic reputation), but only until you are published. Perhaps Carney and Cuddy would’ve agreed with my analysis of the 13% power, but there’s no sign that they did the calculation themselves. Even if they originally wrote “80%” just as a formality to get funded, once they neglected to put a real number on it nothing could keep them from believing that 80% was true. Confirmation bias is an insidious parasite, and Amy Cuddy got the TED talk and a book deal out of it even as all serious psychologists rushed to dismiss her findings. As the wise man said: “It is difficult to get a man to understand something, when his salary depends on his not understanding it.”

p val precious

In her TED talk, Cuddy promises that power posing has the power to change anyone’s life. In a journal reply to the replication by Ranehill, she’s reduced to pleading that power posing may work for Americans but not Swiss students, or that it worked in 2008 but not in 2015, or seriously arguing that it only works for people who have never heard of power posing previously, making the TED talk self-destructive. If you’re left wondering how Carney, Cuddy and Yap got the spurious results in the first place, they obligingly confess it themselves:

Ranehill et al. used experimenters blind to the hypothesis, and we did not. This is a critical variable to explore given the impact of experimenter bias and the pervasiveness of expectancy effects.

Let me translate what they’re saying:

We know that non-blind experiments introduce bogus effects, that our small-sample non-blind experiment found an effect but a large-sample blind study didn’t, but yet we refuse to consider the possibility that our study was wrong because we’re confirmation biased like hell and too busy stackin’ dem benjamins.


Questioning Love and Science

Let’s wrap up with a more optimistic example of using power-estimation to inform us as readers of research, not scientists. At the end of Love and Nice Guys I mentioned the article on the 36 questions that create intimacy, itself based on this research by Arthur Aron et al. Again, we withhold judgment on the strength of the intervention (the 36 questions) and focus on the effect, in this case as measured by the corny-sounding Inclusion of Other in the Self Scale (IOS).

A search of interventions measured by IOS leads us to the aptly-titled “Measuring the Closeness of Relationships” by Gachter et al. which includes the following table:

IOS diffs.png

The IOS measures an equal difference (1.4-1.5) between good friends and either the “closest, deepest, most involved, and most intimate relationship” or on the other hand an “acquaintance, but no more than an acquaintance”. The SD of the scale is 1.3, and since Aron has 50 people in each group (36 questions vs. just small talk) we divide 1.3 by the square root of 50 to get a standard error of 0.18. To achieve 80% power at p=0.05 (80% at .05 should be the bare minimum standard) the effect of an hour discussing the 36 questions should be 0.5, or roughly one third of the distance between acquaintances and friends. (Here’s a power-calculator, I just use Excel).
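Here’s the same arithmetic as a few lines of Python, in case you’d rather not trust my Excel. It just reproduces the 1.3/√50 standard error and the smallest effect detectable at 80% power and two-sided p=0.05:

```python
from math import sqrt
from scipy.stats import norm

sd, n = 1.3, 50
se = sd / sqrt(n)                          # ≈ 0.18 on the IOS scale
z_alpha = norm.ppf(0.975)                  # two-sided p = 0.05
z_power = norm.ppf(0.80)                   # 80% power
min_detectable = (z_alpha + z_power) * se  # ≈ 0.5, a third of the friend-acquaintance gap
print(round(se, 2), round(min_detectable, 2))
```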

An intimate hour taking people one third of the way to friendship doesn’t seem implausible, and in fact the study finds that the intimacy-generating 36 questions increase IOS by 0.88: high enough to be detectable, low enough to be plausible. We don’t know if the IOS is really a great measure of intimacy and potential for love, but that’s outside the scope of the study. They found the effect that was there, and I expect these findings to replicate when people try to confirm them. Putanumonit endorses asking potential partners “how do you feel about your relationship with your mother?” (question 24) as trustworthy science. The whole of the googling and power-calculating took me just under half an hour.

I’m not this energetically skeptical about every piece of science news I hear about, but I am more likely to be suspicious of research that pops up on my Facebook wall. If you read about a new study in the NY Times or IFLScience, remember that it’s covered because it’s exciting. It’s exciting because it’s new and controversial, and if it’s new and controversial it’s much more likely than other research to be flat out wrong. If you’re thinking of power-posing every day or planning to seduce the man of your dreams with questions, 30 minutes of constructive skepticism can keep you out of the worst trouble.

 

The Lukewarm Hand

What’s worse, the hot hand fallacy or the fallacies committed by scientists researching hot hands?

I’m struggling a bit with the next post in the dating sequence. I can’t put any numbers in it, only some pretty personal stories and questionable ideas that touch on touchy subjects. I’m really not sure I should even write it. That leaves me with two options:

  1. Not write it.
  2. Before I write it, scare my entire readership away with a 4,000-word mathematical geek-out about a 30-year-old research paper that’s going to include a Monte Carlo simulation of 320,000 jump shots.

If you know me at all, you know this isn’t much of a dilemma.

This post is tagged rationality  because it deals with biases in decision making. It’s also tagged defense against the dark arts: we’re going to see how even good research papers get statistics wrong sometimes, and how smart statistical skepticism when scrutinizing science studies can save your skin.


Who Has the Hot Hand?

Steph Curry hits a 3 pointer, the crowd cheers. The next trip down the floor he hits another, from the corner. The buzz in the building rises in pitch, another shot with a hand in his face… swish! The crowd is standing now, screaming, everyone feels that Curry is on fire. “The basket is as big a barn door to Steph now,” the announcer is giddy, “He can’t miss!” Curry catches the ball at the top of the arc and releases another three pointer, the ball curves smoothly through the air. There’s no way he’s going to miss, is there?

Curry_3

The hot-hand fallacy is the intuitive tendency to assume that people will continue to succeed after a row of successes, even when success and failure come from a random process. It was explained by Gilovich, Vallone and Tversky in 1985, who concluded that there’s no evidence for “hot hands” or “streak shooting” actually happening in basketball. They classified the belief as a fallacious heuristic – a mistaken judgment. The publication followed a stunning string of groundbreaking papers on heuristics and biases by Amos Tversky, whose work with Daniel Kahneman later won the Nobel Prize (awarded to Kahneman after Tversky’s death). I guess everyone just kinda assumed that Tversky’s hot streak must keep going and he can’t be wrong on this one.


Gamblers vs. Streakers

The first weird thing about the hot-hand fallacy (HHF) is that it’s supposed to be the same thing as the gambler’s fallacy (GF), except with precisely the opposite outcome. Gambler’s fallasticians believe that a coin that landed heads several times in a row is more likely to land on tails on the next flip because it’s “due”. GF really victimized the roulette players who kept betting on red when the wheel landed on black 26 times in a row in Monte Carlo in 1913. Wouldn’t it have actually made sense for these gamblers to switch to “hot hand betting” and double down on black? Besides being the intuitive thing to do after seeing 20 blacks in a row, pure Bayesian rationality would also seem to point that way. My prior on the roulette wheel being broken to bias black is not large, but it’s larger than 1 in 1,000,000, and 1 in a million is more than the odds of a fair wheel landing on black 20 times in a row. After 20 blacks, I wouldn’t be going back.

Supposedly, both GF and HHF come from the representativeness heuristic: people don’t believe a 50-50 variable should have long streaks of one outcome because heads-heads-heads-heads-heads doesn’t look like a fair coin. If the coin is known to be fair, people believe that the next flip “must” land tails to make the streak more “fair like” (GF). If the “coin” is actually Steph Curry, people decide he’s not actually 50-50 to hit his next shot but rather that he morphed into a 90% shooter (HHF). So: GF for objects and HHF for people.

But wait, do people expect Steph to continue hitting 90% for the rest of his career? Of course not! After a single miss everyone will assume he’s back to 50-50. I’m still not sure how, according to the theory, people decide if they’re going to HHF and assume a Curry basket or GF and assume he’s due for a brick. Maybe they flip a fair coin to make up their mind.

Croson and Sundali (2005) figured out how to expense a fun weekend to their research grant and took a trip to a Nevada casino to see the fallacies in action. Gambler’s fallacy with actual gamblers is a slam dunk: after six spins on the same color, 85% of roulette players bet the other way. The sample size isn’t huge, but the data seems unequivocal:

gamblers fallacy.png
Croson and Sundali (2005)

The evidence for hot hand abuse is a bit slimmer.

Croson and Sundali: Of our 139 subjects, 80% (111) quit playing after losing on a spin while only 20% (28) quit after winning. This behavior is consistent with the hot hand; after a win players are likely to keep playing (because they’re hot).

The information above is utterly useless without knowing the base rate of winning on a spin; if only 10% of spins won, it would mean that people quit more after winning! I figured out that number from other data in the paper: it’s 33% (1024 / 3119). Also, 100% of the people who quit because they ran out of chips did so after a losing spin. How many of the people who quit did so because they lost their last chip? If it’s more than 13% (the difference between the 67% of spins that lose and the 80% of quitters who quit after losing), the conclusion isn’t only annulled, but reversed.
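The base-rate check is three lines of arithmetic; all the numbers are the ones quoted from the paper above:

```python
win_rate = 1024 / 3119        # share of spins that win, from the paper's own tables
quit_after_loss = 111 / 139   # share of quitters whose last spin was a loss

baseline = 1 - win_rate       # ≈ 67%: expected share if quitting ignored the last spin
excess = quit_after_loss - baseline
print(f"{baseline:.0%} expected, {quit_after_loss:.0%} observed, {excess:.0%} excess")
```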

Spoiler alert: research papers that contain the data of their own refutation are going to be a theme today. Stay tuned.

hot hand reg

The main support for the hot hand effect comes from this regression on the right. People who just won place 1 additional bet, 12 on average instead of 11, when controlling for bets placed on the first and previous spins. The researchers couldn’t directly observe the amounts, just the number of different bets, and used that as a proxy. Obvious counterargument is obvious, and Croson and Sundali make it themselves:

There are alternative explanations for these behaviors. For example, wealth effects or house money effects might cause an increase in betting after a win. In our empirical data we will not be able to distinguish between these alternative explanations although previous lab experiments have done so. 

When people are doing something weird, I often prefer to assume irrationality rather than conjuring convoluted utility functions to explain away the behavior (can I call that the Economist’s Fallacy?) In this case, however, putting one extra bet after having won seems quite reasonable. That’s how you would bet if, for example, you were trying to manage risk in order to play for a fixed number of spins and then get lunch. Whatever you think of the rationality level of people spending their day at the roulette table, it’s really hard to see much hot-hand fallacy there.


Shooters Gonna Shoot

Bocskocsky, Ezekowitz and Stein (2014) couldn’t get a trip to Vegas approved, so they went back to basketball armed with some great data that wasn’t dreamt of in 1985: cameras that track each player and the ball every second of the game. Without the distraction of Vegas, the researchers first developed a full model of shot difficulty that incorporates everything from the angle of the nearest defender to the time remaining in the game.

Armed with that model, the paper shows that “hot” players take slightly more difficult shots (e.g. from further away and in the face of tighter defense). Controlling for difficulty, players do shoot better after a few makes but not enough to make up for the increased difficulty. However, if a player who just made two shots takes another one of the same difficulty fans are right to expect a 2.4% better chance of sinking the shot compared to a player coming off two misses.

dirty harry
I just made my last three shots, so you’ve gotta ask yourself one question: “Do I feel lucky?” Well, do ya, punk?

The bottom line is, the study is excellent but not very exciting. It concludes that the hot-hand isn’t a fallacy while also calculating that players are better off shooting less after a streak. Fans are justified to expect a hot streak to continue if they adjust for shot difficulty, but not if they don’t.

Finally, the Andrew Gelman-shaped Angel of Statistical Skepticism (ASS) on my shoulder reminds me to watch out for beautiful gardens that hide many forking paths. Even being completely honest and scrupulous, the researchers have a lot of small choices to make in a research project like this. Which of the dozens of variables to include in assessing shot difficulty? Which measures of “heat” to focus on? Which parameters to include in the regression? Every choice makes perfect sense in the moment, but the fact is that those choices were available. A slightly different data set could have pushed the researchers towards doing a slightly different analysis that would’ve found statistical significance for some other result. A tiny effect size plus a multitude of “researcher degrees of freedom” make me think that the 1% p-value on the main finding is probably no better than a 5%, and 5% p-values are wrong at least 30% of the time.

I think that Bocskocsky, Ezekowitz and Stein did a great job and I certainly don’t believe they were in any way dishonest, but I’d be very happy to bet at 100-1 odds that their 1% p-value will not replicate.


The Hot Hand Bias Bias

Why even spend hours fitting models to data when you can do some arithmetic and turn other people’s data against itself? Miller and Sanjurjo (2015) did just that, and almost made $200,000 off a hedge fund guy while at it.

Miller and Sanjurjo noticed that even for a perfectly random variable, any limited sequence of observation is likely to show an anti-hot-hand bias. This confounds attempts to detect hot hands and contributes to the gambler’s fallacy. For illustration, let’s look at sequences of 3 basketball shots. We assume that every player hits 50% of their shots, so each one of the 8 sequences is equiprobable. For each sequence (imagine it’s 8 players), we’ll calculate the percent of made baskets followed by another make.

three shots.png

We assumed that every player hits 50% of their shots no matter what, but somehow the average player makes 41.7% of their shots after a made basket! The discrepancy comes from the fact that 50% is averaged across all shots, but 41.7% is averaged across all players. Changing the aggregation or averaging level of your data can not only mess up your finding, but also flip it and reverse it.
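You can verify the 41.7% by brute force. Here’s a quick sketch that enumerates all 8 sequences; players whose first two shots were both misses take no shot-after-make and drop out of the average, just like in the table:

```python
from itertools import product

rates = []
for seq in product([1, 0], repeat=3):      # all 8 equiprobable sequences of 3 shots
    # shots taken immediately after a make (only the first two shots can be followed)
    followups = [seq[i + 1] for i in range(2) if seq[i] == 1]
    if followups:
        rates.append(sum(followups) / len(followups))

print(sum(rates) / len(rates))             # ≈ 0.417, not 0.5
```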

If you bet against streaks continuing on the roulette, you will win most days but on the few days you lose, you’ll lose a lot. If, like Gilovich and Tversky, you look at a lot of basketball players, most players will appear to shoot worse after a streak but the few that shoot better will shoot much better. That better percentage will also continue over more shots since those players will have more and longer streaks.

Gilovich let 26 basketball players from the Cornell varsity teams shoot uncontested jump shots from a distance at which each player shoots 50%. He found an insignificant 4% increase in shooting after 3 makes vs. after 3 misses. Miller and Sanjurjo apply their correction to the original 1985 data, and calculate an implied difference of 13%!

The only question is, why apply corrections to poorly aggregated data when we can just change the aggregation level directly?


Data of Their Own Demise

To their credit, Gilovich, Vallone and Tversky not only went out to the gym with the varsity teams (can you imagine Calipari’s Wildcats participating in a statistics study?) but also provided the full data of their observations and not just the percentages:

gilovich cornell.png

As we saw, averaging across all players finds a gap of 4% (49% vs. 45%) in shooting after a hot streak vs. a cold streak. The numbers in parentheses are the actual shots taken, using these along with the shooting percentages allowed me to reverse engineer the data and calculate total makes and misses after streaks.

  • After 3 misses: 161 out of 400 shots = 40%.
  • After 3 makes: 179 out of 313 shots = 57%.

That 17% is a humongous difference, equal to the difference in 2-point shooting between the second best player in the NBA this season and the fourth worst. The difference disappears in the original study because of aggregation levels. When you aggregate by players, the super-streaky Male #9 (48% gap) counts the same as his consistent friend, Male #8 (7%). However, dude #9 took four times as many post-streak shots as his buddy, and when that data counts four times as much, the shooting gap emerges clear as day.

Gilovich also looks at free throw shooting data by the Celtics and again goes to considerable lengths to avoid seeing evidence of hot-hand shooting:

gilovich celtics.png

Gilovich starts by asking a bunch of supposedly ignorant and biased basketball fans to estimate the shooting percentage of a 70% average free throw shooter after a make and a miss. They estimate an average gap of 8%: 74% vs. 66%. Instead of looking at the gap directly, Gilovich calculates a correlation for each player, finds that none of them are significant, and happily proclaims that “These data provide no evidence that the outcome of the second free throw is influenced by the outcome of the first free throw” (Gilovich et al. 1985).

larry-bird-shooting-free-throw.jpg
Larry Bird concentrates as he calculates serial regressions in his head.

If you ask me, the evidence that the data provide is that players hit 428/576 shots after a miss (74.3%) and 1162/1473 after a make (78.9%) for a nice 4.6% gap.

Oh no, Gilovich objects, not so fast: “Aggregating data across players is inappropriate in this case because good shooters are more likely to make their first shot than poor shooters. Consequently, the good shooters contribute more observations to P (hit/hit) than to P (hit/miss) while the poor shooters do the opposite, thereby biasing the pooled estimates” (Gilovich et al. 1985). 

Good point there, Dr. Gilovich, but remember that you asked the fans about 70% shooters specifically. We can avoid the good shooter/bad shooter bias by grouping players with identical FT%. As fate would have it, Parish, Ford, McHale and Carr all shoot between 70.5% and 71.2%: almost identical and close to 70% (I calculated each player’s exact shooting data from the number of shots and percentages in the table). These four players shoot 3.2% better after a make than after a miss.

Is a 3-4% gap significant? Who cares, the word “significant” is insignificant. A pernicious mistake that scientists constantly make is assuming that every rejection of the null is confirmation for the alternative. The fact that the data is unlikely under the null hypothesis doesn’t mean it’s any likelier under some other model. Here, Gilovich et al. make the flipped mistake: assuming that failure to reject the null hypothesis (0% gap after a make) confirms the null is true. However, the naive alternative (fan estimate) was an 8% gap. You can calculate p-values from now till the Sixers win, it doesn’t change the fact that 4% is as close to 8% as it is to 0%. The kind of statistical malpractice where a 4% result rejects the 8% hypothesis and confirms the 0% one is why some Bayesians react to frequentists with incandescent rage.

Rage aside, I’m left with a dilemma. On the one hand, disagreeing with Amos Tversky probably means that I’m not so smart. On the other hand, the Cornell students shot 17% better after a streak of makes and Tversky’s friends concluded “no effect”. Screw it, argument screens off authority. The hot-hand fallacy is dead, long live the hot hand!


The Streak is the Signal

Summary so far: research paper that claims that hot-hand shooting exists finds a 2% improvement in shooting after a streak, research paper that claims that hot-hands are bullshit finds gaps between 3% and 17%. Science FTW!!!

Even if the data was straightforward, it’s still just correlations and regressions. Without a plausible mechanism to explain the effect I trust it only as far as I can throw it. So why does hitting shots make you hit more shots? The announcers usually babble something about confidence or “being in the zone”, but I can’t throw announcers really far and I trust their analysis even less. If you’ve seen Steph Curry or Larry Bird shoot, you wouldn’t doubt that they’re 100% confident in every single shot they take.

It turns out there’s a remarkably simple answer that accounts for the hot-hand effect: all you need is a player having a priori different shooting percentages in different games. The simplest model assumes that each shot a player takes has the same odds of going in, but what if a player has games where something makes his shooting percentage higher or lower independently of streaks?

Kawhi Leonard, Kevin Durant

Let’s look at Kevin Durant, a dude who’s pretty good at shooting basketballs. He takes 20 shots a game and makes 50% of them over a season. In a specific game, however, Durant may have defensive player of the year Kawhi Leonard inside his shirt and shoot 32%. The next game, he’s guarded by octogenarian Kobe Bryant and something called Larry Nance Jr., and he shoots 78%. Even if we assume that Durant’s shooting percentage doesn’t change throughout the game, in games where he shoots a higher percentage he’ll also get more streaks, and more attempts at shot-after-streak.

To see this in action, I simulated 1,000 games for Durant and counted the shots made and missed after 3 hits in a row. I simulated 20 shots in each game, but in 500 of them his shooting percentage is set to 60% and in the other 500 it’s set to 40%.

# of Games | FG% | Streaks of 3 | Make after streak | Miss after streak
500 | 60% | 1872 | 1124 (60%) | 748 (40%)
500 | 40% | 551 | 209 (38%) | 342 (62%)

Overall FG% after a streak: 1333 / 2423 = 55%
The chance of making a shot after a streak is either 60% or 40% depending on the game, but more than three quarters of the streaks happen in the 60% games. Every shot made after a streak gives another opportunity for a hot-hand shot, in a couple of the simulated games Durant makes 9 or 10 shots in a row! Because of that, even though his overall shooting percentage is exactly 50%, Durant’s shooting percentage after a streak is 55%. The fans are justified to expect a hot hand after 3 makes: the streak doesn’t cause the higher scoring chance, but it sends a signal that Durant is having a high FG% game. We have a perfect explanation for hot hands without any (hot) hand waving about “confidence” and “zones”.
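Here’s a minimal sketch of that simulation (the function and its parameters are mine; exact counts will wobble with the random seed, but the after-streak percentage lands around 55%):

```python
import random

def after_streak_counts(fg_levels, games_per_level=500, shots_per_game=20, seed=0):
    """Count makes and misses on shots taken right after 3 consecutive makes."""
    random.seed(seed)
    made_after = missed_after = 0
    for fg in fg_levels:                   # underlying FG% for this block of games
        for _ in range(games_per_level):
            streak = 0
            for _ in range(shots_per_game):
                hit = random.random() < fg
                if streak >= 3:            # this shot follows three straight makes
                    made_after += hit
                    missed_after += not hit
                streak = streak + 1 if hit else 0
    return made_after, missed_after

made, missed = after_streak_counts([0.60, 0.40])
print(made, missed, round(made / (made + missed), 2))  # ≈ 0.55 despite a 50% season average
```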


Deviations of Deviations

The “variable shooting” theory is simple, elegant and explains the hot-hand shooting gap perfectly. Researchers take note: if you have a beautiful theory, don’t risk it by exposing it to ugly data! Oh, what the hell, I’m not getting paid for this anyway.

We can’t directly tell from someone’s shooting success what the underlying percentage was in a particular game, and we’re looking for evidence that the underlying percentage actually differs from one game to another. A consistent 50% shooter (no variability in underlying percentage) will still hit 3/9 on a bad day or 12/16 on a lucky outing. However, he’ll have fewer games where he shoots a number that’s very different from 50% than someone who alternates games with 60% and 40% underlying probabilities. We can find indirect evidence for game-to-game fluctuations by looking at how variable the game-to-game actual shooting percentage is. The higher the observed variance, the more evidence it shows for underlying variance. The question is, how much higher should it be?

A player’s field goals in a game follow a Binomial Distribution with parameters n=number of shots and p=underlying FG% for that game. The variance of a binomial variable is n \cdot p \cdot (1-p) . The variance of the actual shooting percentage outcome is \frac{p \cdot (1-p)}{n} .

The leader in 2-point field goal attempts last season is LaMarcus Aldridge, who made 47.5% of his shots and took 18.45 attempts per game. If his underlying FG% was always 47.5%, the variance in his shooting percentage would be:

Var = \frac{p \cdot (1-p)}{n} = \frac {0.475 \cdot 0.525}{18.45} = 0.0135

The standard deviation we would see over a season would be:

\sigma = \sqrt{Var \cdot \frac{N}{N-1} } = \sqrt{0.0135 \cdot \frac{73}{72} } = 11.7\%

lamarcus

If instead of a steady 47.5% LaMarcus shoots either 10% above or below that number (57.5% in half his games and 37.5% in the other half), the variance would increase by (0.575 - 0.475)^2 = 0.01 and the standard deviation would increase from 11.7% to 15.3%. More reasonably, if he deviates from his season average FG% by 5%, the variance would increase by 0.0025 and the standard deviation by 1%, from 11.7% to 12.7%. That 5% game-to-game difference should be enough to create the 2% hot-hand improvement found by Bocskocsky et al.
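Here’s the arithmetic in one place. It’s a small sketch that recomputes the binomial term at each underlying FG% level instead of just adding the squared swing, which is how it lands on the same 11.7%, 15.3% and 12.7%:

```python
from math import sqrt

n_shots, n_games = 18.45, 73

def season_sd(p_levels):
    """SD of game-to-game FG% when the underlying FG% alternates evenly between p_levels."""
    within = sum(p * (1 - p) / n_shots for p in p_levels) / len(p_levels)
    mean_p = sum(p_levels) / len(p_levels)
    between = sum((p - mean_p) ** 2 for p in p_levels) / len(p_levels)
    return sqrt((within + between) * n_games / (n_games - 1))

print(season_sd([0.475]))         # ≈ 0.117: a steady 47.5% every game
print(season_sd([0.575, 0.375]))  # ≈ 0.153: swings of ±10%
print(season_sd([0.525, 0.425]))  # ≈ 0.127: swings of ±5%
```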


Invariable Shooting

An 11.7% standard deviation in game-to-game FG% isn’t a perfect estimate of the actual variability because the number of shots a player takes changes each time and that pushes the variance higher. However, if a player’s underlying percentage goes up and down by 5% we still expect to see about a 1% increase in game-to-game standard deviation relative to the baseline case in which he enters each game with a constant underlying FG%. To figure out that baseline, I looked at the top 10 players from last season in 2-point attempts (2PA) and simulated each of their seasons 20 times. For each game, I kept the actual number of attempts fixed but generated a random number of makes using the player’s season-long 2-point shooting percentage (2P%). All the data are from the magnanimous treasure trove of Basketball-Reference.com.

For example, Aldridge made 8 out of 24 shots (33%) on the last game of the 2015 season. His season long 2P% was still 47.5%, so in my 20 simulations he hit 9, 4, 9, 10, 7, 8, 14, 12, 12, 9, 12, 9, 10, 5, 10, 8, 9, 11, 6 and 12 of his 24 shots. I took the game-to-game shooting percentage deviation in each simulated season and averaged these to get the baseline deviation. I then compared this to the player’s actual game-to-game deviation, looking for the actual deviation to be about 1% higher than the baseline for most players.
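For the curious, the baseline simulation boils down to something like this sketch (the attempt counts below are placeholders; the real ones come from each player’s Basketball-Reference game log):

```python
import random
from statistics import mean, stdev

def baseline_sd(attempts_by_game, season_pct, n_sims=20, seed=0):
    """Average game-to-game SD of FG% if the underlying FG% never changed."""
    random.seed(seed)
    season_sds = []
    for _ in range(n_sims):
        game_pcts = [sum(random.random() < season_pct for _ in range(shots)) / shots
                     for shots in attempts_by_game]
        season_sds.append(stdev(game_pcts))
    return mean(season_sds)

# Placeholder attempt counts standing in for a 72-game log
print(baseline_sd([24, 18, 15, 20, 17, 22, 12, 19] * 9, 0.475))
```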

Player | 2P% | 2PA | Baseline Deviation | Actual Deviation | Actual – Baseline | 2PA-2P% correlation
LaMarcus Aldridge | 47.48% | 18.45 | 11.99% | 12.21% | 0.22% | .02
Nik Vucevic | 52.42% | 16.22 | 13.05% | 11.28% | -1.77% | .17
Anthony Davis | 54.00% | 17.46 | 12.73% | 15.82% | 3.09% | -.02
Russ Westbrook | 45.82% | 17.66 | 12.79% | 12.62% | -0.17% | -.04
Pau Gasol | 49.51% | 14.45 | 13.21% | 13.01% | -0.20% | .04
Blake Griffin | 50.40% | 16.70 | 12.40% | 12.40% | 0% | -.16
Monta Ellis | 48.69% | 13.54 | 14.35% | 16.62% | 2.27% | .12
Boogie Cousins | 46.88% | 17.93 | 12.91% | 12.56% | -0.35% | -.10
Marc Gasol | 49.95% | 13.02 | 14.40% | 12.06% | -2.34% | .32
Ender Wiggins | 45.30% | 12.32 | 14.46% | 12.33% | -2.13% | .11

Shit, I really liked that theory.

Only 2 out of 10 players have actual game-to-game variance that’s significantly higher than the baseline, and 3 have a much lower one! Three explanations come to mind:

  1. I messed up the math or the simulation, you can spot the error and earn yourself a gift.
  2. Statistical coincidence, 10 players is a small sample, shit happens.
  3. Some mechanism is adjusting these players’ 2P% back to the mean within a game.

An example of #3 would be if players who start the game shooting well continue by taking more and worse shots, just like Bocskocsky saw happening after a streak. In fact, a high FG% game likely has streaks of makes after which the player will take bad shots and turn the high FG% game into an average FG% one. We can at least see evidence for these players shooting more often by looking at the correlation of their 2P% with attempts. Indeed, all three shoot more when they shoot well (right column).

Does that explanation sound plausible? That’s what bad science practice sounds like: alluring, seductive, and oh-so-reasonable. A post-hoc just-so story with little support in the data is still a crappy post-hoc just-so story if I came up with it myself. The bottom line is that I spent hours on that simulation and didn’t learn much of use, but I’ll be damned if I succumb to publication bias on my own blog.

warning bad science


Conclusion

Here’s what I learned from a week of digging into the dirt of hot hand research until my own hands got tired and bloody:

  1. Once we account for shot difficulty (or in cases like free throws where difficulty isn’t a thing), players shoot a bit (2%-4%) better after making a few shots in a row. Probably.
  2. Neither gamblers nor basketball fans are horribly confused by the “hot hand fallacy“, if they overestimate the chance of a successful streak continuing it’s not by much. Possibly.
  3. Science is hard. If you have a lot of analysis choices available, it’s very easy to let them lead you down a path of mirages. In the worst cases, your choice of analysis can lead you away from good conclusion (17% gap in Cornell!) and towards bad ones. Certainly.

Science – turns out it’s even harder than dating.

I Smell a Chart

I analyze a statistics chart that manages to conclude the opposite of what the data says by making every possible error.

You can draw a straight line through any three points, if you use a thick enough marker. ~Old Russian joke

Defense Against the Dark Arts

It is my job to arm you against the foulest abuses of statistics known to mankind! You may find yourselves facing meaningless p-values, misleading charts and baseless inferences in this room. Know only that no harm can befall you whilst I am here. All I ask is that you remain calm.

Now – be warned! The world of dark statistics is nasty and treacherous. Even while learning to combat the corruption one often becomes himself tainted by the stupidity. So, before we dissect this affirmative action story from fivethirtyeight, some dire warnings:

Beware the statistician’s fallacy. Life is messy and stats are hard. An experienced and motivated statistician will be able to find a nitpick in almost any article that uses statistics and dismiss the article’s conclusion while ignoring all the supporting evidence. I picked the above example not because the chart doesn’t present a conclusive case for its conclusion, but because it offers literally zero support for it.

Pick on someone your own size. It’s not hard to find people being stupid with statistics; I wouldn’t write about this story if it wasn’t from fivethirtyeight, an outlet I recently expressed my admiration for.

If you think affirmative action sucks – remember that reversed stupidity isn’t intelligence. A weak story misrepresenting the effects of affirmative action doesn’t “debunk” or discredit the entire endeavor. If you feel satisfaction because you’ve seen a critique of a weak argument for an opponent’s position while ignoring the strong ones, that’s the feeling of becoming stupider.

If you think affirmative action rocks – bad arguments hurt a good cause. First of all, it would be strange if someone values affirmative action for its own sake rather than valuing diversity and equality of opportunity. If you care about the latter, wouldn’t you be interested to learn how effective various policies are in promoting equality? In any case, supporting a bad argument is dishonest and makes your entire cause all the easier to dismiss. If a liberal friend of mine uses the Bible to justify allowing refugees, or a conservative friend claims that the refugee story is a corporate conspiracy to ensure cheap labor, I know they’re just in it for the signalling and their opinions can be safely ignored.

Remember, kids, epistemic virtue before statistics!

The goal of this post isn’t to say anything at all about affirmative action (AA) policies, but to show how a chart and a piece of data analysis can go terribly, horribly wrong.


Chart Forensics

Here’s the offending chart from Hayley Munguia’s article:

aa_hispanic

At first glance, nothing insidious seems to be going on. Some grey dots, some red dots and some lines showing a relationship between them. We’ll need to break this chart into pieces to see how every chunk is individually wrong, and how they combine into a true abomination of data science.

Title

aah_title

Clear, informative, straightforward. The only problem with the title is that it was written before the author actually looked at the data, and the data unfortunately refused to cooperate.

X-Axis

aah-xaxis

Hold on, AA kicks in only when someone applies to college, so why are we looking at share of state population instead of share of college applicants? Hispanics drop out of high school at a much higher rate (15% vs. 10% for blacks and 5% for whites). I’m pretty sure high school dropouts don’t apply to college. On the flip side, perhaps more Hispanics who are hard-working but marginal scholars make the smart choice of pursuing a vocational career instead of wasting four years on a useless degree with negative ROI and a mountain of debt.

Y-Axis

aah-yaxis

This measure makes slightly more sense, but the number we’re interested in is the acceptance rate of Hispanics, whether relative to the number of applicants or to population. That in itself should be one of the axes, probably the only one. Wouldn’t the chart be much clearer if it looked like this:

aah-better.png
Data is made up, this is an illustration of a non-obscurantist chart.

In fact, I’m afraid a simple box plot would make the story so clear that there wouldn’t be an article at all.

Once the axes don’t make sense and don’t measure what you’d want to measure, nothing is there to stop a torrent of baffling obfuscation.

Trend Lines

aah-trend.png

A trend line is useful when we are looking to extrapolate a point outside a data set from the information inside it. For example, predicting an unknown future value from a past time series. Nothing like that happens here: the entire USA is inside the data set and not outside it.

The article has a chart for black students in which the “ban” trend has a higher slope. Does this mean that banning AA is good for blacks? Should we extrapolate that a hypothetical state with a ban on AA and 75% black population will have 90% black college students? We’ll get back to the trend line later.

Data Points

aah-point

It could be argued that the trend of enrollment vs. population size shows how larger populations of Hispanics are affected. If so, why not account for population directly with a single point for each state or each million people? Instead, each data point shows the enrollment in separate colleges, and each college is given the same weight regardless of size.

aah-texas
Texas public research universities

Texas is the third column from the right, it’s shown in gray as a state that (currently) has AA. It also has a lot of gray dots very high on the chart (making AA look good), which immediately got me suspicious. Texas’ two largest universities, Texas A&M and UT Austin, sit at 15% and 21% Hispanic enrollment respectively, well below the “trend” line. The two universities at the top, UTRGV and UTEP, have a combined enrollment about equal to UT Austin, but count as two different data points.

That doesn’t mean they affect the results twice as much, it’s much worse than that: by all appearances, the “trend” lines are a product of simple linear regression, which is calculated using the method of “least squares“. Without getting too technical, each point “pulls” the regression line towards it with a strength proportional to the square of its distance from the line (was that too technical?). A point that is twice as far from the line pulls 4 times as hard. Points that are really far away from the line are outliers, and have a huge influence on the slope of the line. In cases with outliers it is wiser to exclude those points or avoid using least squares altogether.

Without UT Austin (21%) and UTEP (73%) the regression line at Texas is around 28%, which is much closer to UT Austin. This means that UTEP as a whole has 41 times as much influence as UT Austin, and each student at UTEP has 95.8 times as much influence!

\frac{(73\%-28\%)^2 \times 51,000}{(21\%-28\%)^2 \times 22,000} = 95.8
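For what it’s worth, here’s that calculation in code; the 51,000 and 22,000 enrollment figures and the squared-distance weighting are the same ones used in the formula above:

```python
# Influence scales with the squared distance from the line, weighted here by enrollment
utep   = (0.73 - 0.28) ** 2 * 51_000
austin = (0.21 - 0.28) ** 2 * 22_000
print(round(utep / austin, 1))   # ≈ 95.8
```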

Besides accounting for size, the data can also be aggregated at different levels: by state, by college, all ban/non ban states together etc. Using the wrong aggregation level can mistakenly lead to the opposite interpretation of the actual data, in what is known as Simpson’s paradox. Ironically enough, the most famous example of the paradox in action was in a controversy about college admissions.

In 1973 UC Berkeley (which we’ll get back to) was sued for gender discrimination because it admitted 44% of male applicants but only 35% of female ones. However, when looking at individual departments, the majority were likelier to admit a woman. The secret? More women applied to extremely competitive departments like English (7% admission) while men applied more to less selective departments like chemistry (65%). Since the departments make their own admission decisions, grouping by department gave the correct conclusion that there was no bias against women.
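If you’ve never seen Simpson’s paradox in action, here’s a toy illustration (the numbers are made up for the example, not Berkeley’s actual admissions data):

```python
# dept: (men applied, men admitted, women applied, women admitted)
depts = {
    "selective": (100, 10, 600, 66),   # women do slightly better: 11% vs 10%
    "easy":      (600, 390, 100, 66),  # women do slightly better: 66% vs 65%
}

for name, (ma, mi, wa, wi) in depts.items():
    print(f"{name}: men {mi / ma:.0%}, women {wi / wa:.0%}")

men_pooled = sum(v[1] for v in depts.values()) / sum(v[0] for v in depts.values())
women_pooled = sum(v[3] for v in depts.values()) / sum(v[2] for v in depts.values())
print(f"pooled: men {men_pooled:.0%}, women {women_pooled:.0%}")  # men 57%, women 19%
```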

Banning AA is a state-wide decision, so what happens when you group by college instead?

UTEP is situated in El Paso, a city that is literally right on the Mexican border, has 81% Hispanic population yet enrolls only 73% Hispanic students. So UTEP under-admits Hispanics, and so does almost every college in Texas, but since a few colleges just happen to have a ton of Hispanic applicants, on the chart it makes Texas look like it has a great record of admitting Hispanics! Framing UTEP as an argument for affirmative action helping Hispanics in Texas isn’t absentmindedly negligent, it’s criminally creative.

The Unintelligible Cloud of Dots Close to the Origin

aah-cloud

Of course, on the other side of UTEP we have a bunch of colleges in states with low Hispanic population, the ones that would be most interested in increasing Hispanic enrollment in the name of diversity. The baffling choice of axes makes all those colleges invisible and reduces their effect on the regression line to practically zero. It’s hard to see this horrible mess, but I can tell two things about states with low Hispanic population:

  1. States that have banned affirmative action have a much higher relative enrollment of Hispanics than those with AA. We can see it in the red “trend” line being higher on the left side of the chart, where all the points are.
  2. This point directly contradicts the story that the article is trying to sell you, so the data was squeezed into an indecipherable jumble.

To clarify, here’s a zoomed in version of the jumbled region:

aah-dong.png

Going back to Berkeley, this NY Times story on the “holistic” admission process in California can shed some light on why states that ban AA could have more Hispanics admitted. Without AA, the favoring of underrepresented races is just as strong; it’s just not explicit. Berkeley is 43% Asian in a state that’s only 15% Asian, and here’s what the NY Times writer observed:

After the next training session, when I asked about an Asian student who I thought was a 2 but had only received a 3 [lower is better], the officer noted: “Oh, you’ll get a lot of them.”


 The Real Picture

– psst

-What?

Did you know that Hispanics have a higher college enrollment rate than whites?

-Oh, you mean if we compare equally qualified applicants?

No, we’re not controlling for qualification.

-You’re probably adjusting for socioeconomic status or something.

Nope, it’s the entire national population of high school graduates aged 18-24, everyone who could possibly want to apply to college.

-No way.

Yes fucking way, from the Pew Research Center:

pew hispanic.png

No matter how you twist the data, three things seem to be pretty obvious:

  1. Dropping out of high school negatively affects your chances of enrolling in college.
  2. If you didn’t drop out, being Hispanic (as opposed to white) doesn’t negatively affect your chances of enrolling in college.
  3. Banning affirmative action doesn’t affect Hispanic enrollment in college, except maybe giving it a little boost in states with low Hispanic populations.

So how did Ms. Munguia deduce the conclusion that banning AA hurts Hispanics from the data? She never did; the conclusion came first, and the chart made it too confusing for most people to tell either way. Deciding on a conclusion ahead of time and then sticking with that conclusion despite your own fricking data contradicting it is something you’d expect from, I don’t know, The British Journal of General Practice or something, not from 538. BTW, that’s a pretty good link if you like hate-reading about statistics abuses as much as I (clearly) do.


Train Your Nose

The goal of this post isn’t to attack affirmative action but to train you to spot dirty bullshit in a pretty chart. The wisdom of the elders says that it is easy to lie with statistics, but it is easier to lie without them. Bad charts and bad statistics leave telltale traces of chicanery: fishy measurement units, dubious aggregation levels, irrelevant regression lines and much more. A trained nose can spot these a mile away.

Detecting all the giveaways is hard, so here are two quick tips that will do you the most good:

  1. Don’t look at the title! Look at the actual chart first and see what result jumps out of it. If the title is about something else, it may be that the bottom line was written before the data.
  2. Ask yourself: what should the simplest chart look like that answers the question posed? If the chart in front of you looks nothing like the one you imagined, someone’s hiding something. That something is probably the truth.
aah-much_better.png
Again, this chart has no actual data, and yet is much more accurate than the original.

I emailed Ms. Munguia a few days ago with a more polite summary of the questions I had about her chart and the data. I’ll update this post immediately if she writes me back, even (especially!) if her response makes me look like an idiot.


Next post is finally up.

P.S.

There’s now a full, clear archive of all the posts so far, which I also hope to turn into a suggestion thread.