## Poked

I calculate high-index roots in my head while waiting for a poke bowl.

This was supposed to be the post analyzing the survey results. Then I thought: if I’m writing that, I may as well show examples of some basic Bayesian analysis, like using likelihood ratios. And if I’m doing analysis, I may as well give some more background on data science and also show how the results depend on assumptions. And if the results depend on assumptions, I may as well fit a full consequential model with continuous interdependent parameters and the appropriate prior.

Bottom line: I spent the week reading a textbook on data analysis and didn’t write anything. Instead, this short post is a sequel to Conned, part of an emerging series tentatively called “what it’s like being a crazy person who nitpicks random numbers he sees”.

So, a crazy person walks into a new poke restaurant. First, he notices that this restaurant, like the last 7 poke restaurants he went to, isn’t called Pokestop. This is puzzling, because the perfect name for a poke restaurant exists, and it’s Pokestop.

Then, the crazy person notices a Number:

200,000! That’s even more than the number of trees we could save by paying our electricity bills online!

The crazy person flips the menu, and gets so caught up in the math that he somehow orders a grotesque monstrosity made of surimi (I learned that it’s just a fancy word for imitation crab sticks), mango, seaweed, and Hawaiian salt (I learned that it’s just a fancy word for salt).

As the astonished cook reaches for the salted mango, the crazy person starts doing mental math.

200,000 combinations and we have 6 categories, so the average number of items in each category must be the 6th root of 200,000, or the cube root of the square root of 200,000. The square root of every even power of 10 is easy, i.e. √10,000 = 100. We’ll break 200,000 into 10,000 * 20. 20 is between 16 and 25, so the square root of 20 is ~4.5. This means that √200,000 ≈ 100 * 4.5 = 450. OK, I need the cube root of 450. Do I remember any cubes? 10^3 = 1,000, that’s too much. 8^3 = 2^9 = 512, bingo! The sixth root of 200,000 is the cube root of 450, which is just below 8, so there should be 7-8 combinations on (geometric) average in each category. (The actual answer turns out to be 7.65.)
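If you'd rather check the mental math in code than in your head, here's the same estimate done directly (a quick sanity check, not part of the original bowl-side calculation):

```python
# The 6th root of 200,000: as the cube root of the square root.
square_root = 200_000 ** 0.5        # estimated as ~450 in the text
sixth_root = square_root ** (1 / 3)

print(round(square_root))           # 447
print(round(sixth_root, 2))         # 7.65
```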

That’s how I do math quickly in my head. I remember a few basic facts (like powers of 2 up to 1,024) and a few basic rules (like (a^m)^n = a^(mn)). I can get an approximate answer in my head to almost any calculation including roots, logarithms and exponents faster than I can pull out my phone. I taught a workshop training my MBA classmates to do this before consulting interviews. One Chinese girl was so impressed by this workshop that she dated me for a month even though she’s a straight 10 and I’m a 6.5 if I get a good haircut.

Anyway, back to poke: 200,000 options is obviously way too low. The average number is close to 7 or 8, but several of the categories allow you to pick more than one item. For example, you can create 16 combinations of toppings by choosing to include or exclude any of the four toppings available. As for add-ins, being able to pick 6 out of 13 involves the combination function, and the combination function has factorials in it so you know it means business. By the time my bowl was done, I estimated that Koshe Poke are underselling themselves by at least two orders of magnitude. It turns out they’re off by a factor of 3,500:

710 million! You can try a different combination of poke each day without repeating yourself for almost 2 million years. Sometime around 1,509,464 AD, you’ll stumble upon a combination as horrible as surimi-mango-salt-seaweed and you’ll finally understand what it’s like to be me, a crazy person living in a world of crazy numbers.
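The kind of counting involved can be sketched in a few lines. Only two ingredients of the calculation come from the menu as described above — the 4 include-or-exclude toppings and picking 6 of 13 add-ins — so treat this as a partial sketch, not the full 710-million computation:

```python
from math import comb

toppings = 2 ** 4        # include or exclude each of 4 toppings: 16 subsets
add_ins = comb(13, 6)    # choose 6 of 13 add-ins: the combination function

print(toppings)              # 16
print(add_ins)               # 1716
print(toppings * add_ins)    # 27456 bowls from just these two categories
```

Multiply in bases, proteins, sauces and the rest, and two orders of magnitude over 200,000 comes quickly.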

## Multiplicitous

Protect yourself from p-hacking with precision, whether you’re doing drugs or gambling.

## P-Vices

Of all the terrible vices known to man, I guess NFL gambling and alternative medicine aren’t very terrible. Making medicine that doesn’t work (if it worked, it wouldn’t be alternative) is also a tough way to make money. But if you’re able to squeeze a p-value of 0.05 for acupuncture out of a trial that clearly shows that acupuncture has zero effect, you can make money and get a PhD in medicine!

It’s also hard to make money off of gambling on the NFL. However, you can make money by selling NFL gambling advice. For example, before the Eagles played as 6 point underdogs on the turf in Seattle after a 208 yard rushing game, gambling guru Vince Akins declared:

The Eagles are 10-0 against the spread since Dec 18, 2005 as an underdog and on turf after they’ve had more than 150 yards rushing last game.

10-0! Betting against the spread is a 50-50 proposition, so 10-0 has a p-value of 1/(2^10) = 1/1,024 ≈ 0.001. That’s enough statistical significance not just to bet your house on the Eagles, but also to get a PhD in social psychology.

The easiest way to generate the p-value of your heart’s desire is to test multiple hypotheses, and only report the one with the best p-value. This is a serious enough problem when it happens accidentally to honest and well-meaning researchers to invalidate whole fields of research. But unscrupulous swindlers do it on purpose and get away with it, because their audience suffers from two cognitive biases:

1. Conjunction fallacy.
2. Sucking at statistics.

No more! In this new installment of defense against the dark arts we will learn to quickly analyze multiplicity, notice conjunctions, and bet against the Eagles.

## Hacking in Depth

[This part gets stats-heavy enough to earn this post the Math Class tag. If you want to skip the textbooky bits and get down to gambling tips, scroll down to “Reading the Fish”]

The easiest way to generate multiple hypotheses out of a single data set (that didn’t show what you wanted it to show) is to break the data into subgroups. You can break the population into many groups at once (Green Jelly Bean method), or in consecutive stages (Elderly Hispanic Woman method).

The latter method works like this: let’s say that you have a group of people (for example, Tajiks) who suffer from a medical condition (for example, descolada). Normally, exactly one half of sick people recover. You invented a miracle drug that takes that number all the way up to… 50%. That’s not good enough even for the British Journal of General Practice.

But then you notice that of the men who took the drug 49% recovered, and of the women, 51% did. And if you only look at women above age 60, by pure chance, that number is 55%. And maybe 13 of these older Tajik women, because Tajikistan didn’t build a wall, happened to be Hispanic. And of those, by accident of random distribution, 10 happened to recover. Hey, 10 out of 13 is a 77% success rate, and more importantly it gives a p-value of… 0.046! Eureka, your medical career is saved with the publication of “Miracle Drug Cures Descolada in Elderly Hispanic Women” and you get a book deal.
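That 0.046 is just a binomial tail probability. A minimal sketch of the computation:

```python
from math import comb

def binom_tail(successes, n, p=0.5):
    """P(X >= successes) when X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(successes, n + 1))

# 10 or more recoveries out of 13, if each patient recovers
# with probability 1/2 regardless of the drug:
print(round(binom_tail(10, 13), 3))   # 0.046
```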

Hopefully my readers’ noses are sharp enough not to fall for 10/13 elderly Hispanic women. Your first guide should be the following simple rule:

Rule of small sample sizes – If the sample size is too small, any result is almost certainly just mining noise.

Corollary – If the sample size of a study is outnumbered 100:1 by the number of people who died because of that study,  it’s probably not a great study.

But what if the drug did nothing detectable for most of the population, but cured 61 of 90 Hispanic women of all ages? That’s more than two thirds, the sample size of 90 isn’t tiny, and it comes out to a p-value of 0.0005. Is that good enough?

Let’s do the math.

First, some theory. P-values are generally a terrible tool. Testing with a p-value threshold of 0.05 should mean that you accept a false result by accident only 5% of the time, yet even in theory using a 5% p-value makes a fool of you over 30% of the time.  P-values do one cool thing, however: they transform any distribution into a uniform distribution. For example, most samples from a normal distribution will lie close to the mean, but their p-values will be spread evenly (uniformly) across the range between 0 and 1.

Uniform distributions are easy to deal with. For example, if we take N samples from a uniform distribution and arrange them in order, they will fall on average on 1/(N+1), 2/(N+1), …, N/(N+1). If you test four hypotheses (e.g. that four kinds of jelly beans cause acne) and their p-values fall roughly on 1/(4+1) = 0.2, 0.4, 0.6, 0.8, you know that they’re all indistinguishable from the null hypothesis as a group.
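A quick simulation (illustrative, not from the original analysis) confirms the order-statistics claim:

```python
import random

random.seed(0)
N, trials = 4, 100_000
totals = [0.0] * N

# Average the sorted values of N uniform draws over many trials;
# the k-th smallest should land near k/(N+1).
for _ in range(trials):
    draws = sorted(random.random() for _ in range(N))
    for i, x in enumerate(draws):
        totals[i] += x

print([round(t / trials, 2) for t in totals])   # [0.2, 0.4, 0.6, 0.8]
```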

Usually you would only see the p-value of the best hypothesis reported, i.e. “Wonder drug cures descolada in elderly Hispanic women with p=0.0005”. The first step towards sanity is to apply the Bonferroni rule:

Bonferroni Rule – A p-value of α for a single hypothesis is worth about as much as a p-value of α/N for the best of N hypotheses.

The Bonferroni correction is usually given as an upper bound, namely that if you use an α/N p-value threshold for N hypotheses you will accept a null hypothesis as true no more often than if you use an α threshold for a single hypothesis. It actually works well as an approximation too, allowing us to replace no more often with about as often. I haven’t seen this math spelled out in the first 5 google hits, so I’ll have to do it myself.

h_1, …, h_N are the p-values of N independent tests of null hypotheses, so h_1, …, h_N are all uniformly distributed between 0 and 1.

The chance that at least one of the N p-values falls below α/N is P(min(h_1, …, h_N) < α/N) = 1 − (1 − α/N)^N ≈ 1 − e^(−α) ≈ 1 − (1 − α) = α, which is exactly the chance that a single p-value falls below α. The last bits of math there depend on the linear approximation e^x ≈ 1 + x when x is close to 0.
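The approximation is easy to verify numerically. Here α = 0.05 and N = 60 are just example values (60 being the number of subgroups that shows up later):

```python
from math import exp

alpha, N = 0.05, 60
best_of_N = 1 - (1 - alpha / N) ** N   # P(min of N p-values < alpha/N)

print(round(best_of_N, 4))             # 0.0488
print(round(1 - exp(-alpha), 4))       # 0.0488 -- the e^x ≈ 1 + x step
```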

The Bonferroni Rule applies directly when the tests are independent, but that is not the case with the Elderly Hispanic Woman method. The “cure” rate of white men is correlated positively with the rate for all white people (a broader category) and with young white men (a narrower subgroup). Is the rule still good for EHW hacking? I programmed my own simulated unscrupulous researcher to find out, here’s the code on GitHub.

My simulation included Tajiks of three age groups (young, adult, old), two genders (the Tajiks are lagging behind on respecting genderqueers) and four races (but they’re great with racial diversity). Each of the 2*3*4=24 subgroups has 500 Tajiks in it, for a total population of 12,000. Descolada has a 50% mortality rate, so 6,000 / 12,000 cured is the average “null” result we would expect if the drug didn’t work at all. For the entire population, we would get p=0.05 if only 90 extra people were cured (6,090/12,000) for a success rate of 50.75%. A meager success rate of 51.5% takes the p-value all the way down to 0.0005.
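Those thresholds follow from a normal approximation to the binomial null. A sketch using Python's standard library (my own check, not the simulation code itself):

```python
from statistics import NormalDist

n = 12_000
sd = (n * 0.25) ** 0.5   # ≈ 54.8 cures: the sd of Binomial(12000, 0.5)

def p_value(extra_cures):
    """One-sided P(at least extra_cures above the expected 6,000) under the null."""
    return 1 - NormalDist().cdf(extra_cures / sd)

print(round(p_value(90), 3))    # ≈ 0.05   -- a 50.75% success rate
print(round(p_value(180), 4))   # ≈ 0.0005 -- a 51.5% success rate
```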

Rule of large sample sizes – With a large sample size, statistical significance doesn’t equal actual significance. With a large enough sample you can get tiny p-values with minuscule effect sizes.

Corollary – If p-values are useless for small sample sizes, and they’re useless for large sample sizes, maybe WE SHOULD STOP USING FUCKING P-VALUES FOR HYPOTHESIS TESTING. Just compare the measured effect size to the predicted effect size, and use Bayes’ rule to update the likelihood of your prediction being correct.

P-values aren’t very useful in getting close to the truth, but they’re everywhere, they’re easy to work with and they’re moderately useful for getting away from bullshit. Since the latter is our goal in this essay we’ll stick with looking at p-values for now.

Back to Tajikistan. I simulated the entire population 1,000 times for each of three drugs: a useless one with a 50% success rate (null drug), a statistically significant one with 50.75% (good drug) and a doubly significant drug with 51.5% (awesome drug). Yes, our standards for awesomeness in medicine aren’t very high. I looked at the p-value of each possible sub-group of the population to pick the best one, that’s the p-hacking part.

Below is a sample of the output:

13 hispanic 1    0.122530416511473
14 female hispanic 2    0.180797304026783
15 young hispanic 2    0.25172233581543
16 young female hispanic 3    0.171875
17 white 1    0.0462304905364621
18 female white 2    0.572232224047184
19 young white 2    0.25172233581543
20 young female white 3   0.9453125
23 asian 1    0.953769509463538
24 female asian 2    0.819202695973217

The second integer is the number of categories applied to the sub-group (so “asian” = 1, “asian adult female” = 3). It’s the “depth” of the hacking. In our case there are 60 groups to choose from: 1 with depth 0 (the entire population), 9 with depth 1, 26 with depth 2, 24 with depth 3. Since we’re testing the 60 groups as 60 separate hypotheses, by the Bonferroni Rule the 0.05 p-value should be replaced with 0.05 / 60 categories = 0.00083.
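The group counts are easy to reproduce by enumerating every way to fix or ignore each dimension. The fourth race label below is my placeholder, since the text names only three:

```python
from itertools import product
from collections import Counter

# None means "dimension not specified"; anything else fixes a value.
ages = [None, "young", "adult", "old"]
genders = [None, "male", "female"]
races = [None, "white", "hispanic", "asian", "other"]  # "other" is a placeholder

depths = Counter(sum(v is not None for v in combo)
                 for combo in product(ages, genders, races))

print(dict(sorted(depths.items())))   # {0: 1, 1: 9, 2: 26, 3: 24}
print(sum(depths.values()))           # 60 groups
print(round(0.05 / 60, 5))            # 0.00083 -- the corrected threshold
```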

In each of the 1,000 simulations, I picked the group with the smallest p-value and plotted it along with the “hacking depth” that achieved it. The vertical lines are at p=0.05 and p=0.00083. The horizontal axis shows the hacked p-value on a log scale, and the vertical axis shows how many of the 1,000 simulations landed below it:

For the null drug, p-hacking achieves a “publishable” p-value below 0.05 far more often than the 5% of the time it should.

If your goal is to do actual science (as opposed to getting published in Science), you want to be comparing the evidence for competing hypotheses, not just looking at whether the null hypothesis is rejected. The null hypothesis is that we have a 50% null drug, and the competing hypotheses are the good and awesome drugs at 50.75% and 51.5% success rates, respectively.

Without p-hacking, the null drug will hit below p=0.05 5% of the time (duh), the good drug will get there 50% of the time, and the awesome drug 95% of the time. To a Bayesian, getting a p-value below 0.05 is a very strong signal that we have a useful drug on our hands: 50%:5% = 10:1 likelihood ratio that it’s the good drug and 95%:5% = 19:1 that it’s the awesome drug. If ahead of time we thought that each of the cases is equally likely (1:1:1 odds), our ratios would now be 1:10:19. This means that the probability that the drug is the null one went from 1/3 to 1/(1+10+19) = 1/30. The null drug is 10 times less likely.

If you’re utterly confused by the preceding paragraph, you can read up on Bayes’ rule and likelihood ratio on Arbital, or just trust me that without p-hacking, getting a p-value below 0.05 is a strong signal that the drug is useful.
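For the arithmetically inclined, the odds update from the preceding paragraphs is a few lines of code (a sketch of the same Bayes-rule arithmetic, not code from my simulation):

```python
def update(prior_odds, likelihoods):
    """Multiply prior odds by likelihoods and normalize to probabilities."""
    posterior = [p * l for p, l in zip(prior_odds, likelihoods)]
    total = sum(posterior)
    return [round(o / total, 3) for o in posterior]

# P(p < 0.05) is 5%, 50% and 95% for the null, good and awesome drugs:
print(update([1, 1, 1], [0.05, 0.50, 0.95]))   # [0.033, 0.333, 0.633]
```

The null drug's probability drops from 1/3 to 0.033, i.e. 1/30.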

With p-hacking, however, the good and the awesome drugs don’t do so well. We’re now looking at how often each drug falls below the Bonferroni Rule line of p=0.00083. Instead of 50% and 95% of the time, the good and awesome drugs get there 23% and 72% of the time. If we started from 1:1:1 odds, the new odds are roughly 1:5:15, and the probability that the drug is the null one is 1/21 instead of 1/30. The null drug is only 7 times less likely.

Rule of hacking skepticism – Even though a frequentist is happy with a corrected p-value, a Bayesian knows better. P-hacking helps a bad (null) drug more than it does a good one (which is significant without hacking). Thus, hitting even a corrected p-value threshold is weaker evidence against the null hypothesis.

You can see it in the chart above: for every given p-value (vertical line) the better drugs have more green points in its vicinity (indicating less depth of hacking) and the bad drug has more red because it has to dig down to a narrow subgroup to luck into significance.

Just for fun, I ran another simulation in which instead of holding the success probability for each patient constant, I fixed the total proportion of cures for each drug. So in the rightmost line (null drug) exactly 6,000 of the 12,000 were cured, for the good drug exactly 6,090 and for the awesome drug 6,180.

We can see more separation in this case – since the awesome drug is at p=0.0005 for the entire group, hacking cannot make it any worse (that’s where the tall green line is). Because the total number of cures is fixed for each drug, if one subgroup has more successes the others by necessity have fewer. This mitigates the effects of p-hacking somewhat, but the null drug still gets to very low p-values some of the time.

So what does this all mean? Let’s use the rules we came up with to create a quick manual for interpreting fishy statistics.

1. Check the power – If the result is based on a tiny sample size (especially with a noisy measure), disregard it and send an angry email to the author.
2. Count the categories – If the result presented is for a subgroup of the total population tested (i.e. only green beans, only señoritas) you should count N – the total number of subgroups that could have been reported. Jelly beans come in 50 flavors, gender/age/race combine in 60 subgroups, etc.
3. Apply the correction – divide the original threshold p-value by the N you calculated above. If the reported p-value is below that corrected threshold, it’s statistically significant.
4. Stay skeptical – Remember that a p-hacked result isn’t as good of a signal even with correction, and that statistical significance doesn’t imply actual significance. Even an infinitesimal p-value doesn’t imply with certainty that the result is meaningful, per the Rule of Psi.

Rule of Psi – A study of parapsychological ability to predict the future produced a p-value of 0.00000000012. That number is only meaningful if you have absolute confidence that the study was perfect, otherwise you need to consider your confidence outside the result itself. If you think that for example there’s an ε chance that the result is completely fake, that ε is roughly the floor on p-values you should consider.

For example, if I think that at least 1 in 1,000 psychology studies have a fatal experimental flaw or are just completely fabricated, I would give any p-value below 1/1,000 about as much weight as 1/1,000. So there’s a 1.2*10^-10 chance that the parapsychology meta-analysis mentioned above was perfect and got a false positive result by chance, but at least 1 in 1,000 chance that one of the studies in it was bogus enough to make the result invalid.
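The Rule of Psi amounts to mixing the reported p-value with the probability ε that the study itself is bogus. A sketch of that mixture:

```python
def effective_p(p, epsilon):
    """Reported p-value floored by the chance the study is flawed or fake.

    With probability epsilon the study is bogus and tells us nothing;
    only in the remaining (1 - epsilon) world does p mean anything.
    """
    return epsilon + (1 - epsilon) * p

# The parapsychology result: p = 1.2e-10, but epsilon = 1/1,000
# keeps the effective p-value stuck at about 0.001.
print(effective_p(1.2e-10, 0.001))
```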

Let’s apply our manual to the Eagles game:

The Eagles are 10-0 against the spread since Dec 18, 2005 as an underdog and on turf after they’ve had more than 150 yards rushing last game.

First of all, if someone tells you about a 10-0 streak you can assume that the actual streak is 10-1. If the Eagles had won the 11th game going back, the author would have surely said that the streak was 11-0!

The sample size of 11 is really small, but on the other hand in this specific case the error of the measure is 0 – we know perfectly well if the Eagles won or lost against the spread. This doesn’t happen in real life research, but when the error is 0 the experimental power  is perfect and a small sample size doesn’t bother us.

What bothers us is the insane number of possible variables the author could have mentioned. Instead of [Eagles / underdog / turf / after 150 yards rushing], the game could be described as [Seattle / at home / after road win / against team below average in passing] or [NFC East team / traveling west / against a team with a winning record] or [Team coming off home win / afternoon time slot / clear weather / opponent ranked top-5 in defense]. It’s hard to even count the possibilities, but we can try putting them in separate bins and multiplying:

1. Descriptors of home team – division, geography, record, result of previous game, passing/rushing/turnovers in previous game, any of the many season stats and rankings – at least 20
2. Descriptors of road team – at least 20
3. Game circumstances – weather, time slot, week of season, field condition, spread, travel, matchup history etc. – at least 10.

Even if you pick just 1 descriptor in each category, this allows you to “test” more than 20*20*10 = 4,000 hypotheses. What does it mean for the Eagles? A 10-1 streak has a p-value of 0.0107, about 1/93. But we had 4,000 potential hypotheses! 1/93 is terrible compared to the p=1/4,000 we would have expected to see by chance alone.

Of course, this means that the gambling guru didn’t even bother testing all the options, he did just enough fishing to get a publishable number and published it. But with so much potential hacking, 10-1 is as much evidence against the Eagles as it is in their favor. The Eagles, a 6 point underdog, got their asses kicked by 11 points in Seattle.

You can apply the category-counting method whenever the data you’re seeing seems a bit too selected. The ad above tells you that Trinity is highly ranked in “faculties combining research and instruction”. This narrow phrasing should immediately make you think of the dozens of other specialized categories in which Trinity College isn’t ranked anywhere near the top. A college ranked #1 overall  is a great college. A college ranked #1 in an overly specific category within an overly specific region is great at fishing.

## To No And

It’s bad enough when people don’t notice that they’re being bamboozled by hacking, but digging deep to dredge up a narrow (and meaningless) result can sound more persuasive than a general result. Here’s an absolutely delightful example, from Bill Simmons’ NFL gambling podcast:

Joe House: I’m taking Denver. There’s one angle that I like. I like an angle.

Bill: I’m waiting.

Joe House: Defending Super Bowl champions, like the Denver Broncos, 24-2 against the spread since 1981 if they are on the road after a loss and matched up against a team that the previous week won both straight up and against the spread and the Super Bowl champion is not getting 7 or more points. This all applies to the Denver Broncos here, a wonderful nugget from my friend Big Al McMordie.

Bill: *sigh* Oh my God.

I’ve lost count of how many categories it took House to get to a 24-2 (clearly 24-3 in reality) statistic. What’s impressive is how House sounds more excited with each “and” he adds to the description of the matchup. To me, each new category decreases the likelihood that the result is meaningful by multiplying the number of prior possibilities. To House, it seems like Denver fitting in such an overly specific description is a coincidence that further reinforces the result.

This is called the conjunction fallacy. I’ll let Eliezer explain:

The conjunction fallacy is when humans rate the probability P(A&B) higher than the probability P(B), even though it is a theorem that P(A&B) <= P(B).  For example, in one experiment in 1981, 68% of the subjects ranked it more likely that "Reagan will provide federal support for unwed mothers and cut federal support to local governments" than that "Reagan will provide federal support for unwed mothers." […]

Which is to say:  Adding detail can make a scenario SOUND MORE PLAUSIBLE, even though the event necessarily BECOMES LESS PROBABLE. […]

In the 1982 experiment where professional forecasters assigned systematically higher probabilities to “Russia invades Poland, followed by suspension of diplomatic relations between USA and USSR” versus “Suspension of diplomatic relations between USA and USSR”, each experimental group was only presented with one proposition. […]

What could the forecasters have done to avoid the conjunction fallacy, without seeing the direct comparison, or even knowing that anyone was going to test them on the conjunction fallacy?  It seems to me, that they would need to notice the word “and”.  They would need to be wary of it – not just wary, but leap back from it.  Even without knowing that researchers were afterward going to test them on the conjunction fallacy particularly.  They would need to notice the conjunction of two entire details, and be shocked by the audacity of anyone asking them to endorse such an insanely complicated prediction.

Is someone selling you a drug that works only when the patient is old and a woman and Hispanic? A football team that is an underdog and on turf and good at rushing? One “and” is a warning sign, two “ands” are a billboard spelling BULLSHIT in flashing red lights. How about 11 “ands”?

11 “ands” is a level of bullshit that can only be found in one stinky stall of the washrooms of science, the gift that keeps on giving, the old faithful: power posing. After power posing decisively failed to replicate in an experiment with 5 times the original sample size, the authors of the original study listed 11 ways in which the experimental setup of the replication differed from the original (Table 2 here). These differences include: time in poses (6 minutes in replication vs. 2 minutes in the original), country (Switzerland vs. US), filler task (word vs. faces) and 8 more. The authors claim that any one of the differences could account for the failure of replication.

What’s wrong with this argument? Let’s consider what it would mean for the argument to be true. If it’s true that any of the 11 changes to the original setup could destroy the power posing effect, it means that the power posing effect only exists in that very specific setup. I.e. power posing only works when the pose is held for 2 minutes and only for Americans and only after a verbal task and 8 more ands. If power posing requires so many conjunctions, it was less probable to start with than the chance of Amy Cuddy admitting that power posing isn’t real.

The first rule of improv comedy is “Yes, and…” The first rule of statistical skepticism is “And…,no.”

Rule of And…, no – When someone says “and”, you say “no”.

## The Power of Power Skepticism

A quick estimate of experimental power can discern good science from bullshit. Too bad that some scientists forget to do it themselves.

I’m back from vacation! Here are some of the things we did abroad:

1. Ate too much octopus. Or is it “too many octopods”?
2. Bribed a cop.
3. Watched quail chicks hatch.
4. Got locked in an apartment and escaped by climbing from roof to roof.
5. Poked a sea turtle.
6. Endured getting stung by mosquitoes, jellyfish and fire coral.
7. Swam in the sea. Swam in a cave. Swam in a river. Swam in a cave-river flowing into the sea.
8. Snuck into a defunct monkey sanctuary whose owner was killed by a runaway camel and which is now haunted by mules and iguanas.

I returned from the wilderness tanner, leaner, and ready to expound about proper epistemology in science. This is going to get really technical, I need to get my math-cred back after writing 5,000 words on feminism, Nice Guys and living with my ex-girlfriend.

## Effect-skeptic, not Ovulation-skeptic

This post is about gauging the power of experiments in science. I originally presented this  to a group of psychology grad students, as advice on how to avoid publishing overconfident research that will fail to replicate (I’ve never actually published a scientific article myself, but neither was I ever accused of excess humility). This subject is just as relevant to people who read about science: quickly estimating the power of an experiment you read about can give you a strong hint whether it’s a groundbreaking discovery or a p-hacked turd sandwich. Like other tools in the Putanumonit arsenal of bullshit-detectors, this is tagged defense against the dark arts.

I was inspired by Prof. Uri Simonsohn, data vigilante and tireless crusader against the dark arts of fake data, misleading analysis and crappy science. Simonsohn recently appeared on Julia Galef’s Rationally Speaking podcast and had a very interesting take on science skepticism:

Some people would say you really have to bring in your priors about a phenomenon before accepting it. I think the risk with that is that you end up being too skeptical of the most interesting work, and so you end up, in a way, creating an incentive to doing obvious and boring research.

I have a bit of a twist on that. I think we should bring in the priors and our general understanding and skepticism towards developing the methodology, almost blind to the question or the hypothesis that’s being used.

Let’s say you tell me you ran an experiment about how preferences for political candidates shift. Then I should bring to the table how easy it is to shift political preference in general, how noisy those measures are and so on, and not put too much weight on how crazy I think it is that you tell me you’re changing everything by showing an apple below awareness. My intuition on how big the impact of apples below awareness are in people is not a very scientific prior. It’s a gut feeling.

I don’t know that the distinction is clear, but when it’s my prior about the specific intervention you’re claiming, there I try not to trust my intuition. And the other one is, what do I know about the reliability of the measures, how easy it is to move the independent variable? There I do, because in the latter case it’s based on data and the other one is just my gut feeling.

Here’s how I understand it: when you read a study that says “A causes B as measured by the Tool of Measuring B (ToMB)” you usually know more about B (which is some variable of general interest) and the accuracy of ToMB than you know about A (which is usually something the study’s authors are experts on). Your skepticism should be based on these two questions:

• How easy it is to move variable B and by how much?
• Is the sample size large enough to measure the effect with ToMB?

You should not be asking yourself:

• How likely was A to cause the effect?

Because you don’t know. If everyone had a good idea of what A does, scientists wouldn’t be researching it.

As Simonsohn alludes to, political-choice research is notorious for preposterous effect size claims. Exhibit A is the unfortunate “The fluctuating female vote: politics, religion, and the ovulatory cycle” which claims that 17% of married women shifted from Obama to Romney during ovulation. This should smell like bullshit not because of the proposed cause (ovulation) but because of the proposed effect (voting switch): shifting even 1% of voters to the opposite party is insanely hard. Reason doesn’t do it. Self-interest doesn’t do it. I suspect that we would notice if 10 million women changed their political affiliation twice a month.

Finding an effect that large doesn’t mean that the experiment overstated a small effect, it means that the experiment is complete garbage. +17% is basically as far away from +1% as it is from -1% (effect in the opposite direction). An effect that’s 20 times larger than any comparable intervention doesn’t tell you anything except to ignore the study.

Per Simonsohn, the ovulation part should not be a reason to be doubtful of the particular study. After all, the two lead authors on “fluctuating” (both women) certainly know more about ovulation than I do. In fact, the only thing I know of that correlates highly with ovulation is the publishing of questionable research papers.

## Stress Testing

For scientists, the time to assess the possible effect sizes is before the experiment is conducted, not afterwards. If the ovulation researchers predicted an effect of +0.5% before doing the study, the +17% result would’ve alerted them that something went wrong and spared them from embarrassment when the study was published and when it was (predictably) contradicted by replications.

Estimating the effect size can also alert the researchers that their experiment isn’t strong enough to detect the effect even if it’s real, that they need to increase the sample size or design more accurate measurements. This could’ve prevented another famous fiasco in psychology research: the power pose.

“Power posing” is the subject of the second most popular TED talk of all time. I got the link to it from my ex-girlfriend, who eagerly told me that watching it would change my life. I watched it (so you shouldn’t have to), read the original study by Carney, Cuddy and Yap (CC&Y), and offered to bet my ex-girlfriend $100 that it wouldn’t replicate. Spoiler alert: it didn’t, and my scientific skepticism skills came out looking superior. Let’s see how predictable the powerlessness of power posing was ahead of time.

CC&Y claim that holding an expansive “power” pose not only increases self-reported “feeling of power” but also lowers levels of cortisol – a hormone that is released in the body in response to stress and affects functioning. We want to find out how cortisol fluctuates throughout the day (which would affect measurement error) and how it responds to interventions (to estimate effect size).

A quick scholar-Googling leads us to a 1970 paper on the circadian cortisol pattern in normal people. Cortisol levels vary daily over the range between 5 and 25 µg/dl (shown on the chart as 0-20). Daily mean cortisol levels vary by about 4 µg/dl person-to-person (that’s the standard deviation), and measurements 20 minutes apart for the same person vary by about 2.5 µg/dl. There’s also an instrument error (how accurately taking a saliva sample measures actual cortisol levels), which is too annoying to google. Since CC&Y measure the difference in cortisol levels between two groups of 21 people 17 minutes apart, their standard error of measurement should be around:

$\sqrt{\frac{4^2 + 2.5^2 + something^2}{21}} \approx 1.2$

Ideally, to measure the effect of anything on cortisol, that effect should be at least 3 times the measurement error, or around 3.6 µg/dl. A source for possible effect size estimates is this work on caffeine, stress and cortisol by Lovallo 2008.
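Under these assumptions — a hypothesized 1 µg/dl effect against a ~1.2 µg/dl standard error, with a two-sided p=0.05 test — the power calculation goes roughly like this (my own sketch, using a plain normal approximation):

```python
from statistics import NormalDist

z = NormalDist()
effect = 1.0              # hypothesized cortisol drop, µg/dl
se = 1.2                  # standard error of the group difference, µg/dl
crit = z.inv_cdf(0.975)   # ≈ 1.96, the two-sided p = 0.05 threshold

# Chance the measured difference clears the threshold in either direction:
power = (1 - z.cdf(crit - effect / se)) + z.cdf(-crit - effect / se)
# Chance the measured difference comes out with the wrong sign:
sign_error = z.cdf(-effect / se)

print(round(power, 2))        # 0.13
print(round(sign_error, 2))   # 0.2
```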
The caffeine paper uses a slightly different measurement of cortisol, but we can superimpose the 1-5 range in its chart on the normal daily 5-25 µg/dl level. Best I could tell, drinking a ton of coffee affects cortisol levels by 4 µg/dl. More interestingly, the subjects in Lovallo’s study underwent a “stress challenge” at 10 am specifically designed to raise their cortisol. Which it did, by around 2 µg/dl.

I may be willing to accept that 1 minute of posing has half the effect of a 30-minute stress challenge on stress hormones, but no more. That means that with only 42 participants, CC&Y are looking for a 1 µg/dl effect in a measurement with 1.2 µg/dl error. These numbers mean that the experiment has 13% power (I’m perhaps too generous) to detect the effect even at the weak p=0.05 level. The experiment has a 20% chance to find an effect with the opposite sign (that power poses raise cortisol instead of reducing it) even if CC&Y’s hypothesis is true.

Unless something is inherently broken with the measurement methodology, the straightforward solution to increase experimental power is to increase the sample size. How much does it cost to pay an undergrad to spend 20 minutes in the lab and spit in a tube? A cynic would say that the small sample size was designed to stumble upon weird results. I don’t know if I’m that cynical, but the replication recruited 200 subjects and found that the effect on cortisol is as follows:

When an underpowered experiment finds a significant effect, it’s rarely because the scientists got lucky. Usually, it’s because the experiment and analysis were twisted enough to measure some bias or noise as a significant effect. It’s worse than useless.

## Too Late for Humility

There’s nothing wrong with running an underpowered experiment as exploratory research – a way of discovering fruitful avenues of research rather than establishing concrete truths.
Unfortunately, my friends who actually publish papers in psychology tell me that every grant request has “power = 80%” written in it somewhere, otherwise the research doesn’t get funded at all. A scientist could still expend the effort of calculating the real experimental power (13% is a wild guess, but it’s almost certainly below 20% in this case), even if that number is to be kept in a locked drawer. If she does, she’ll be skeptical enough not to trust results that are too good to be true.

Here’s a beautiful story of two scientists who got a result that seemed a tad too good for the quality of their experimental set-up. They stayed skeptical, ran a replication with pre-planned analysis (the effect promptly disappeared) and spun the entire ordeal into a publication on incentive structures for good science practices! Here’s the underappreciated key part in their story: “We conducted a direct replication while we prepared the manuscript.” Humility and skepticism have a chance to save your soul (and academic reputation), but only until you are published.

Perhaps Carney and Cuddy would’ve agreed with my analysis of the 13% power, but there’s no sign that they did the calculation themselves. Even if they originally wrote “80%” just as a formality to get funded, once they neglected to put a real number on it nothing kept them from believing that 80% is true. Confirmation bias is an insidious parasite, and Amy Cuddy got the TED talk and a book deal out of it even as serious psychologists rushed to dismiss her findings.

In her TED talk, Cuddy promises that power posing has the power to change anyone’s life. In a journal reply to the replication by Ranehill, she’s reduced to pleading that power posing may work for Americans but not Swiss students, or that it worked in 2008 but not in 2015, or only for people who have never heard of power posing previously, making the TED talk self-destructive.
If you’re left wondering how Carney, Cuddy and Yap got the spurious results in the first place, they oblige to confess it themselves: “Ranehill et al. used experimenters blind to the hypothesis, and we did not. This is a critical variable to explore given the impact of experimenter bias and the pervasiveness of expectancy effects.”

Let me translate what they’re saying: we know that non-blind experiments introduce bogus effects, and that our small-sample non-blind experiment found an effect while a large-sample blind study didn’t, but we refuse to consider the possibility that our study was wrong because we’re confirmation biased like hell and too busy stackin’ dem benjamins.

## Questioning Love and Science

Let’s wrap up with a more optimistic example of using power-estimation to inform us as readers of research, not scientists. At the end of Love and Nice Guys I mentioned the article on the 36 questions that create intimacy, itself based on this research by Arthur Aron et al. Again, we withhold judgment on the strength of the intervention (the 36 questions) and focus on the effect, in this case as measured by the corny-sounding Inclusion of Other in the Self Scale (IOS).

A search of interventions measured by IOS leads us to the aptly-titled “Measuring the Closeness of Relationships” by Gachter et al., which includes the following table:

The IOS measures an equal difference (1.4-1.5) between good friends and either the “closest, deepest, most involved, and most intimate relationship” or, on the other hand, an “acquaintance, but no more than an acquaintance”. The SD of the scale is 1.3, and since Aron has 50 people in each group (36 questions vs. just small talk) we divide 1.3 by the square root of 50 to get a standard error of 0.18. To achieve 80% power at p=0.05 (which should be the bare minimum standard) the effect of an hour discussing the 36 questions should be at least 0.5, or roughly one third of the distance between acquaintances and friends.
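The 0.18 standard error and the 0.5 minimum detectable effect take a few lines to reproduce (a sketch, using the SD and group size from the text and mirroring its arithmetic):

```python
from statistics import NormalDist
from math import sqrt

norm = NormalDist()

sd, n = 1.3, 50              # IOS standard deviation, people per group
se = sd / sqrt(n)            # standard error, as computed in the text
z_sig = norm.inv_cdf(0.975)  # cutoff for significance at p = 0.05
z_pow = norm.inv_cdf(0.80)   # extra margin needed for 80% power

# Smallest true effect the study could reliably detect.
min_effect = (z_sig + z_pow) * se
print(f"SE ≈ {se:.2f}, minimum detectable effect ≈ {min_effect:.2f}")
```

The minimum detectable effect comes out a shade above 0.5 IOS points.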
(Here’s a power-calculator; I just use Excel.) An intimate hour taking people one third of the way to friendship doesn’t seem implausible, and in fact the study finds that the intimacy-generating 36 questions increase IOS by 0.88: high enough to be detectable, low enough to be plausible. We don’t know if the IOS is really a great measure of intimacy and potential for love, but that’s outside the scope of the study. They found the effect that was there, and I expect these findings to replicate when people try to confirm them. Putanumonit endorses asking potential partners “how do you feel about your relationship with your mother?” (question 24) as trustworthy science.

The whole of the googling and power-calculating took me just under half an hour. I’m not this energetically skeptical about every piece of science news I hear about, but I am more likely to be suspicious of research that pops up on my Facebook wall. If you read about a new study in the NY Times or IFLScience, remember that it’s covered because it’s exciting. It’s exciting because it’s new and controversial, and if it’s new and controversial it’s much more likely than other research to be flat out wrong. If you think of power-posing every day or are planning to seduce the man of your dreams with questions, 30 minutes of constructive skepticism can keep you out of the worst trouble.

## Dating: a Research Journal, Part 3

Is dating a game? It is if I can apply some game theory to figure it out.

To recap the series so far: part 1 talked about economics (comparative advantage), algorithms (pursuing vs. choosing) and marketing (personalizing your message and standing out from the crowd). The second part applied simple algebra to “hack” OkCupid’s match percentage. Is there any quantitative theory that I haven’t yet mangled in the pursuit of dating advice?
## Part 3 – Don’t hate the game

“Love is a game that two can play and both win by losing their heart.” – Eva Gabor

Game theory is a laughable attempt to simplify complex and uncertain human interactions to simple models in which rational actors choose from a limited set of strategies in pursuit of simple, well-known payouts. The sparse triumphs of game theory have come from informing straightforward problems like nuclear disarmament and counter-terrorism. Only a maniac could think to apply game theory to the infinitely more complex problem of texting after a date.

We’ll use the simplest of game theory. Our games will have two players: Alice and Bob. This is by mathy convention and because threesomes are hard, not because I have anything against gay orgies. Alice and Bob have a lot of possible actions they can take at any stage of a relationship; those fall into two broad classes: actions that are beneficial to the other person (messaging, setting up a date, commitment, being a loving partner for decades) and actions that aren’t and are primarily selfish. Let’s broadly call these Woo and Neglect.

The actions Alice and Bob take result in outcomes for both of them, anything from the small joy of a “you’re cute” text to the heartbreaking pain of a “ur Kut” text. The outcomes are not necessarily selfish: happiness for your beloved is included. Of course, we’ll reduce all that complexity to simple numbers – each player is trying to get the best outcome, in our game the highest number.

Let’s start with a simple game to become familiar with the notation: wooing gives 3 points to the other player. For example, wooing is simply liking the other person:

**Liking Game**

| | Bob woos | Bob neglects |
|---|---|---|
| **Alice woos** | A: 3, B: 3 | A: 0, B: 3 |
| **Alice neglects** | A: 3, B: 0 | A: 0, B: 0 |

In this game, neither player has a strong incentive to do anything: wooing is costless but it doesn’t help you personally. Alice and Bob both like to be liked, but can’t do much about it.
Let’s add a wrinkle: wooing gives 3 points to the other but costs 1 point to yourself. For example, wooing is telling the other person that you like them and asking them on a date. The costs are things like losing the option to ask out someone else and the effort of planning the date.

**Dater’s Dilemma**

| | Bob woos | Bob neglects |
|---|---|---|
| **Alice woos** | 2, 2 | -1, 3 |
| **Alice neglects** | 3, -1 | 0, 0 |

We’ve ended up in a tricky situation: each player prefers to neglect regardless of the other’s choice. Whether Alice woos or neglects, Bob will always get more by neglecting. Both players prefer mutual wooing to mutual neglect, but will end up in the bottom-right mutual-neglect square! This thorny predicament is the classic Prisoner’s Dilemma, in which both players’ self-interest prevents their cooperation.

The prisoner’s dilemma has been extensively researched for six decades, usually with the goal of finding a solution that makes both players cooperate (woo) for mutual benefit. These solutions fall into three broad classes:

1. Utilizing super-rational timeless decision theory. This requires either being a super-intelligent reasoner with access to the other player’s decision-making source code, or reading a 120-page PDF.
2. Enforcing a cooperation contract. If Alice and Bob can remove each other’s “neglect” option, they end up in woo/woo for lack of alternative. Common ways to try and achieve this are entering into holy matrimony or, for those that are really serious, changing your relationship status on Facebook. This is slightly less arduous than reading a long PDF, but may still not work for everybody.
3. Playing the “game” several times and rewarding the other player’s “woos” by using a tit-for-tat strategy. The simplest example of tit-for-tat is a promise to keep wooing as long as the other person does. Instead of analyzing a series of games we can fold that incentive into the outcomes of a single game.
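The inescapability of the Dater’s Dilemma can be brute-forced in a few lines (a sketch; the payoffs are the made-up numbers from the table):

```python
# Payoff matrices for the Dater's Dilemma: index 0 = woo, 1 = neglect.
# alice[a][b] is Alice's payoff when Alice plays a and Bob plays b;
# bob[a][b] is Bob's payoff in the same situation.
alice = [[2, -1],
         [3,  0]]
bob   = [[2,  3],
         [-1, 0]]

ACTIONS = ["woo", "neglect"]

# For every move of the opponent, find each player's best response.
for b in range(2):
    best_a = max(range(2), key=lambda a: alice[a][b])
    print(f"if Bob plays {ACTIONS[b]}, Alice's best response is {ACTIONS[best_a]}")
for a in range(2):
    best_b = max(range(2), key=lambda b: bob[a][b])
    print(f"if Alice plays {ACTIONS[a]}, Bob's best response is {ACTIONS[best_b]}")
```

Neglect is the best response to everything, so the only equilibrium is neglect/neglect, even though woo/woo (2, 2) beats it for both players.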
If mutual wooing is rewarded, it gives an extra 3 points to yourself because your partner will woo in the next round. Let’s see what the game looks like with 3 points for each player added to the woo/woo square:

**Stag Hunt**

| | Bob woos | Bob neglects |
|---|---|---|
| **Alice woos** | 5, 5 | -1, 3 |
| **Alice neglects** | 3, -1 | 0, 0 |

This game is called the Stag Hunt (the name comes from a scenario in which two hunters must cooperate to hunt a stag, not from a scenario where the hunters meet for a stag party). The new scenario changed an adversarial game into a game of cooperation: each player does best by matching what the other player does. If Alice woos, Bob gets an extra 2 points (5 instead of 3) from reciprocating the woo. If Alice neglects, Bob gains an extra 1 point by neglecting as well. Finally, our model is telling us some interesting things.

The first thing to note is that the stag hunt has a chance to end up in woo/woo only if both players get more from their wooing being reciprocated than from neglecting. For example, maybe Bob doesn’t like Alice that much, and her reciprocated affection is only worth 1 point to him instead of 3:

**One-Sided Stag**

| | Bob woos | Bob neglects |
|---|---|---|
| **Alice woos** | 5, 3 | -1, 3 |
| **Alice neglects** | 3, -1 | 0, 0 |

Even though Alice still wants mutual wooing, Bob can always do at least as well for himself by neglecting. As soon as he neglects, it makes sense for Alice to do the same, and the pair will end up in the bottom-right neglect/neglect square. No matter how strong Alice’s feelings are, it takes two to tango. The same will happen if the rewards for neglecting are high, for example if Alice has many suitors and wants to keep her dating options open.

Let’s go back to the original Stag Hunt. Alice tries to match Bob’s move, but what if she doesn’t know which move he’s making? Imagine the scenario: last night was Alice and Bob’s first date, pleasant but not breathtaking. Did Bob like her too? What if he did, but he has other dates set up? Will he text her to ask her out again?
What if someone told him that real men wait 3 days to text back?

To account for uncertainty, Alice can look at the game probabilistically. If Alice thinks that there’s a probability P that Bob will woo her, her expected outcomes for each action are as follows:

Outcome for wooing = P × 5 + (1-P) × (-1) = 6P – 1

Outcome for neglecting = P × 3 + (1-P) × 0 = 3P

Alice will prefer wooing if 6P – 1 > 3P, or in other words if P > 1/3. This doesn’t sound so bad: as long as Bob thinks there’s a 1 in 3 chance that Alice is waiting for him, and Alice thinks there’s a 1 in 3 chance he’ll eventually ask her out, both players will end up in the best situation.

The problem is that the threshold probability for a happily-ever-after is sensitive to even small changes in the payouts (ignore for the moment that the payouts are made up anyway). Let’s say that Alice and Bob both use a dating app that makes it extremely easy for them to set up another date with a good-looking stranger. Let’s assume that this gives them an extra 1 point for neglecting, since neglecting means going back to the endless well of OkCupid for a new match.

**Stag Hunt in the age of OkCupid**

| | Bob woos (probability P) | Bob neglects (probability 1-P) |
|---|---|---|
| **Alice woos** | 5, 5 | -1, 4 |
| **Alice neglects** | 4, -1 | 1, 1 |

Alice’s outcome for wooing is still 6P – 1, but now her outcome for neglecting is 3P + 1. To justify wooing, she needs 3P > 2, or P > 2/3 – a much higher bar! That’s the lament of dating in the age of OkCupid: the easier it is to get a first date, the harder it is to get to a fifth.

Personal note: a lot of people confuse this lesson with a different one, namely that the easier it is to find casual sex, the harder it is to find a lasting relationship, because guys will not commit to a woman when they can sleep around for free. This strikes me as utterly false; I have never lost respect or affection for anyone because they had sex with me. Quite the opposite!
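Back to the arithmetic: the two wooing thresholds above can be verified numerically (a sketch; the payoffs are the made-up numbers from the tables):

```python
# Alice's expected payoffs in the Stag Hunt as a function of
# P = probability that Bob woos.
def woo(p):
    return p * 5 + (1 - p) * (-1)             # 6P - 1

def neglect(p, bonus=0):
    # `bonus` = extra point for neglecting (the endless well of OkCupid)
    return p * (3 + bonus) + (1 - p) * bonus  # 3P, or 3P + 1 with the bonus

def threshold(bonus):
    """Smallest P (in steps of 0.001) at which wooing beats neglecting."""
    return next(p / 1000 for p in range(1001)
                if woo(p / 1000) > neglect(p / 1000, bonus))

print(threshold(0))  # plain Stag Hunt: wooing wins once P > 1/3
print(threshold(1))  # age of OkCupid: the bar rises to P > 2/3
```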
The familiar trope of dating as a contest of wills, in which the man wants sex and the woman wants a ring, turns romance back into a prisoner’s dilemma, and prisoner’s dilemmas rarely result in lasting happiness. In online dating it may be hard to tell apart the guys who are looking for long-term relationships (high P) from those who aren’t (low P), but I don’t think the guys themselves shift their preferences in response to the “market” that much.

What can Bob and Alice do to end up in the top left corner? One solution is to increase the reward for wooing: if Alice got 20 points from Bob’s affection, she would wait for his text even if she thought it unlikely to arrive. Whenever there’s a good game-theory equilibrium to be had, you know that the European pied flycatcher will take advantage of it: a sexy flycatcher male will mate with several females in different nests, while a less attractive male will attract a single female by giving her his undivided attention. Increasing your attractiveness is hard; not everyone has smooth back feathers, a skill in nest construction and a pitch-perfect mating song.

There’s an easier way to win at coordination games: pre-commitment. In most cases in life you want to keep your options open, but in coordination games (and even some adversarial games) the best move is to get rid of uncertainty by getting rid of some of your options. When I started dating in NYC I heard every possible piece of advice regarding the post-first-date text, from “if you don’t text in 30 minutes to check that she made it home OK she’ll know you don’t give a shit about her” to “anything less than a week makes you seem desperate”. The problem is, there’s no right answer that works for everyone. Some women will give up if I don’t text the same day and find someone else, and some will see it as breaking a norm if I do.
Here’s what I ended up saying at the end of every first date: “I had a great time tonight. I’m going to text you tomorrow at 8 pm, and if you’re into me we can set something up for next week.”

Here’s the game theory translation: I commit myself to playing “woo” for a day, so if you want to coordinate you can play “woo” without being afraid of uncertainty. Since you know exactly when to expect my text, if I don’t hear back from you tomorrow I’ll take that as a clear signal that you’re playing “neglect” and will switch to “neglect” from that point on. Both of us have a day to decide if we like each other, but I have eliminated any chance of our relationship failing because of unpredictability and bad coordination. It felt a little awkward the first time I did it, but the real awkwardness is stressing for days over what should be simple and fun – telling someone you like them. Ladies, there’s nothing at all about this strategy that wouldn’t work equally well for you.

This week a girl who saw my profile on OkCupid read this blog, so we had no choice but to have lunch and discuss romantic game theory for three hours. She said that she always texts within 24 hours if she doesn’t hear from the guy first. If the guy doesn’t like her, she just saved herself time. If the guy was just shy, she helped him out. And if the guy doesn’t like girls texting him first, he’s not the man for her anyway.

There’s a general theme here: cooperating with a partner is much easier than overpowering an opponent. In the end, everyone in your dating pool has the same goals, and every interaction from message to marriage should be seen as an opportunity to coordinate. Yet a lot of the dating advice you read treats it as an antagonistic competition, advising you to look for an edge and keep your cards close to your chest.

Next week I’ll explore a different foundation to build romance on: total honesty, total openness, total vulnerability. And BDSM.
## Martin Lotto King

Some housekeeping notes and figuring out what the jackpot has to be for a positive-value lottery.

The ~~criminal~~ luminary Rev. Dr. King has a lot to teach us, including to avoid the worst argument in the world. I was heartbroken to see this rampant racial segregation in America, on MLK Day no less!

I was informed by WordPress that some of the ad revenue on Putanumonit will begin to accrue to yours truly. I assume that beforehand your monetized clicks were simply drifting into the dark abyss. I reiterate that the financial goal of this blog is strictly to lose money; I will ~~drink away~~ donate to charity whatever ad-pennies accumulate.

Here’s the full, easy-to-browse archive of Putanumonit. I’m trying to get y’all to use the comments there for topic suggestions, so far to no avail.

Remember how the tails of a bell curve drop off much faster than you ever imagined (and that’s why China is bad at soccer)? Even Francis Galton, a founding father of statistics and sampling theory, didn’t fully grasp it.

SlateStarCodex breaks down another abuse of statistics by the media, this time by a right-wing source, to show political neutrality. Since I care more about meta-politics than about politics I don’t worry about the latter, and I also assume a priori that any article on a political site is probably using numbers to bullshit. If I ever come across a political article that uses sound, well-interpreted data to make an unbiased analysis, that would be newsworthy enough to write about. I’ll try to go after tougher targets: published scientific research and those in the media who should know better.

Speaking of 538, Walt Hickey wrote a couple of pieces on estimating the number of Powerball tickets sold. His original model (which made perfect sense given the data at the time) was of exponential growth, where doubling the jackpot would more than double the number of tickets.
Faster-than-linear growth would mean that there would be an optimum jackpot beyond which each participant’s winnings would decrease as the prize increases. However, the data don’t bear that out, with Wednesday’s Powerball topping out at 635 million tickets. That’s 3 tickets per adult in the United States, but still below exponential growth.

I fit a linear regression model of the number of tickets sold based on the size of the jackpot and the number of news articles about Powerball for each drawing. I have included all data since Powerball switched to $2 tickets in January 2012. Data from 1/2012-4/2013 are adjusted for the fact that California joined Powerball on 4/10/2013 and has since accounted for almost 11% of ticket sales.

The model gives a baseline of 7.5 million tickets for a minimal $40m prize (the actual minimum is around 10 million) and 170,000 extra tickets for every $1m in the pot. Every news article (which tend to materialize when the jackpot is the “biggest ever”) inspires 5 million extra tickets, or one extra winner per 57 news stories.

Based on these numbers, can a jackpot grow large enough to provide a positive expectancy on your money? Before I answer that, there’s another important point to examine.

In a lottery with 600 million tickets sold and 1 in 300 million odds of winning, we expect to have two winners on average. Imagine discovering that you have won the lottery: how many people do you expect to share the prize with? Many people can’t get over the intuition that there should be one more winner besides them, but that’s not the case: given that you won, you should expect two additional winners!

The math is straightforward assuming that the tickets are independently generated: finding out that you have won gives you no information about the other 599,999,999 tickets. Each ticket has a 1 in 300 million chance of winning, so two of them (1.999… if you’re nitpicking) will win on average.

Here’s another way to look at it: before finding out that you won, you didn’t know how many winners there were going to be. For all you knew, there could be 0, 2 or even 10. Once you know that you have won, you know for certain that there weren’t 0 winners – there’s at least you. In fact, knowing that you won makes worlds with more winners relatively more likely. This is because the more winners there are, the more likely you are to find yourself among them! Such is the awesome powa of Bayes’ Theorem. Here’s a primer on the theorem. The theorem itself is only useful for inconsequential things like learning anything at all based on evidence.

The charts above show the probability of Wednesday’s Powerball (635 million tickets, 1 in 292 million odds) having a certain number of winners. On the left is the number of winners we could expect before the drawing: 2.17 on average. On the right is the conditional probability of having k winners given that you won, and it indeed averages to 3.17. Isn’t the discrepancy paradoxical? While each of the (few) winners will adjust their estimate of the number of winners upward from 2.17 to 3.17, each of the (numerous) losers can perform a similar calculation and adjust the expected number of winners slightly downward, to 2.169999. A small chance of updating strongly upward is balanced by a huge chance of updating very slightly downward. Expected evidence is conserved, the paradox is avoided, and if you win you have to share.

So what would it take for a positive value Powerball? It would take a bit of algebra!

A $2 ticket grants a 1 in 292,201,338 chance of winning, so the expected first prize should be

$\$2 \times 292{,}201{,}338 = \$584{,}402{,}676$

Unfortunately, even without state tax you would pay federal income tax of 39.6%, which means that your expected winnings will have to total

$\frac{\$584{,}402{,}676}{1-0.396} = \$967{,}554{,}099$

You’ll expect to share the prize, so

$\$967{,}554{,}099 = \frac{\text{jackpot}}{1 + \#\text{winners}}$

Assuming all tickets are independent:

$\#\text{winners} = \frac{\#\text{tickets} - 1}{292{,}201{,}338}$

Finally, we’ll use my linear model for the expected number of tickets:

$\#\text{tickets} = 650{,}000 + 5{,}000{,}000 \times \#\text{news} + 0.17 \times \text{jackpot}$

The two mega-jackpots this month generated at least 100 news items (as counted by Google) between them, so we’ll assume 50 news stories for our presumed titanojackpot. Solving these equations gives a jackpot of $4.1 billion, which would provide break-even value for each of the 950 million tickets sold!
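Instead of grinding through the algebra by hand, you can iterate the equations to a fixed point (a sketch, using the numbers from my model):

```python
ODDS = 292_201_338
TAX = 0.396
NEWS = 50  # assumed news stories for the titanojackpot

def tickets(jackpot):
    """The linear ticket-sales model from above."""
    return 650_000 + 5_000_000 * NEWS + 0.17 * jackpot

def required_jackpot(jackpot):
    """Jackpot needed so a $2 ticket breaks even after tax and sharing."""
    share_pretax = 2 * ODDS / (1 - TAX)      # ≈ $967.6m per winner
    winners = (tickets(jackpot) - 1) / ODDS  # expected co-winners
    return share_pretax * (1 + winners)

# Bigger jackpots sell more tickets, which means more sharing, which
# demands a bigger jackpot... iterate until it settles down.
jackpot = 40_000_000
for _ in range(100):
    jackpot = required_jackpot(jackpot)

print(f"break-even jackpot ≈ ${jackpot / 1e9:.1f} billion")
print(f"tickets sold ≈ {tickets(jackpot) / 1e6:.0f} million")
```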

Don’t hold your breath. Assuming that the government reinvests 80% of revenue (it doesn’t) and that it won’t change the rules along the way (it will), the chances of a run of jackpots going from $40 million to $4.1 billion with no winner along the way are 1 in 160,000. That means that we can expect the first positive-value Powerball to happen in the year 15,692 AD! And when we get there, we’ll find that some have gotten there beforehand.

## Footballinear Socceregression

This is part 2 of a series about the statistics of global soccer performance, part 1 is here.

#### A Picture with a Thousand Words on It

The point of this week’s post was to get to this chart, showing how good each country is at soccer independent of population and region of the world:

You are encouraged to stare at it until the names blur and Denmark tangoes with Costa Rica across your retinas. It is either the best or the worst chart I have ever concocted; it took forever and gave me a migraine. Here’s the story of how I got to it.

#### Correlation Does not Imply India

Last week we explored the puzzle of the Chinese national soccer team sucking. Naive calculations showed why it’s surprising that there aren’t 11 players out of 1.3 billion Chinese (or 1.2 billion Indians) that can compete with Iceland. Iceland has 300,000 people and is made of ice (citation needed) which is hard to play soccer on. A better understanding of how the normal distribution works at the extreme tails demonstrated that country population should have a limited effect on national team success. Exactly how limited is that effect in the real world?

The nefarious mob that organizes global soccer is FIFA, which, for being a corrupt cabal of callous criminals, keeps a surprisingly neat database of scores and rankings. Each team accumulates a score based on their result in every game played, the magnitude of the stage and the  strength of the opponent. I added up each nation’s scores for the last 11 years (2005-2015) to smooth out fluctuations.

1. Spain – 6930 points

2. Germany – 6401

3. Argentina – 5860

6. Brazil – 5808

22. USA – 4158

65. Cape Verde Islands – 2661

85. China – 2112

152. Liechtenstein – 901

154. India – 887 , with a population of 34,000 Liechtensteins.

I pulled demographics numbers from the UN’s yearbook and painstakingly calculated just the population of men aged 15-40 for soccer team purposes since some countries skew much younger/maler than others.

There are 209 soccer associations in the world (each has an equal vote in FIFA elections regardless of population or contribution, which means bribing the Tahiti delegate gets you as many votes as the USA one, which is how Qatar hosts the 2022 world cup). I cut the bottom ranked 42 nations from consideration because there wasn’t good ranking or population data on most of them. Sincere apologies to my devoted readers in Guinea-Bissau, you just missed the cut of countries I care about.

Here’s a chart of country populations and cumulative soccer points:

The blue line is the regression between points and population, it shows a slight positive connection. Except that it’s a fake. The regression includes 165 countries but leaves out India and China. Since my goal is to use math to figure out why India and China suck at soccer, I can’t just shout “outliers!” and ignore them. If all 167 countries are included, the correlation between population and soccer rating is an astonishing -0.002, a perfectly flat line.

Correlation measures the tendency of two variables to move together, how much one goes up relative to its own average when the other one does. If two variables are completely independent from each other their true correlation is 0, but if you just simulate 167 values of two independent variables you’ll measure a sample correlation of about 0.05 due to random noise. That tiny noise correlation is 25 times stronger than what we got for population and soccer! The only field that has correlations this small is astrology.
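If you don’t believe that 167 independent data points typically show a spurious correlation around the 0.05 mark, simulate it (a sketch; the 0.062 average it spits out is in the same ballpark as the 0.05 quoted above):

```python
import random

def correlation(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(0)
trials = 2000

# Typical |correlation| between two completely independent variables,
# measured on samples of 167 "countries".
abs_rs = []
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(167)]
    ys = [random.gauss(0, 1) for _ in range(167)]
    abs_rs.append(abs(correlation(xs, ys)))

print(f"typical spurious |r|: {sum(abs_rs) / trials:.3f}")
```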

What is going on here?

Don’t know what correlation, regression, and multiple regression are? Then this little section is for you! If you know what those are, move right along.

Correlation tells you if things generally move in the same direction (+1), in opposite directions (-1) or somewhere in the middle. For example, fuel economy for cars has a negative correlation with weight, maybe -0.7. This means that in general, heavier cars are likely to burn through more fuel.

Regression measures the magnitude of the correlation in the units that each variable is measured in. For example, the regression coefficient of car MPG on weight in pounds is -0.01 MPG/lb. This means that every pound added to the weight of a car comes with a 0.01 lower MPG on average, so 500 pounds shave off 5 MPG.

Multiple regression measures regression for more than one variable at a time, holding the others constant. Let’s say we see that MPG is lower for heavier cars and for those with larger engines.  Multiple regression will tell you that adding 500 pounds by itself only subtracts 3 MPG when engine size is held constant, the other 2 MPG are simply due to heavier cars usually coming with larger engines which hurt fuel economy by themselves. Adding variables (like engine size) can completely change the size and even the sign of all other regression coefficients. Multiple regression also identifies variables that are significant (do a good job of predicting the outcome) and insignificant (those that don’t add anything once the other variables are known).

Regression is the basic tool of a lot of statistical analysis, prediction and machine learning. This Coursera class is one of many good places to learn it.
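The MPG story can be reproduced in a few lines. The numbers below are hypothetical, cooked so that the “true” effect of weight is exactly -0.006 MPG/lb; simple regression on weight alone mixes in the engine effect, just as described above:

```python
# Hypothetical car data: mpg depends on weight AND engine size,
# and heavier cars tend to have bigger engines.
weight = [2000, 2500, 3000, 3500, 4000]  # lbs
engine = [1.6,  2.0,  2.4,  3.0,  3.6]   # liters
mpg = [50 - 0.006 * w - 1.5 * e for w, e in zip(weight, engine)]

def centered(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

w, e, y = centered(weight), centered(engine), centered(mpg)
Sww = sum(a * a for a in w)
See = sum(a * a for a in e)
Swe = sum(a * b for a, b in zip(w, e))
Swy = sum(a * b for a, b in zip(w, y))
Sey = sum(a * b for a, b in zip(e, y))

# Simple regression: mpg on weight only. Weight gets "blamed" for the
# engine effect too, so the slope is steeper than the true -0.006.
print(f"simple slope: {Swy / Sww:.4f} MPG/lb")

# Multiple regression: solve the 2x2 normal equations for both
# predictors at once; the true coefficients are recovered exactly.
det = Sww * See - Swe ** 2
b_weight = (See * Swy - Swe * Sey) / det
b_engine = (Sww * Sey - Swe * Swy) / det
print(f"multiple: {b_weight:.4f} MPG/lb, {b_engine:.2f} MPG/liter")
```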

#### Six Corners of the World

As y’all know, correlation doesn’t imply causation. What’s often forgotten is that lack of correlation doesn’t imply lack of causation either. Implication is just hard, man. A variable like population could have its effect on soccer rating perfectly cancelled out by other variables, so that when the other variables aren’t accounted for, the correlation is 0.

Here’s an illustration of how that can happen that you can explain to your 5 year old nephew:

• X = number of testicles a human has
• Y = number of ovaries a human has
• Z = number of testicles+ovaries

If we measure across the human race, X and Z would have a correlation that’s very close to 0 since as X goes from 0 to 2 Z stays put at 2 for almost everyone. However, if you were born with 2 testicles, losing either or both of them will have an immediate and directly causal impact on your Z variable.

The average Qatari has 1.53 testicles and 0.47 ovaries; I still can’t get over that.

What’s the ovary to the testicles of country population? (This blog was an excuse to write that line). The first suspect is the region of the world. Asia has 6 of the world’s 10 largest countries, and also some of the ones least interested in soccer. FIFA divides all soccer federations into 6 regions that match geography with some weird exceptions: Europe (includes Israel and Kazakhstan), Asia (includes Australia), South America (only 10 countries), Rest of America (includes America), Oceania (includes Atlantis and the Mermaid Kingdom) and Africa. Teams get points for playing often and against strong opponents, so it’s much easier for a European country to accumulate points than it is for New Zealand, which faces a sparse yearly schedule of Tongas and Vanuatus.

We can tease out the effect of population as independent of region by performing the regression with region as an added variable. The verdict is that Europe and South America have much higher ratings than the rest of the world, while Asia and Oceania are lower. Holding region constant slightly increases the effect of population because the low rating of China and India is now partially accounted for by the Asia region. However, it’s still insignificant. Ovaries aren’t to blame.

#### Predictive Slovakiness

Regression measures linear effects: the change in the outcome (points) for a fixed-size change in a cause. The effect of population is clearly not linear: adding a million people to a tiny country does much more than adding them to India. Instead of using raw population numbers, we’ll go back to a normal distribution model in which population affects soccer level by pushing up the quality of the top players, as measured by their distance above the average. To go from population size to top-player level we’ll need to answer the following technical question:

How many standard deviations above the mean will the best 11 players out of a population of N be on average?

It’s possible to answer that question precisely, but I’ll use a good approximation instead: for a country of N people, I used as the score the point that has 5/N of the normal distribution above it. Slovakia has around N=1,000,000 men of soccer-playing age. The level that each Slovakian has exactly a 5-in-1,000,000 chance of being better than is the level 4.43 standard deviations above the mean. I took that number (4.43) as the best guess of the relative soccer level of the dudes who contributed to the Slovakian national team over the last decade.
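That cutoff can be reproduced with the standard library’s inverse normal CDF. With these rounded inputs I get 4.42; the small gap from the 4.43 in the text is presumably just rounding somewhere along the way.

```python
from statistics import NormalDist

N = 1_000_000                  # men of soccer-playing age in Slovakia (the post's figure)
tail = 5 / N                   # chance a random man is above the top-player level
z = NormalDist().inv_cdf(1 - tail)
print(round(z, 2))             # 4.42 standard deviations above the mean
```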

The justification for using that approximation as the measure of population effect is way too rigorous for most readers and way not rigorous enough for real mathematicians. It’s an interesting discussion if you’re into that kind of freaky stuff, but I’m relegating it to another post.

We can add the population-expected soccer level as a new variable to the regression to see if it does a better job of explaining the rating of each country. With the soccer level added, actual population size is no longer significant with or without China and India. This means that we can predict a country’s rating pretty well using just the expected level and the region. Here are the regression coefficients, significant in bold:

Intercept: -4748. The intercept is the starting point: the expected rating of a country that has all the other variables set to 0.

Population-expected level: 1474
Africa: 147
Asia: -517
Europe: 1529
North America: 508
South America: 1966

The range of countries’ ratings goes from 0-7000; the average score is 2440. Going one standard deviation up in the population-expected level is worth a huge boost of 1474 points, but one SD marks a big jump in population: from 10,000 to 600,000 people, or from 600,000 to 90 million. Each region’s coefficient measures how many points, on average, a team from that region is better than one from Oceania. Europe and Latin America are good at soccer and everyone else isn’t.

The regression can be thought of as a prediction: how good we can expect a country to be based just on the variables we look at, the expected level of its top players and its region. The predicted score for a country is calculated by multiplying the coefficient of each variable by the country’s score on that variable and summing everything up, starting from the intercept.

| Variable | Coefficient | Slovakia’s Score | Change in points |
| --- | --- | --- | --- |
| Intercept | -4748 | | -4748 |
| Pop-expected level | 1474 | 4.43 | 6532 |
| Europe | 1529 | 1 | 1529 |
| Total score expected from the regression | | | 3313 |

For example, Slovakia starts from the intercept of -4748 points (like everybody), gets 1474 points for each of its 4.43 population-levels, and another 1529 points for being in Europe. This comes out to an expected score of 3,313 points. Their actual rating is 3,566, so they outperform expectations by about 250 points; those points come either from random luck or from some variables that we didn’t account for yet, like the famous Slovakian warrior spirit. That’s good news, because otherwise Martin Skrtel would come and stomp on my face for disparaging his great footballing nation. 250 is also a pretty small prediction error compared to the total score, so the Slovariables do a good job of accounting for the Slovariance.
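Plugging in the published (rounded) coefficients reproduces the table’s arithmetic to within a couple of points:

```python
intercept = -4748
level_coef = 1474        # points per unit of population-expected level
europe = 1529            # Europe region bonus

slovakia_level = 4.43    # from the normal-distribution calculation above
predicted = intercept + level_coef * slovakia_level + europe
print(round(predicted))  # 3311 with rounded coefficients; the table's 3313 uses unrounded ones

actual = 3566
print(actual - predicted)  # Slovakia outperforms the prediction by ~250 points
```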

The chart I started with shows how good each country is relative to what we could expect from their region and size. Higher up countries outperform expectations (for reasons we don’t know yet) and those lower down disappoint:

So much fascinating stuff! What do Switzerland and Costa Rica have in common besides being fairly good at soccer with medium populations? Iran and Mexico have similar populations, similar ratings and similar green-white-red uniforms; are they actually the same country? Why does the tiny island of Malta languish while the tiny Cape Verde Islands kick ass? Is Qatar really good or do they bribe game officials? What makes Spain and Germany kings of soccer in the 21st century? Wait, someone actually wrote an excellent book about the last one.

All the answers (to life, not just soccer ability) are revealed in the riveting conclusion: The Rich, the Tall, and the Bees.