If we have data, let’s look at data. If all we have are opinions, let’s go with mine. – Jim Barksdale
This blog started with a mathematical analysis of what this blog should be about, and a promise to explain why p-values aren’t part of that analysis. 14 months later, I’m bringing Putanumonit full circle by doing a mathematical analysis of what this post should be about, and explaining in detail why p-values suck.
In case you were worried (or hopeful), bringing the blog full circle doesn’t mean it’s ending. I plan to keep going in a straight line after closing the circle, which should make the entire blog look like the digit 6.
A couple of weeks ago I asked my readers for some greatness-enhancing feedback via survey and email. A lot of the readers who emailed me think that they are much bigger math geeks than the other readers who emailed me, which is pretty funny. But one reader actually proved it by putting a num on it:
I am a huge math nerd, so my interests and knowledge when it comes to this sort of thing are probably at least four standard deviations from the mean of humanity, and at least two from your reader base. – Eliana
Eliana’s email is wonderful for two reasons:
- It says “the mean of humanity” where most people would say “most people”.
- It makes an explicit, testable prediction: that my readers are 2 standard deviations math-nerdier than the mean of the human species.
If math-nerdiness is normally distributed, at +2 SDs my average reader is math-nerdier than 97.7% of humanity. Did you know that math is teens’ favorite subject in high school, year in and year out? If normal humans admit (at least in anonymous surveys) to loving math, surely my readers are on board as well.
Yep. If I was worried before about making Putanumonit too math-nerdy, now I’m worried that it’s not going far enough for my readers. Last week I wrote a post about eating fish and my commenters responded with detailed analysis, math jokes, mental calculation tricks and extensions to famous math puzzles. I feel a bit inadequate; my favorite subjects in school were gym and French.
What’s the point?
The only metrics that entrepreneurs should invest energy in collecting are those that help them make decisions. Unfortunately, the majority of data available in off-the-shelf analytics packages are what I call Vanity Metrics. They might make you feel good, but they don’t offer clear guidance for what to do. – Eric Ries
There’s a point to the opening part of this post, and that’s to demonstrate that data analysis should have a point. The point of the survey above was to help me decide if I should write more or less mathematical detail in my posts, and it doesn’t take much analysis to see that the answer is “way more”. OK then, I certainly will.
We’ll get to some of the other survey results, but I first want to step back and take a broader view of data analysis. In that view, almost all data analysis you see is conducted with one of two goals in mind:
- Drive a decision.
- Get published in an academic journal.
You may have heard of the war raging between the armies of Bayesian and frequentist analysis. Like all global conflicts, this war reached its peak with a clever XKCD comic and its nadir with a sensationalistic New York Times story.
The skirmishes in this war today consist mainly of Bayesians showing frequentists why their methods (i.e. p-values) are either wrong or secretly Bayesian, and salty frequentists replying that the methods work just fine, thank you. In some sense they’re right: as long as academic journals are happy to accept papers based on p-values, getting a p-value is a great way of achieving that goal. It’s the first goal where frequentist analysis suffers, and a more mindful technique is needed.
Properly done, decision-driven analysis should work backwards from the question to the data:
Step 1 – Formulate the question or decision you want to answer with data.
Should I trim my posts to make them shorter or run with longer articles?
Step 2 – Quantify alternative possibilities, or possible answers to the question. Each alternative is a model of what the world looks like in the narrow realm you’re interested in. The goal of the data is to promote one of the alternatives above the rest, so each alternative should correspond to a decision option or an answer to the main question.
Going from the decision I want to make, I will change the length I aim for in my posts if every fourth additional reader will be pleased by it. Specifically:
Alternative 1 – I would write longer posts (decision) if 25% more of my readers prefer longer posts to shorter ones (possible world).
Alternative 2 – I would write shorter posts (decision) if 25% more of my readers prefer short posts to longer ones (possible world).
Alternative 3 – I would keep the same length of posts (decision) if about the same percentage of readers like and dislike long posts (possible world).
It’s very likely that the actual gap between lovers of long and short posts will fall somewhere between 0% and 25%, but I’m not interested in precisely estimating that gap. I’m only interested in making a decision, and since I have three decision options I can make do with only three alternatives. I will make the decision based on the alternative that seems most likely after everything is accounted for.
Step 3 – Put a number on how likely each alternative is. This should be based on your background knowledge, whatever information you have before the new data comes in. This is called the prior, and for some reason it makes more people uncomfortable than Richard Pryor. We’ll get to both later.
My most popular posts were on the longer side, but not overwhelmingly so. I think people probably like both the same, and maybe lean towards longer posts. My priors are 30% for Alternative 1 that longer is better (that’s what she said), 10% for Alternative 2 and 60% for Alternative 3.
Step 4 – Use the new evidence to update your prior probabilities in accordance with Bayes’ rule and arrive at the posterior, or the outcome. Make your decisions accordingly.
The verbal logic of Bayes’ rule is that whichever alternative gave the highest probability of seeing the evidence we actually observed is the one most supported by the evidence. “Support” means that the probability of the alternative increases from its prior. I’m going to demonstrate the use of Bayes’ rule, but I’m not going to explain it further. That’s because Arbital’s guide to Bayes’ rule does such an excellent job of explaining it to any level of skill and need that I don’t want to step on their turf. If you have any doubts, big or small, about what the rule says and why it works go ahead and spend some time on Arbital. I’m not going anywhere.
You’re back? Good. I assume you’re hip to Bayes now and we can jump straight into the calculation.
We have three competing hypotheses and prior probabilities on all three. I’m going to summarize the prior probabilities in odds form, because that’s the easiest form to apply Bayes’ rule to. Our odds ratios for Alternative 1 (longer is better):Alternative 2 (shorter):Alternative 3 (same) are 30%:10%:60%, or 3:1:6.
All we need now is the evidence. I have combined a couple of survey questions and their responses to get this basic breakout:
| Preference | Responses | Share |
|---|---|---|
| Longer is better | 131 | 34% |
| Shorter is better | 72 | 19% |
The gap between people who prefer longer posts to shorter ones is 34% – 19% = 15%. That wasn’t one of our three alternatives, but it’s definitely closer to some than to others. Ideally, instead of picking a few discrete possibilities we will have a prior and a posterior on a continuous distribution of some parameter (like the long-short gap). I’ll go through a continuous-parameter analysis in the next post.
We now need to calculate the probability of seeing this result given each of the three alternatives, P(131-72 long-short split | alternative N). I whipped up a very quick and dirty R simulation to do that, but with a couple of assumptions we can get that number in a single Excel line. To make life easy, we’ll assume that the number of people who show no preference is fixed at 183 out of the 386, so our alternatives boil down to how the other 203 readers are split.
Alternative 1 – There’s a 25% advantage to long posts, and 25% of 386 responders is 97. So this alternative says that out of the 203 who show a preference, 97 more will prefer long posts. This means that these 203 readers will be split 150 in favor of long, 53 in favor of short. 150/203 is 74%, so this alternative predicts that 74% of 203 prefer long posts.
Similarly, Alternative 2 says that 53 of 203, or 26% prefer long posts, and Alternative 3 says that 50% of them do (i.e. an equal number prefer long and short). Now we simply treat each responder as making an independent choice between long and short with a fixed probability of 74%, 50% or 26%. For each option, the chance of getting our observed 131-72 split is given by the binomial distribution which you can calculate in Excel or Google Sheets with the BINOMDIST function.
P(evidence | Alternative 1) = P(131 long – 72 short | 74% for long) = BINOMDIST(131,203,0.74,0) = .0007
P(evidence | Alternative 2) = P(131 long – 72 short | 26% for long) = BINOMDIST(131,203,0.26,0) ≃ 0, as far as Google Docs can calculate. Let’s call this ε.
P(evidence | Alternative 3) = P(131 long – 72 short | 50% for long) = BINOMDIST(131,203,0.5,0) = .0000097
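If you don’t have a spreadsheet handy, the three BINOMDIST calls are easy to reproduce with nothing but the Python standard library. A minimal sketch (the helper name is mine):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k successes out of n independent trials of probability p),
    the same quantity as BINOMDIST(k, n, p, 0)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of the observed split (131 of 203 prefer long) under each alternative:
p_alt1 = binom_pmf(131, 203, 0.74)  # ~0.0007
p_alt2 = binom_pmf(131, 203, 0.26)  # vanishingly small -- our epsilon
p_alt3 = binom_pmf(131, 203, 0.50)  # ~0.00001

print(p_alt1, p_alt2, p_alt3, p_alt1 / p_alt3)  # the last number is the ~72:1 ratio
```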
The likelihood ratios we get are 0.0007 : ε : 0.0000097 = 72:ε:1. We’ll deal with the ε after calculating the posterior odds.
Posterior odds for alternatives 1:2:3 = prior odds * likelihood odds = 3:1:6 * 72:ε:1 = 216:ε:6. We can go from odds back to probabilities by dividing the odds of each alternative by the total odds. 216 + 0 + 6 = 222 so:
Probability that a lot more people prefer long posts = 216/222 = 97%.
Probability that an equal number prefer long and short posts = 6/222 = 3%.
The probability that a lot more people prefer short is bounded by the Rule of Psi we formulated in the last post:
Rule of Psi (posterior edition) – A study of parapsychological ability to predict the future produced a p-value of 0.00000000012. That number is only meaningful if you have absolute confidence that the study was perfect, otherwise you need to consider your confidence outside the result itself, i.e. the probability that the study is useless. If you think that for example there’s an ε chance that the result is completely fake, that ε is roughly the floor on your posterior probabilities.
There’s at least a 0.1% chance that my data is useless. Either I formulated the questions wrong, or the plugin didn’t count them, or someone who loves long posts voted 200 times to confuse me. This means that we shouldn’t let the data push the probability of any alternative too far below 0.1%, so we’ll set that as the posterior for more people preferring short posts.
Our simple analysis led us to an actionable conclusion: there’s a 97% chance that the preference gap in favor of longer posts is closer to 25% than to 0%, so I shouldn’t hesitate to write longer posts.
What’s important to notice is that this decision is driven completely by the objective evidence, not by the prior. Let’s imagine that the prior had been pretty far from the truth: 10% for longer posts, 45% for shorter, 45% for equal. The posterior odds would be: 10:45:45 * 72:ε:1 = 720:ε:45. The evidence would take Alternative 1 from 10% to 720/765 = 94%. I would have had to give the true alternative a tiny prior probability (<1%) for the evidence to fail to promote it. 30%:10%:60% was a pretty unopinionated prior; I made all the probabilities non-extreme to let the evidence dictate the posterior.
The prior doesn’t have to be perfect, it just needs to give a non-tiny probability to the right answer. When using Bayesian inference, sufficient evidence will promote the correct answer unless your prior is extremely wrong.
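That robustness is easy to check numerically. Here’s a short self-contained Python sketch (my stand-in for the spreadsheet math; the helper names are mine) that runs the same odds-form update under both the original 3:1:6 prior and the badly misjudged 10:45:45 one:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k successes out of n trials of probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Likelihood of the observed 131-72 split under each alternative's share of "long" fans
likelihoods = [binom_pmf(131, 203, q) for q in (0.74, 0.26, 0.50)]

def posterior(prior_odds):
    """Bayes' rule in odds form: multiply prior odds by likelihoods, then normalize."""
    unnormalized = [pr * lk for pr, lk in zip(prior_odds, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

print(posterior([3, 1, 6]))     # original prior: Alternative 1 ends up around 97%
print(posterior([10, 45, 45]))  # badly wrong prior: Alternative 1 still around 94%
```

Note that the raw posterior for Alternative 2 comes out astronomically small, which is exactly why the Rule-of-Psi floor of 0.1% matters: the model’s arithmetic is more confident than the data deserves.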
Bottom line: the evidence tells me to write long posts, which the survey defined as having 3,000 words. I’m only 1,800 words in, so let’s take a second to talk about art.
The Art of Data Science
A lie is profanity. A lie is the worst thing in the world. Art is the ability to tell the truth. – Richard Pryor (I told you we’d get back to him)
I chose this quote not just because of Pryor’s very Bayesian surname, but also because I wanted to write the first essay ever that quotes both Richard Pryor and Roger Peng, two men who share little in common except for their initials and a respect for truth. Peng created the JHU data science specialization on Coursera (well explained and not very challenging) and wrote The Art of Data Science. Here’s how he explains the title of the book:
Much like songwriting (and computer programming, for that matter), it’s important to realize that data analysis is an art. It is not something yet that we can teach to a computer. Data analysts have many tools at their disposal, from linear regression to classification trees and even deep learning, and these tools have all been carefully taught to computers. But ultimately, a data analyst must find a way to assemble all of the tools and apply them to data to answer a relevant question—a question of interest to people.
I made several choices in the previous section that could all be made differently: picking the three alternatives I did, or using a simplified binomial distribution for the likelihood. I made these choices with the goal of trying to tell the truth. That involves getting close to the truth (which requires using good enough methods) and being able to tell it in a comprehensible way (which requires simplifying and making assumptions).
The procedure I outlined isn’t close to the Bayesian ideal of extracting maximum information from all available evidence. It’s not even up to industry standards, because unlike many of my readers I am not a professional data scientist (yet, growth mindset). But the procedure is easy to emulate, and I’m pretty sure the answer it gave me is true – people want longer posts.
And yet, this very useful procedure is only taught in the statistics departments of universities. Cross the lawn and you’ll find that most other departments, from psychology to geology, from the business school to the med school, teach something completely different: null hypothesis testing. This involves calculating a p-value to reject or accept the mythical “null hypothesis”. It’s a simple method: I learned it well enough to get a perfect score on the statistics final, and the following year I helped indoctrinate new students into null hypothesis testing as a teaching assistant. But since then I’ve spent a good while thinking about it, and I came to realize that null hypothesis testing is hopelessly inadequate for getting close to the truth.
The Null Method
Null hypothesis testing fails to divulge the truth for three main reasons:
- It asks “is the null hypothesis accepted or rejected?”, which is the wrong question.
- It calculates a p-value, which is the wrong answer.
- Both the null and the p-value aren’t even relevant to what the truth actually looks like.
These are some bold claims, so let’s inspect them through the example of a serious research study, and not just my piddling survey. Here’s a study about differences in intelligence between the first and second-born sibling. What does null hypothesis testing make of it?
Hypothesis testing only knows to ask one question: Is the null rejected? The answer to this is a resounding “yes”: Older siblings are smarter because the null hypothesis of equal IQ between siblings is rejected with a p-value of 0.00000001. So many zeroes! Hold on while I call my younger brother to inform him of my intellectual superiority.
But wait, isn’t a better question to ask: how much smarter are older siblings? The answer is 1.5 IQ points. That is smaller than the 4-6 point average difference between two IQ tests taken by the same person two years apart. It’s a mere fraction of the 30 point hit a person’s IQ takes the moment they start typing in a YouTube comment. The answer to the better question is: imperceptibly.
Another question could be: What are the chances that in a given family the older sibling has higher IQ? The answer to that is 52%, practically a coin flip. Any relevant information will tell you more about relative IQ than birth order, for example the fact that my brother studies chemistry and writes poetry, while I got an MBA and write about Pokemon.
So how did the study get such an impressive p-value? Because their sample size was 3,156. With such a huge sample size, you can get “statistical significance” with or without actual significance. Whatever your goal was in studying birth order and intelligence, answering “is the null hypothesis rejected?” is usually useless in achieving that goal.
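To see the mechanics of how a huge sample manufactures significance, here’s a toy version of the calculation. The numbers are my illustrative assumptions, not the study’s actual data: I take the measured 1.5-point gap at face value and assume the sibling IQ difference is roughly normal with a spread of about 21 points.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

effect = 1.5   # measured IQ advantage of the older sibling
sd = 21.0      # assumed spread of sibling IQ gaps (illustrative, not from the study)
n = 3156       # the study's sample size

# In a random family, the chance the older sibling scores higher:
p_older_smarter = normal_cdf(effect / sd)  # barely better than a coin flip

# But the standard error shrinks with sqrt(n), so the test statistic balloons:
z = effect / (sd / sqrt(n))
p_value = 2 * (1 - normal_cdf(z))  # tiny, despite the negligible effect

print(p_older_smarter, p_value)
```

With these assumptions the per-family probability lands near a coin flip while the p-value dives below 0.001; quadruple the sample and the p-value shrinks further without the effect becoming any more meaningful.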
The answer you get, a p-value, is pretty meaningless as well. Does a p-value of 5% mean that you’re wrong in rejecting the null 5% of the time? Nope! You’re going to be wrong somewhere between 30% and 80% of the time, depending on the power of the study. Does it even mean that if the null is true you’ll see the same result 5% of the time? Nope again! That’s because the p-value is calculated not as the chance of seeing the actual effect you measured, but as the chance of getting a result at least as big (technically, as far from the null) as the measured effect.
Calculating anything for the group of “effects at least as big as the measured effect” is absurd for two reasons:
- This group isn’t representative of the measured effect because the measured effect is an edge case within that group. In the IQ example, the p-value is the chance, given that the null is true and also given a bunch of assumptions, that the older sibling is smarter by 1.5 points, or 2 points, or 27 points, or 1500 points. It’s a huge range of outcomes, most of which are a lot farther away from 1.5 than 1.5 is from 0 (the null). If we assumed 0 and got 1.5, why are we calculating the probability of getting 27? Moreover, the assumptions that underlie the p-value calculation (such as a normal distribution) are very hard to justify over that huge range of 1.5 to infinity. Many distributions resemble a normal bell curve close to the middle but look very different in the tails. The “range of IQ differences bigger than 1.5” (the right tail) looks little like the data we actually have, which is “an IQ difference of exactly 1.5”.
- This group for the most part doesn’t even include the actual effect, which is usually smaller than the measured effect. If you measured a positive effect, the measurement error was likely in that positive direction as well. When we subtract the error from the measured result to get the true result, we get a small number which is completely outside the range of the bigger numbers we measured the p-value on.
The lower the power of the experiment, the worse this problem gets, and power is often pretty low. With the graph below Andrew Gelman shows just how bad this gets when the statistical power = 0.06, which is what the “Ovulating married women vote Republican” study had:
This is why in the likelihood calculation I used the precise result of “exactly 131 people prefer long posts”, not “131 or above”. P-value calculations don’t work with point estimates, but “or above” ranges will break even likelihood calculations.
Let’s try this on the survey results. If we tried null-rejection testing, whether I chose “50% prefer long posts” or “74% prefer long posts” as my null hypothesis, the actual result of “131/203 prefer long posts” would reject the null with p<0.002. Again, rejecting the null doesn’t tell you much.
When I calculated the likelihood given by “131 prefer long”, I got a ratio of 72:1. If instead I had calculated the likelihood based on “131 or more prefer long”, the likelihood ratio in favor of Alternative 1 would have been 47,000:1. Here’s the fun part: if I had looked at “131 or below”, the likelihood ratio would have been 1:1730 in favor of the opposite alternative!
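Here’s a Python sketch of that comparison (the helper names are mine), computing the likelihood ratio of Alternative 1 to Alternative 3 three ways: from the exact count of 131, from “131 or more”, and from “131 or fewer”:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k successes out of n trials of probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_tail(k, n, p):
    """P(k or more successes)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

n = 203
# Likelihood ratios of Alternative 1 (74% prefer long) vs Alternative 3 (50%):
point = binom_pmf(131, n, 0.74) / binom_pmf(131, n, 0.50)    # exactly 131: ~72:1
above = binom_tail(131, n, 0.74) / binom_tail(131, n, 0.50)  # "131 or more": tens of thousands to 1
below = (1 - binom_tail(132, n, 0.74)) / (1 - binom_tail(132, n, 0.50))
# "131 or fewer": a tiny fraction, i.e. huge odds in the OTHER direction

print(point, above, below)
```

Same data, three wildly different verdicts, depending only on which direction you extend the range.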
Choosing a direction for the range of outcomes can arbitrarily support any hypothesis you come up with, or reject it. But that’s exactly what having a single null hypothesis does, it defines an arbitrary direction: if the measurement landed above it you look at that result or above, and vice versa. That to me is the main issue with null hypothesis testing – a single hypothesis with no alternatives can’t bring you to the actual answer.
When you have several alternatives, gathering more evidence always brings you closer to the truth. “25% more people like long posts” isn’t the exact truth, but it’s probably closer than the other two options. When you have a continuous distribution of alternatives, you do even better as the evidence zeroes the posterior in on the real answer.
But if you just had one null and you rejected it, your conclusions depend more on the choice of null (which can be quite arbitrary) than on what the evidence tells you. Your company released a new product and it sold 10,000 copies, is that good or bad? That entirely depends on whether your null hypothesis was 5,000 copies or 20,000. Should the null hypothesis be the average sales for any product your company makes? But this product had a smaller niche, a larger advertising budget, and twice the R&D cost. It makes no sense to compare it to the average.
The best you can do is to incorporate all available information about the customers, the price, the advertising and the cost. Then, use all those factors to calculate a predicted sales volume, along with a range of uncertainty around it. But that’s not a null hypothesis anymore, that’s the prior, the one that frequentists don’t like because it feels subjective.
Null-worship isn’t a strawman, it happens to Nobel-caliber scientists. If you remember my essay about hot hand shooting, Tom Gilovich and Amos Tversky hypothesized that basketball players shoot 0% better on their second free throw, and made that the null hypothesis because 0 is a nice round number that pretends to be objective. A bunch of random fans estimated the difference to be 8%. The actual answer came out to 4%, and the scientists concluded that there’s no improvement because their sample was too small for the null to be rejected at p<0.05!
In absolute terms, both fans and scientists were off by 4%. In relative terms, the fans overestimated the effect by a factor of 2, while the scientists underestimated the effect by a factor of infinity. Gilovich and Tversky didn’t declare victory because they were closer to the truth, but because they got to pick the null.
Is there any use to null hypothesis testing at all? Yes, but it’s the opposite of how most people use it. In common practice, if someone rejects the null, they declare victory and publish. If they don’t reject the null, they p-hack until the null is rejected and then declare victory and publish.
Instead, null hypothesis testing can be a quick check to see if your data is worthwhile. If you fail to reject the null, your data isn’t good enough. There’s too much noise and too little signal, you need better data or just a lot more of it. If you do reject the null, your data might be good enough to yield an answer once you throw out the p-value and start doing some actual statistical inference.
This post was nerdy, and it was long, and it’s time to wrap it up.
Data analysis is both an art and a science. There’s no single recipe for doing it right, but there are many ways to do it wrong, and a lot of the latter ways involve p-values. When the question falls to you, stick to a simple rulebook:
Rule 1 – Focus on the ultimate question you’re answering, the one that drives the decision. For every step in the analysis, ask yourself – is this step helping me answer the ultimate question?
Rule 2 – Before you see the data, come up with alternative hypotheses and predictions. Base these predictions on your knowledge of the issue, not on objective-looking round numbers.
Rule 3 –
Featured image credit: Shadi Yousefian