I’m struggling a bit with the next post in the dating sequence. I can’t put any numbers in it, only some pretty personal stories and questionable ideas that touch on touchy subjects. I’m really not sure I should even write it. That leaves me with two options:
- Not write it.
- Before I write it, scare my entire readership away with a 4,000-word mathematical geek-out about a 30-year-old research paper, one that’s going to include a Monte Carlo simulation of 320,000 jump shots.
If you know me at all, you know this isn’t much of a dilemma.
This post is tagged rationality because it deals with biases in decision making. It’s also tagged defense against the dark arts: we’re going to see how even good research papers get statistics wrong sometimes, and how smart statistical skepticism when scrutinizing science studies can save your skin.
Who Has the Hot Hand?
Steph Curry hits a 3-pointer, the crowd cheers. The next trip down the floor he hits another, from the corner. The buzz in the building rises in pitch; another shot with a hand in his face… swish! The crowd is standing now, screaming, everyone feels that Curry is on fire. “The basket is as big as a barn door to Steph now,” the announcer is giddy, “he can’t miss!” Curry catches the ball at the top of the arc and releases another three-pointer, the ball curves smoothly through the air. There’s no way he’s going to miss, is there?
The hot-hand fallacy is the intuitive tendency to assume that people will continue to succeed after a row of successes, even when success and failure come from a random process. It was described by Gilovich, Vallone and Tversky in 1985, who concluded that there’s no evidence for “hot hands” or “streak shooting” actually happening in basketball. They classified the belief as a fallacious heuristic – a mistaken judgment. The publication followed a stunning string of groundbreaking papers on heuristics and biases by Amos Tversky, whose work with Daniel Kahneman later earned a Nobel Prize. I guess everyone just kinda assumed that Tversky’s hot streak must keep going and he couldn’t be wrong on this one.
Gamblers vs. Streakers
The first weird thing about the hot-hand fallacy (HHF) is that it’s supposed to be the same thing as the gambler’s fallacy (GF), except with precisely the opposite outcome. Gambler’s fallasticians believe that a coin that landed heads several times in a row is more likely to land on tails on the next flip because it’s “due”. GF really victimized the roulette players who kept betting on red when the wheel landed on black 26 times in a row in Monte Carlo in 1913. Wouldn’t it have actually made sense for these gamblers to switch to “hot hand betting” and double down on black? Besides being the intuitive thing to do after seeing 20 blacks in a row, pure Bayesian rationality would also seem to point that way. My prior on the roulette wheel being rigged toward black is not large, but it’s larger than 1 in 1,000,000, and 1 in a million is more than the probability of seeing 20 spins land on the same color in a row with a fair wheel. After 20 blacks, I wouldn’t be going back.
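The back-of-the-envelope Bayes here is easy to check in a few lines. A sketch, with one assumption of my own: I give the “rigged toward black” hypothesis a prior of 1 in 100,000 (comfortably larger than 1 in a million).

```python
# Chance of 20 same-color spins in a row on a fair American wheel
# (18 black pockets out of 38)
p_fair = (18 / 38) ** 20

# Assumed prior that the wheel is rigged toward black (my number, not the post's)
p_rigged = 1e-5

# Posterior odds of rigged vs. fair after 20 blacks, assuming a rigged
# wheel lands black every time
posterior_odds = (p_rigged * 1.0) / ((1 - p_rigged) * p_fair)

print(f"P(20 blacks | fair wheel) = {p_fair:.2e}")        # ~3e-07
print(f"posterior odds, rigged:fair = {posterior_odds:.0f}:1")
```

Even with that modest prior, the posterior comes out in favor of the rigged wheel, which is why betting against black after 20 blacks is hard to justify.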
Supposedly, both GF and HHF come from the representativeness heuristic: people don’t believe a 50-50 variable should have long streaks of one outcome because heads-heads-heads-heads-heads doesn’t look like a fair coin. If the coin is known to be fair, people believe that the next flip “must” land tails to make the streak more “fair like” (GF). If the “coin” is actually Steph Curry, people decide he’s not actually 50-50 to hit his next shot but rather that he morphed into a 90% shooter (HHF). So: GF for objects and HHF for people.
But wait, do people expect Steph to continue hitting 90% for the rest of his career? Of course not! After a single miss everyone will assume he’s back to 50-50. I’m still not sure how, according to the theory, people decide if they’re going to HHF and assume a Curry basket or GF and assume he’s due for a brick. Maybe they flip a fair coin to make up their mind.
Croson and Sundali (2005) figured out how to expense a fun weekend to their research grant and took a trip to a Nevada casino to see the fallacies in action. Gambler’s fallacy with actual gamblers is a slam dunk: after six spins on the same color, 85% of roulette players bet the other way. The sample size isn’t huge, but the data seems unequivocal.
The evidence for hot hand abuse is a bit slimmer.
Croson and Sundali: Of our 139 subjects, 80% (111) quit playing after losing on a spin while only 20% (28) quit after winning. This behavior is consistent with the hot hand; after a win players are likely to keep playing (because they’re hot).
The information above is utterly useless without knowing the base rate of winning a spin: if only 10% of spins won, it would mean that people quit more after winning! I figured out that number from other data in the paper: it’s 33% (1024 / 3119). Also, 100% of the quits forced by running out of chips happen after a losing spin. How many of the people who quit did so because they lost their last chip? If it’s more than 13% (the gap between the 67% of spins that lose and the 80% of quits that follow a loss), the conclusion isn’t just annulled, but reversed.
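The arithmetic is quick to verify from the paper’s own numbers:

```python
# Numbers quoted from Croson & Sundali (2005)
spins_won, spins_total = 1024, 3119
quit_after_loss, quit_after_win = 111, 28

loss_rate = 1 - spins_won / spins_total  # ~67% of spins lose
# Share of quits that followed a loss
observed = quit_after_loss / (quit_after_loss + quit_after_win)  # ~80%

# The gap that broke players (who can ONLY quit after a loss) would need to explain
gap = observed - loss_rate
print(f"expected share of quits after a loss: {loss_rate:.0%}")
print(f"observed share: {observed:.0%}, gap: {gap:.0%}")
```

If quitting were independent of the last spin, about 67% of quits would follow losses just by chance; the 13-point excess is all the “hot hand” evidence there is.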
Spoiler alert: research papers that contain the data of their own refutation are going to be a theme today. Stay tuned.
The main support for the hot hand effect comes from this regression on the right. People who just won place 1 additional bet, 12 on average instead of 11, when controlling for bets placed on the first and previous spins. The researchers couldn’t directly observe the amounts, just the number of different bets, and used that as a proxy. Obvious counterargument is obvious, and Croson and Sundali make it themselves:
There are alternative explanations for these behaviors. For example, wealth effects or house money effects might cause an increase in betting after a win. In our empirical data we will not be able to distinguish between these alternative explanations although previous lab experiments have done so.
When people are doing something weird, I often prefer to assume irrationality rather than conjuring convoluted utility functions to explain away the behavior (can I call that the Economist’s Fallacy?) In this case, however, putting one extra bet after having won seems quite reasonable. That’s how you would bet if, for example, you were trying to manage risk in order to play for a fixed number of spins and then get lunch. Whatever you think of the rationality level of people spending their day at the roulette table, it’s really hard to see much hot-hand fallacy there.
Shooters Gonna Shoot
Bocskocsky, Ezekowitz and Stein (2014) couldn’t get a trip to Vegas approved, so they went back to basketball armed with some great data that wasn’t dreamt of in 1985: cameras that track each player and the ball every second of the game. Without the distraction of Vegas, the researchers first developed a full model of shot difficulty that incorporates everything from the angle of the nearest defender to the time remaining in the game.
Armed with that model, the paper shows that “hot” players take slightly more difficult shots (e.g. from further away and in the face of tighter defense). Controlling for difficulty, players do shoot better after a few makes but not enough to make up for the increased difficulty. However, if a player who just made two shots takes another one of the same difficulty fans are right to expect a 2.4% better chance of sinking the shot compared to a player coming off two misses.
The bottom line is, the study is excellent but not very exciting. It concludes that the hot-hand isn’t a fallacy while also calculating that players are better off shooting less after a streak. Fans are justified to expect a hot streak to continue if they adjust for shot difficulty, but not if they don’t.
Finally, the Andrew Gelman-shaped Angel of Statistical Skepticism (ASS) on my shoulder reminds me to watch out for beautiful gardens that hide many forking paths. Even being completely honest and scrupulous, the researchers have a lot of small choices to make in a research project like this. Which of the dozens of variables to include in assessing shot difficulty? Which measures of “heat” to focus on? Which parameters to include in the regression? Every choice makes perfect sense in the moment, but the fact is that those choices were available. A slightly different data set could have pushed the researchers towards doing a slightly different analysis that would’ve found statistical significance for some other result. A tiny effect size plus a multitude of “researcher degrees of freedom” make me think that the 1% p-value on the main finding is probably no better than a 5%, and 5% p-values are wrong at least 30% of the time.
I think that Bocskocsky, Ezekowitz and Stein did a great job and I certainly don’t believe they were in any way dishonest, but I’d be very happy to bet at 100-1 odds that their 1% p-value will not replicate.
The Hot Hand Bias Bias
Why even spend hours fitting models to data when you can do some arithmetic and turn other people’s data against itself? Miller and Sanjurjo (2015) did just that, and almost made $200,000 off a hedge fund guy while at it.
Miller and Sanjurjo noticed that even for a perfectly random variable, any limited sequence of observations is likely to show an anti-hot-hand bias. This confounds attempts to detect hot hands and contributes to the gambler’s fallacy. For illustration, let’s look at sequences of 3 basketball shots. We assume that every player hits 50% of their shots, so each one of the 8 sequences is equiprobable. For each sequence (imagine 8 players, one per sequence), we’ll calculate the percent of made baskets followed by another make.
We assumed that every player hits 50% of their shots no matter what, but somehow the average player makes 41.7% of their shots after a made basket! The discrepancy comes from the fact that 50% is averaged across all shots, but 41.7% is averaged across all players. Changing the aggregation or averaging level of your data can not only mess up your finding, but also flip it and reverse it.
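The 41.7% figure can be reproduced by enumerating the 3-shot case directly:

```python
from itertools import product

# All 8 equally likely sequences of 3 shots (1 = make, 0 = miss)
rates = []
for seq in product((0, 1), repeat=3):
    # Shots that immediately follow a made shot
    follows = [seq[i + 1] for i in range(2) if seq[i] == 1]
    if follows:  # e.g. (0, 0, 1) has no shot after a make -- skip it
        rates.append(sum(follows) / len(follows))

# Average across sequences ("players"), not across individual shots
avg = sum(rates) / len(rates)
print(avg)  # 0.41666... -- not 0.5
```

Six of the eight sequences contain a shot following a make, and their per-sequence rates (1, ½, 0, 0, 1, 0) average to 5/12 ≈ 41.7%, even though every individual shot is a coin flip.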
If you bet against streaks continuing on the roulette, you will win most days but on the few days you lose, you’ll lose a lot. If, like Gilovich and Tversky, you look at a lot of basketball players, most players will appear to shoot worse after a streak but the few that shoot better will shoot much better. That better percentage will also continue over more shots since those players will have more and longer streaks.
Gilovich let 26 basketball players from the Cornell varsity teams shoot uncontested jump shots from a distance at which each player shoots 50%. He found an insignificant 4% increase in shooting after 3 makes vs. after 3 misses. Miller and Sanjurjo apply their correction to the original 1985 data and calculate an implied difference of 13%!
The only question is, why apply corrections to poorly aggregated data when we can just change the aggregation level directly?
Data of Their Own Demise
To their credit, Gilovich, Vallone and Tversky not only went out to the gym with the varsity teams (can you imagine Calipari’s Wildcats participating in a statistics study?) but also provided the full data of their observations and not just the percentages:
As we saw, averaging across all players finds a gap of 4% (49% vs. 45%) in shooting after a hot streak vs. a cold streak. The numbers in parentheses are the actual shots taken; using these along with the shooting percentages allowed me to reverse-engineer the data and calculate total makes and misses after streaks:
- After 3 misses: 161 out of 400 shots = 40%.
- After 3 makes: 179 out of 313 shots = 57%.
That 17% is a humongous difference, equal to the gap in 2-point shooting between the second-best player in the NBA this season and the fourth-worst. The difference disappears in the original study because of aggregation levels. When you aggregate by players, the super-streaky Male #9 (48% gap) counts the same as his consistent friend, Male #8 (7%). However, dude #9 took four times as many post-streak shots as his buddy; when that data counts four times as much, the shooting gap emerges clear as day.
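A toy example makes the aggregation trap concrete. The numbers below are made up for illustration (they are not the Cornell data), but the structure is the same: a streaky high-volume player and a consistent low-volume one.

```python
# Hypothetical post-streak records for two players (made-up numbers,
# NOT the Cornell data): (makes after a 3-make streak, attempts)
players = {"streaky #9": (90, 150), "consistent #8": (10, 25)}

# Averaging each player's percentage first, as the 1985 paper does
per_player = sum(m / a for m, a in players.values()) / len(players)

# Pooling all shots, weighting each shot equally
pooled = sum(m for m, _ in players.values()) / sum(a for _, a in players.values())

print(f"per-player mean: {per_player:.1%}")  # 50.0%
print(f"pooled:          {pooled:.1%}")      # 57.1%
```

Same data, two aggregation levels, a seven-point swing: the streaky player’s 150 post-streak attempts count six times as much in the pooled number but exactly as much as 25 attempts in the per-player average.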
Gilovich also looks at free throw shooting data by the Celtics and again goes to considerable lengths to avoid seeing evidence of hot-hand shooting:
Gilovich starts by asking a bunch of supposedly ignorant and biased basketball fans to estimate the shooting percentage of a 70% average free throw shooter after a make and a miss. They estimate an average gap of 8%: 74% vs. 66%. Instead of looking at the gap directly, Gilovich calculates a correlation for each player, finds that none of them are significant, and happily proclaims that “These data provide no evidence that the outcome of the second free throw is influenced by the outcome of the first free throw” (Gilovich et al. 1985).
If you ask me, the evidence that the data provide is that players hit 428/576 shots after a miss (74.3%) and 1162/1473 after a make (78.9%) for a nice 4.6% gap.
Oh no, Gilovich objects, not so fast: “Aggregating data across players is inappropriate in this case because good shooters are more likely to make their first shot than poor shooters. Consequently, the good shooters contribute more observations to P (hit/hit) than to P (hit/miss) while the poor shooters do the opposite, thereby biasing the pooled estimates” (Gilovich et al. 1985).
Good point there, Dr. Gilovich, but remember that you asked the fans about 70% shooters specifically. We can avoid the good shooter/bad shooter bias by grouping players with identical FT%. As fate would have it, Paris, Ford, McHale and Carr all shoot between 70.5% and 71.2%: almost identical and close to 70% (I calculated each player’s exact shooting data from the number of shots and percentages in the table). These four players shoot 3.2% better after a make than after a miss.
Is a 3-4% gap significant? Who cares, the word “significant” is insignificant. A pernicious mistake that scientists constantly make is assuming that every rejection of the null is confirmation for the alternative. The fact that the data is unlikely under the null hypothesis doesn’t mean it’s any likelier under some other model. Here, Gilovich et al. make the flipped mistake: assuming that failure to reject the null hypothesis (0% gap after a make) confirms the null is true. However, the naive alternative (fan estimate) was an 8% gap. You can calculate p-values from now till the Sixers win, it doesn’t change the fact that 4% is as close to 8% as it is to 0%. The kind of statistical malpractice where a 4% result rejects the 8% hypothesis and confirms the 0% one is why some Bayesians react to frequentists with incandescent rage.
Rage aside, I’m left with a dilemma. On the one hand, disagreeing with Amos Tversky probably means that I’m not so smart. On the other hand, the Cornell students shot 17% better after a streak of makes and Tversky’s friends concluded “no effect”. Screw it, argument screens off authority: The hot-hand fallacy is dead, long live the hot hand!
The Streak is the Signal
Summary so far: the research paper claiming that hot-hand shooting exists finds a 2% improvement in shooting after a streak; the research paper claiming that hot hands are bullshit finds gaps between 3% and 17%. Science FTW!!!
Even if the data were straightforward, it would still be just correlations and regressions. Without a plausible mechanism to explain the effect, I trust it only as far as I can throw it. So why does hitting shots make you hit more shots? The announcers usually babble something about confidence or “being in the zone”, but I can’t throw announcers very far and I trust their analysis even less. If you’ve seen Steph Curry or Larry Bird shoot, you wouldn’t doubt that they’re 100% confident in every single shot they take.
It turns out there’s a remarkably simple answer that accounts for the hot-hand effect: all you need is a player having a priori different shooting percentages in different games. The simplest model assumes that each shot a player takes has the same odds of going in, but what if a player has games where something makes his shooting percentage higher or lower independently of streaks?
Let’s look at Kevin Durant, a dude who’s pretty good at shooting basketballs. He takes 20 shots a game and makes 50% of them over a season. In a specific game, however, Durant may have defensive player of the year Kawhi Leonard inside his shirt and shoot 32%. The next game, he’s guarded by octogenarian Kobe Bryant and something called Larry Nance Jr., and he shoots 78%. Even if we assume that Durant’s shooting percentage doesn’t change throughout the game, in games where he shoots a higher percentage he’ll also get more streaks, and more attempts at shot-after-streak.
To see this in action, I simulated 1,000 games for Durant and counted the shots made and missed after 3 hits in a row. I simulated 20 shots in each game, but in 500 of them his shooting percentage is set to 60% and in the other 500 it’s set to 40%.
| Game FG% | # of Games | Streaks of 3 | Make after streak | Miss after streak | After-streak FG% |
|---|---|---|---|---|---|
| 60% | 500 | 1872 | 1124 (60%) | 748 (40%) | 60% |
| 40% | 500 | 551 | 209 (38%) | 342 (62%) | 38% |
| Total | 1000 | 2423 | 1333 (55%) | 1090 (45%) | 1333 / 2423 = 55% |
The chance of making a shot after a streak is either 60% or 40% depending on the game, but more than three quarters of the streaks happen in the 60% games. Every shot made after a streak gives another opportunity for a hot-hand shot, in a couple of the simulated games Durant makes 9 or 10 shots in a row! Because of that, even though his overall shooting percentage is exactly 50%, Durant’s shooting percentage after a streak is 55%. The fans are justified to expect a hot hand after 3 makes: the streak doesn’t cause the higher scoring chance, but it sends a signal that Durant is having a high FG% game. We have a perfect explanation for hot hands without any (hot) hand waving about “confidence” and “zones”.
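Here’s a minimal version of that simulation. It’s seeded for reproducibility, so the exact counts won’t match the run behind the table above, but the after-streak percentage lands near 55% rather than 50%:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

makes = misses = 0
for game in range(1000):
    p = 0.60 if game < 500 else 0.40   # 500 "hot" games, 500 "cold" games
    streak = 0
    for _ in range(20):                # 20 shots per game
        hit = random.random() < p
        if streak >= 3:                # shot attempted right after 3 straight makes
            if hit:
                makes += 1
            else:
                misses += 1
        streak = streak + 1 if hit else 0

print(f"shots after a 3-make streak: {makes + misses}")
print(f"after-streak FG%: {makes / (makes + misses):.1%}")
```

The mechanism is visible in the code: nothing about the streak changes `p`, yet most after-streak shots occur in the 60% games, so the pooled after-streak percentage exceeds the 50% season average.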
Deviations of Deviations
The “variable shooting” theory is simple, elegant and explains the hot-hand shooting gap perfectly. Researchers take note: if you have a beautiful theory, don’t risk it by exposing it to ugly data! Oh, what the hell, I’m not getting paid for this anyway.
We can’t directly tell from someone’s shooting success what the underlying percentage was in a particular game, and we’re looking for evidence that the underlying percentage actually differs from one game to another. A consistent 50% shooter (no variability in underlying percentage) will still hit 3/9 on a bad day or 12/16 on a lucky outing. However, he’ll have fewer games where he shoots a number very different from 50% than someone who alternates games with 60% and 40% underlying probabilities. We can find indirect evidence for game-to-game fluctuations by looking at how variable the game-to-game actual shooting percentage is. The higher the observed variance, the more evidence it shows for underlying variance. The question is, how much higher should it be?
A player’s field goals in a game follow a Binomial Distribution with parameters n = number of shots and p = underlying FG% for that game. The variance of the binomial count is np(1 − p), so the variance of the actual shooting percentage outcome is p(1 − p)/n.
The leader in 2-point field goal attempts last season is LaMarcus Aldridge, who made 47.5% of his shots and took 18.45 attempts per game. If his underlying FG% was always 47.5%, the variance in his shooting percentage would be 0.475 × 0.525 / 18.45 ≈ 0.0135.
The standard deviation we would see over a season would be √0.0135 ≈ 11.7%.
If instead of a steady 47.5% LaMarcus shoots either 10% above or below that number (57.5% in half his games and 37.5% in the other half), the variance would increase by 0.1² = 0.01 and the standard deviation would increase from 11.7% to 15.3%. More reasonably, if he deviates from his season average FG% by 5%, the variance would increase by 0.05² = 0.0025 and the standard deviation by 1%, from 11.7% to 12.7%. That 5% game-to-game difference should be enough to create the 2% hot-hand improvement found by Bocskocsky et al.
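These figures are quick to check; independent game-to-game swings add their variance to the binomial variance (the 11.6% vs. 11.7% difference in the baseline is rounding):

```python
from math import sqrt

p, n = 0.475, 18.45   # Aldridge's season 2P% and attempts per game

base_var = p * (1 - p) / n      # variance of game FG% if p never changes
base_sd = sqrt(base_var)        # ~0.116, the post's 11.7% up to rounding

for swing in (0.10, 0.05):      # underlying FG% swinging 10% or 5% up and down
    total_sd = sqrt(base_var + swing ** 2)  # variances add for independent effects
    print(f"swing of ±{swing:.0%}: SD {base_sd:.1%} -> {total_sd:.1%}")
```

The ±10% swing reproduces the 15.3% figure and the ±5% swing the 12.7% one.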
An 11.7% standard deviation in game-to-game FG% isn’t a perfect estimate of the actual variability, because the number of shots a player takes changes each game and that pushes the variance higher. However, if a player’s underlying percentage goes up and down by 5%, we still expect to see about a 1% increase in game-to-game standard deviation relative to the baseline case in which he enters each game with a constant underlying FG%. To figure out that baseline, I looked at the top 10 players from last season in 2-point attempts (2PA) and simulated each of their seasons 20 times. For each game, I kept the actual number of attempts fixed but generated a random number of makes using the player’s season-long 2-point shooting percentage (2P%). All the data are from the magnanimous treasure trove of Basketball-Reference.com.
For example, Aldridge made 8 out of 24 shots (33%) on the last game of the 2015 season. His season-long 2P% was still 47.5%, so in my 20 simulations he hit 9, 4, 9, 10, 7, 8, 14, 12, 12, 9, 12, 9, 10, 5, 10, 8, 9, 11, 6 and 12 of his 24 shots. I took the game-to-game shooting-percentage deviation in each simulated season and averaged these to get the baseline deviation. I compared this to each player’s actual game-to-game deviation, looking for the latter to be around 1% higher for most players.
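A sketch of that baseline simulation, with made-up per-game attempt counts standing in for the actual Basketball-Reference 2PA data (the season 2P% and number of games are also assumptions for illustration):

```python
import random
import statistics

random.seed(1)

p2, games = 0.475, 74   # season-long 2P% and games played (assumed values)
# Stand-in per-game attempt counts; the real analysis uses each player's actual 2PA
attempts = [random.randint(12, 25) for _ in range(games)]

def season_sd(p):
    """Game-to-game SD of shooting percentage for one simulated season at constant p."""
    pcts = [sum(random.random() < p for _ in range(n)) / n for n in attempts]
    return statistics.stdev(pcts)

# Average over 20 simulated seasons to estimate the baseline deviation
baseline = statistics.mean(season_sd(p2) for _ in range(20))
print(f"baseline game-to-game SD: {baseline:.1%}")
```

The baseline comes out around 12% here; a player whose underlying percentage truly swings ±5% game to game should show an actual deviation roughly a point higher than his own baseline.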
(Table: for each of the top 10 players in 2PA — Player, 2P%, 2PA, Baseline Deviation, Actual Deviation, Baseline – Actual, and 2PA–2P% correlation.)
Shit, I really liked that theory.
Only 2 out of 10 players have actual game-to-game variance that’s significantly higher than the baseline, and 3 have a much lower one! Three explanations come to mind:
- I messed up the math or the simulation, you can spot the error and earn yourself a gift.
- Statistical coincidence, 10 players is a small sample, shit happens.
- Some mechanism is adjusting these players’ 2P% back to the mean within a game.
An example of #3 would be if players who start the game shooting well continue by taking more and worse shots, just like Bocskocsky saw happening after a streak. In fact, a high FG% game likely has streaks of makes after which the player will take bad shots and turn the high FG% game into an average FG% one. We can at least see evidence for these players shooting more often by looking at the correlation of their 2P% with attempts. Indeed, all three shoot more when they shoot well (right column).
Does that explanation sound plausible? That’s what bad science practice sounds like: alluring, seductive, and oh-so-reasonable. A post-hoc just-so story with little support in the data is still a crappy post-hoc just-so story if I came up with it myself. The bottom line is that I spent hours on that simulation and didn’t learn much of use, but I’ll be damned if I succumb to publication bias on my own blog.
Here’s what I learned from a week of digging into the dirt of hot hand research until my own hands got tired and bloody:
- Once we account for shot difficulty (or in cases like free throws where difficulty isn’t a thing), players shoot a bit (2%-4%) better after making a few shots in a row. Probably.
- Neither gamblers nor basketball fans are horribly confused by the “hot hand fallacy”; if they overestimate the chance of a successful streak continuing, it’s not by much. Possibly.
- Science is hard. If you have a lot of analysis choices available, it’s very easy to let them lead you down a path of mirages. In the worst cases, your choice of analysis can lead you away from good conclusions (a 17% gap at Cornell!) and towards bad ones. Certainly.
Science – turns out it’s even harder than dating.