A simple idea that everyone missed, and more?

Composite of src1, src2, src3: A myth of a myth of a myth?

Joshua Miller and Adam Sanjurjo (MS) have had a simple yet striking insight about the so-called hot hand fallacy.

Today Ken and I want to discuss their insight, suggest an alternate fix, and reflect on what it means for research more broadly.

I believe what is so important about their result is that it uses the most elementary arguments from probability theory, yet seems to shed new light on a well-studied, much-argued problem. This achievement is wonderful, and it gives me hope that there are still simple insights out there that we have all missed, insights that could dramatically change the way we look at some of our most important open problems.

I would like to point out that I first saw their paper mentioned in the technical journal WSJ—the Wall Street Journal.

## The Fallacy

Our friends at Wikipedia make it very clear that they believe there is no hot hand at all:

The hot-hand fallacy (also known as the “hot hand phenomenon” or “hot hand”) is the fallacious belief that a person who has experienced success with a random event has a greater chance of further success in additional attempts. The concept has been applied to gambling and sports, such as basketball.

Serious study of the fallacy began in 1985, when Thomas Gilovich, Amos Tversky, and Robert Vallone (GTV) published a paper titled “The Hot Hand in Basketball: On the Misperception of Random Sequences” (see also this followup). In many sports, especially basketball, players often seem to get hot. For example, an NBA player who usually shoots under 50% from the field may make five shots in a row in a short time. This suggests that he is “in the zone” or has a “hot hand.” His teammates may pass him the ball more often in expectation of his staying “hot.”

Yet GTV and many subsequent studies say that this is wrong. They seem to show instead that shooting is essentially a series of independent events: each shot is independent of previous ones, and this is equally true of free throws. Of course, given the nature of this problem, there is no way to prove mathematically that hot hands do not exist. There are now many papers, websites, and a continuing debate about whether hot hands are real.

## The Fallacy Of The Fallacy

Let me start by saying that Ken and I are mostly on the fence about the fallacy. I can imagine many situations that make a hot hand mathematically possible. For one, what if a basketball player is not performing independent trials but rather executing a Markov process? That is, of course, a fancy way of saying that the player has “state.” Over the long term they will make about ${75\%}$ of their free throws, but there are times when they will shoot at a much higher percentage. And, of course, there are times when they will shoot at a much lower percentage; a “cold hand” is more obvious with free throws and may lead the other team to choose to foul that player.
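To make the Markov idea concrete, here is a minimal Python sketch of a hypothetical two-state shooter. The function names and the parameter values (make probability 0.9 when “hot,” 0.6 when “cold,” and a 0.9 chance of staying in the current state after each shot) are our own illustrative assumptions, chosen so that the long-run rate comes out to 75%.

```python
import random

def simulate_markov_shooter(num_shots, p_hot=0.9, p_cold=0.6, stay=0.9, seed=0):
    """Simulate a hypothetical two-state ("hot"/"cold") shooter.

    The hidden state persists with probability `stay` after each shot;
    the make probability depends only on the current state.
    Returns a list of booleans (True = make).
    """
    rng = random.Random(seed)
    hot = rng.random() < 0.5  # symmetric switching, so the stationary split is 50/50
    shots = []
    for _ in range(num_shots):
        shots.append(rng.random() < (p_hot if hot else p_cold))
        if rng.random() >= stay:  # switch state with probability 1 - stay
            hot = not hot
    return shots

def make_rates(shots):
    """Return (overall make rate, make rate on shots right after a make)."""
    overall = sum(shots) / len(shots)
    after_make = [b for a, b in zip(shots, shots[1:]) if a]
    return overall, sum(after_make) / len(after_make)

shots = simulate_markov_shooter(200_000)
overall, after_make = make_rates(shots)
print(f"overall {overall:.3f}, after a make {after_make:.3f}")
```

With these assumed parameters the long-run make rate is exactly 0.75, while the make rate right after a make works out to about 0.774, so the simulated values should land close to those: the shooter looks “hot” without any single shot depending directly on the previous one.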

I personally once experienced something that seems to suggest that there is a hot hand. I was a poor player of stick-ball when I was young, but one day played like the best in the neighborhood. Oh well. Let me explain this in more detail another day.

Ken chimes in: This folds into a larger question that is not fallacious, and that concerns me as I monitor chess tournaments:

What is the time or game unit of being in form?

In chess I think the tournament, more than the game or the move, is the “unit of form,” though I am still a long way from testing this rigorously. In baseball the saying is, basically, “Momentum stops with tomorrow’s opposing starting pitcher.”

Nevertheless, I (Ken) wonder how far “hot hand” has been tested in fantasy baseball. In my leagues on Yahoo! one can exchange players at any time. Last month I dropped the slumping Carlos Gomez in two leagues even though Yahoo! still ranked him the 8th-best player by potential. In one league I picked up Jake Marisnick, another Houston outfielder who had been hot. Marisnick soon went cold with the bat but stole 6 bases to stay within the top 100 performers anyway. The leagues ended with the regular season, but had they continued into the playoffs I definitely would have picked up Houston’s Colby Rasmus after 3 homers in 3 games. While I’ve been writing the last three main sections of this post, Rasmus hit another homer today.

## The New Insight

MS have a new paper with the long title, “Surprised by the Gambler’s and Hot Hand Fallacies? A Truth in the Law of Small Numbers.” Their paper is interesting enough that it has already stimulated another paper by Yosef Rinott and Maya Bar-Hillel, who do a great job of explaining what is going on. The statistician and social science author Andrew Gelman explained the basic point in even simpler terms on his blog in July, and we will follow his lead.

For the simplest case consider sequences of ${n}$ coin tosses. We will generate lots of these sequences at random. We will look at each time the coin comes up heads (${H}$) and say it is “hot” if the next toss is also ${H}$. If the sequence is all tails, however:

$\displaystyle TT \dots TT,$

then we must discard it since there is no ${H}$ to look at. This is our first hint of a lurking bias. Likewise if only the last throw is heads,

$\displaystyle TT\dots TH,$

there is no chance for a hot hand since the “game is over.” Hence it seems natural to use the following sampling procedure:

1. Generate a sequence ${s}$ at random.

2. If it has no head in the first ${n-1}$ places, discard it.

3. Else pick a head at random from those places. That is, pick ${i \leq n-1}$ such that ${s_i = H}$.

4. Score a “hit” if ${s_{i+1} = H}$, else “miss.”

Let HHS be the number of hits after ${M}$ trials (not counting discarded ones) divided by ${M}$. That is our “hot hand score.” We expect HHS to converge to ${0.5}$ as ${M}$ gets large. Indeed, there seems to be a simple way to prove it: Consider any head ${H}$ that we chose in step ${3}$. The next flip ${s_{i+1}}$ is equally likely to be ${H}$ or ${T}$. If it is ${H}$ we score a hit, else a miss. Hence our expected hit ratio will be 50%. Q.E.D. And wrong.
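A quick Monte Carlo run of steps 1 to 4 already casts doubt on that proof. The sketch below is our own illustration (the function name `hot_hand_score` is hypothetical); it uses a fair coin and counts only non-discarded trials toward ${M}$.

```python
import random

def hot_hand_score(n, trials, seed=0):
    """Estimate HHS by the four-step sampling procedure in the text.

    Each kept trial picks one head uniformly from places 1..n-1 of a
    random fair-coin sequence and scores a hit if the next toss is heads.
    Sequences with no head in the first n-1 places are discarded and
    do not count toward `trials`.
    """
    rng = random.Random(seed)
    hits = kept = 0
    while kept < trials:
        s = [rng.random() < 0.5 for _ in range(n)]   # step 1: True = heads
        heads = [i for i in range(n - 1) if s[i]]     # heads in the first n-1 places
        if not heads:
            continue                                  # step 2: discard
        i = rng.choice(heads)                         # step 3: pick one of those heads
        hits += s[i + 1]                              # step 4: hit if the next toss is heads
        kept += 1
    return hits / kept

print(hot_hand_score(3, 100_000))
```

For ${n = 3}$ the estimate settles near ${0.417}$ rather than ${0.5}$, which is what the exact count for ${n = 3}$ explains.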

To see why, consider ${n = 3}$. The sequences ${TTT}$ and ${TTH}$ are discarded, so we can skip them. Here are the other six sequences and their expected contribution to our hit score. Remember only a head in the first two places is selected.

$\displaystyle \begin{array}{c|c|c} Seq. & Prob. & Score\\ \hline THT & 0/1 & 0\\ THH & 1/1 & 1\\ HTT & 0/1 & 0\\ HTH & 0/1 & 0\\ HHT & 1/2 & 0.5\\ HHH & 2/2 & 1\\ \hline \end{array}$

This gives HHS ${= 2.5/6 \approx 0.417}$, not ${0.5}$. What happened?
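The table can be reproduced, and extended to longer sequences, by exact enumeration. Below is a minimal sketch using exact fractions; `expected_hhs` is a hypothetical helper of ours that averages the per-sequence score over all non-discarded fair-coin sequences, exactly as in the table.

```python
from fractions import Fraction
from itertools import product

def expected_hhs(n):
    """Exact expected hot hand score for fair-coin sequences of length n.

    Averages, over all non-discarded sequences, the fraction of heads in
    places 1..n-1 that are followed by a head (the 'Score' column above).
    """
    total = Fraction(0)
    kept = 0
    for s in product("HT", repeat=n):
        heads = [i for i in range(n - 1) if s[i] == "H"]
        if not heads:
            continue  # discarded, like TTT and TTH when n = 3
        hits = sum(1 for i in heads if s[i + 1] == "H")
        total += Fraction(hits, len(heads))
        kept += 1
    return total / kept

for n in (3, 4, 5):
    print(n, expected_hhs(n), float(expected_hhs(n)))
```

This should print ${5/12}$ for ${n = 3}$, matching ${2.5/6}$, and then ${17/42}$ and ${49/120}$ (which is ${147/360}$ in lowest terms) for the next two lengths.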

## A Statistical Heads-Up

What happened is that the heads are not truly chosen uniformly at random. Look again at the six rows. There are 8 heads total in the first two places. If you choose one of them uniformly at random over the whole data set, you find that 4 of them are followed by ${H}$ and the other 4 by ${T}$, so you get the expected 50%. The flaw in our previous sampling is that it was shortchanging the two heads-rich sequences, weighting each ${1/6}$ instead of ${1/4}$. This is the bias.
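The difference between the two weightings can be checked directly. In this sketch (the helper name is ours), “pooled” picks uniformly among all eligible heads in all kept sequences, while “averaged” picks a sequence uniformly and then a head within it, as the original procedure does.

```python
from fractions import Fraction
from itertools import product

def pooled_vs_averaged(n):
    """Compare two ways of scoring fair-coin sequences of length n.

    pooled:   pick a head uniformly from ALL eligible heads in ALL kept sequences
    averaged: pick a kept sequence uniformly, then a head uniformly within it
    Also returns the total number of eligible heads.
    """
    pooled_hits = pooled_heads = 0
    averaged_sum, kept = Fraction(0), 0
    for s in product("HT", repeat=n):
        heads = [i for i in range(n - 1) if s[i] == "H"]
        if not heads:
            continue
        hits = sum(1 for i in heads if s[i + 1] == "H")
        pooled_hits += hits
        pooled_heads += len(heads)
        averaged_sum += Fraction(hits, len(heads))
        kept += 1
    return Fraction(pooled_hits, pooled_heads), averaged_sum / kept, pooled_heads

pooled, averaged, heads = pooled_vs_averaged(3)
print(heads, pooled, averaged)
```

For ${n = 3}$ this reports the 8 heads, a pooled rate of ${1/2}$, and a per-sequence average of ${5/12}$. The pooled rate is exactly ${1/2}$ for every ${n}$, since each place ${i \leq n-1}$ is followed by ${H}$ in half of the sequences where it is ${H}$.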

You might think the bias would quickly lessen as the sequence length grows, but that too is wrong. Try ${n = 4}$. You get the weird fraction ${\frac{17}{42}}$. This equals ${0.405\dots}$, which has gone down. For ${n = 5}$ we get ${\frac{147}{360} = 0.408\dots}$. As ${n \rightarrow \infty}$ the expectation does come back up to ${1/2}$, but for ${n = 20}$, which is a typical high end for the number of shots a player takes in an NBA game, it is still under ${0.474}$. For conditioning on a single head (the case ${k = 1}$ below), with probability ${p}$ of heads, ${q = 1-p}$ of tails, and ${m = n-1}$, Rinott and Bar-Hillel derive the expectation as

$\displaystyle \frac{p}{m} + \frac{p}{1-q^{m}} - \frac{1}{m} = \frac{p}{1-q^{m}} - \frac{1-p}{m} = \frac{1-q}{1-q^{m}} - \frac{q}{m}.$
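The closed form can be checked against a probability-weighted enumeration for any ${p}$, not just a fair coin. This is our own verification sketch (both helper names are hypothetical); it conditions on at least one head in the first ${m = n-1}$ places and uses exact fractions.

```python
from fractions import Fraction
from itertools import product

def expected_score(n, p):
    """Exact E[fraction of heads in places 1..n-1 followed by a head],
    conditioned on at least one such head, for i.i.d. tosses with P(H) = p."""
    q = 1 - p
    num = Fraction(0)
    den = Fraction(0)
    for s in product((1, 0), repeat=n):          # 1 = heads, 0 = tails
        heads = [i for i in range(n - 1) if s[i]]
        if not heads:
            continue
        weight = p ** sum(s) * q ** (n - sum(s))  # probability of this sequence
        num += weight * Fraction(sum(s[i + 1] for i in heads), len(heads))
        den += weight
    return num / den

def closed_form(n, p):
    """Rinott and Bar-Hillel's k = 1 expression with q = 1 - p and m = n - 1."""
    q, m = 1 - p, n - 1
    return p / (1 - q ** m) - q / m

print(expected_score(4, Fraction(3, 5)) == closed_form(4, Fraction(3, 5)))
print(float(closed_form(20, Fraction(1, 2))))  # about 0.4737
```

One way to see why the two agree: given ${j}$ heads among the first ${m}$ places, the expected fraction is ${(j - 1 + p)/m}$, and averaging over ${j \geq 1}$ with ${E[j \mid j \geq 1] = mp/(1-q^m)}$ gives the formula above.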

The bias persists if we condition on ${k}$ heads in a row. “Appendix B” in the MS paper attempts to find a formula like the one above for ${k > 1}$ but has to settle for partial results, which imply, among other things, that the bias grows with ${k}$. We can give some concrete numbers: For ${k = 2}$ and ${n = 4}$ the chance of seeing heads next is the same ${2.5/6 \approx 0.417}$. For ${k=2}$ and ${n=6}$ it has dipped to ${3/8 = 0.375}$, with 26 of the 64 sequences being thrown out. The larger point, however, is that these numbers represent the expectation for any one sequence ${s}$ that you generate, provided you follow the policy of selecting at random from all the ${H^k}$ subsequences in ${s}$ that end by place ${n-1}$ and scoring whether the next place is ${H}$.
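The ${k > 1}$ numbers can be reproduced the same way. In this sketch (our own; `k_run_score` is hypothetical), a “${k}$-run” is any window of ${k}$ consecutive heads ending at or before place ${n-1}$, and each kept sequence is scored by the fraction of its ${k}$-runs followed by a head.

```python
from fractions import Fraction
from itertools import product

def k_run_score(n, k):
    """Exact expected score when conditioning on k heads in a row (fair coin).

    A 'k-run' is any window of k consecutive heads ending at place j with
    j <= n - 1 (1-indexed), so that a next toss exists. For each kept
    sequence, the score is the fraction of its k-runs followed by a head.
    Returns (expected score, number of discarded sequences).
    """
    total, kept, discarded = Fraction(0), 0, 0
    for s in product("HT", repeat=n):
        # 0-indexed end positions e of k-runs, with e + 1 still inside s
        ends = [e for e in range(k - 1, n - 1)
                if all(c == "H" for c in s[e - k + 1:e + 1])]
        if not ends:
            discarded += 1
            continue
        hits = sum(1 for e in ends if s[e + 1] == "H")
        total += Fraction(hits, len(ends))
        kept += 1
    return total / kept, discarded

print(k_run_score(4, 2))
print(k_run_score(6, 2))
```

For ${k = 2}$ this gives ${5/12}$ with 10 of 16 sequences discarded at ${n = 4}$, and ${3/8}$ with 26 of 64 discarded at ${n = 6}$.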

Did GTV really follow this sampling strategy? It seems so. It comes about if you follow the letter of conditioning on the event that the previous ${k}$ flips were heads. Rinott and Bar-Hillel quote words to that effect from the GTV paper, which also say that GTV compared this with the probability conditioned on the last ${k}$ shots being misses. The gist of what they and MS are saying is:

If you did your hot-hand study as above and got 50%, then chances are the process from which you drew the data really was “hot.”

And since conditioning on misses biases the hit rate upward, toward ${0.6}$ for short sequences, if you got ${0.5}$ again then your data might really also show a “cold hand.” It should be noted, however, that the GTV study of free throws is not affected by this issue and indeed already abides by our ‘grid’ suggestion below (see also the discussion by MS beginning on page 10 here).

## Subtler Bias Still

MS give some other explanations of the bias, one roughly as follows: If a sequence starts with ${k}$ heads followed by ${T}$, then you had no choice but to select those ${k}$ heads. But if they are followed by another ${H}$, then we do have a choice: we could select the latter ${k}$ heads instead, and that selection fails if the next place is ${T}$. So the positive case is not as rich as the negative one: getting positive results (more heads) also gives more ways to make selections that yield negative results.

This suggests a fix, but the fix doesn’t work: Let us only condition on sequences of ${k}$ heads that come after a tail or start at the beginning of the game. The above issue goes away, but now there is a bias in the other direction. It shows first with ${k = 1}$ and ${n = 4}$; we have grouped sequences differing only in the last bit:

$\displaystyle \begin{array}{c|c|c} Seq. & Prob. & Score\\ \hline TTTT,\;TTTH & - & -\\ TTHT,\;TTHH & 1/2 & 0.5\\ THTT,\;THTH & 0/2 & 0\\ THHT, \; THHH & 2/2 & 1\\ HTTT, \; HTTH & 0/2 & 0\\ HTHT, \; HTHH & 1/4 & 0.25\\ HHTT, \; HHTH & 2/2 & 1\\ HHHT, \; HHHH & 2/2 & 1\\ \hline \end{array}$

This gives ${3.75/7 \approx 0.536}$. The main culprit again is the selection within the sequence ${HTH*}$, with the discarding of ${TTT*}$ also a factor.
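The attempted fix can be enumerated just as easily. In this sketch (the helper name is ours), only ${k}$-runs that start the sequence or come right after a tail are eligible; everything else matches the earlier procedure.

```python
from fractions import Fraction
from itertools import product

def run_start_score(n, k):
    """Expected score under the attempted fix: only k-runs of heads that
    start the sequence or come right after a tail are eligible (fair coin)."""
    total, kept = Fraction(0), 0
    for s in product("HT", repeat=n):
        ends = [
            e for e in range(k - 1, n - 1)
            if all(c == "H" for c in s[e - k + 1:e + 1])
            and (e - k + 1 == 0 or s[e - k] == "T")  # run starts at place 1 or after a T
        ]
        if not ends:
            continue
        hits = sum(1 for e in ends if s[e + 1] == "H")
        total += Fraction(hits, len(ends))
        kept += 1
    return total / kept

print(run_start_score(3, 1), run_start_score(4, 1))
```

At ${n = 3}$ the rule happens to give exactly ${1/2}$, but at ${n = 4}$ it gives ${15/28}$, which is the ${3.75/7 \approx 0.536}$ of the table.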

## A Grid Idea

This train of thought led me (Ken) to an ironclad fix, different from the test combining conditionals that MS describe in their 2014 working paper. It simply allows only one selection from each non-discarded sequence, namely, testing the bit after the first ${k}$ consecutive heads. The line ${HTH*}$ then gives ${0/2}$ and makes the whole table ${3.5/7 = 0.5}$ overall.
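Here is a minimal enumeration sketch of this “first run only” rule (the helper name is ours). It weights each sequence by its probability, so it can also be run with a biased coin, where the expected tested bit should equal ${p}$.

```python
from fractions import Fraction
from itertools import product

def first_run_score(n, k, p=Fraction(1, 2)):
    """Expected next-toss result under the grid rule: test only the toss
    right after the FIRST k consecutive heads, for i.i.d. tosses with P(H) = p."""
    num = den = Fraction(0)
    for s in product((1, 0), repeat=n):          # 1 = heads, 0 = tails
        # first 0-indexed end e of a k-run that still has a next toss in s
        first = next((e for e in range(k - 1, n - 1) if all(s[e - k + 1:e + 1])), None)
        if first is None:
            continue                             # no eligible run: discard
        weight = p ** sum(s) * (1 - p) ** (n - sum(s))
        num += weight * s[first + 1]
        den += weight
    return num / den

print(first_run_score(4, 1), first_run_score(6, 2, Fraction(3, 5)))
```

The first call reproduces the ${3.5/7 = 0.5}$ of the table. The second returns ${3/5}$, consistent with the tested bit being an unbiased draw from the coin.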

That this is unbiased is easy to prove: Given ${s}$ in the set ${S}$ of non-discarded sequences, define ${s'}$ to be the sequence obtained by flipping the bit after the first ${k}$ consecutive heads. The flip leaves everything up to and including that first run unchanged, so ${s'}$ is also in ${S}$ with the same first run. Hence the map is invertible (indeed ${(s')' = s}$), and it flips the result. Hence the next-bit expectation is ${0.5}$. For any probability ${p}$ of heads, the same reasoning shows that the tested bit is independent of everything before it, so its expectation is ${p}$; nor does it matter whether the overall sequence length ${n}$ is kept constant between samples. Hence our proposal is:

Break up the data into subsequences ${s_j}$ of any desired lengths ${n_j}$. The ${n_j}$ may depend on the lengths of the given sequences (e.g., how many shots each player surveyed took) but of course not on the data bits themselves. For every ${s_j}$ that has a run of ${k}$ consecutive “hits” in the first ${n_j - 1}$ places, take the first such run and test whether the next bit is “hit” or “miss.”
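A minimal sketch of the proposal applied to shot records, assuming each player’s shots are stored as a list of booleans (True = hit). The function name `grid_test` and the fixed chunk length are our own illustrative choices; the chunk boundaries depend only on the list lengths, never on the bits.

```python
def grid_test(shot_sequences, chunk_length, k):
    """Apply the grid rule to shot data (hypothetical helper, a sketch).

    shot_sequences: one list per player, True = hit, False = miss.
    Each list is cut into consecutive chunks of `chunk_length` shots (the last
    chunk may be shorter). Within each chunk, find the first run of k hits
    ending before the chunk's last shot and record whether the next shot hit.
    Returns the list of recorded outcomes.
    """
    outcomes = []
    for shots in shot_sequences:
        for start in range(0, len(shots), chunk_length):
            chunk = shots[start:start + chunk_length]
            for end in range(k - 1, len(chunk) - 1):
                if all(chunk[end - k + 1:end + 1]):
                    outcomes.append(chunk[end + 1])
                    break  # only the first run in each chunk counts
    return outcomes

# Made-up toy records for two players, for illustration only.
toy = [[True, False, True, True, True, False], [False, True, True, True]]
print(grid_test(toy, chunk_length=3, k=1))
```

The hit rate over the returned outcomes is the estimate; on the toy lists with ${k = 1}$ it records three tests, one per qualifying chunk, and skips the one-shot leftover chunk.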

The downsides to putting this kind of grid on the data are sacrificing some samples—that is, discarding cases of ${H^k}$ that cross gridlines or come second in an ${s_j}$—and the arbitrary nature of choosing rules for ${n_j}$. But it eliminates the bias. We can also re-sample by randomly choosing other grid divisions. The samples would be correlated but estimation of the means would be valid, and now every sequence of ${H^k}$ might be used as a condition. The re-sampling fixes the original bias in choosing every such sequence singly as the condition.
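The re-sampling step can be sketched by drawing many random grids and averaging the per-grid hit rates. Everything here is our own assumption for illustration: the helper names, the uniform chunk lengths from 2 to `max_len`, and the number of grids. The random lengths come only from the generator, never from the shot data.

```python
import random

def first_run_outcome(chunk, k):
    """Next shot after the first run of k hits that ends before the chunk's
    last shot, or None if there is no such run."""
    for end in range(k - 1, len(chunk) - 1):
        if all(chunk[end - k + 1:end + 1]):
            return chunk[end + 1]
    return None

def resampled_grid_estimate(shot_sequences, k, num_grids=200, max_len=10, seed=0):
    """Average the grid-rule hit rate over many random grid divisions.

    Chunk lengths are drawn uniformly from 2..max_len using only the RNG,
    so each grid is chosen independently of the bits. For i.i.d. data each
    grid's estimate is unbiased; estimates from different grids are correlated.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(num_grids):
        outcomes = []
        for shots in shot_sequences:
            start = 0
            while start < len(shots):
                length = rng.randint(2, max_len)
                result = first_run_outcome(shots[start:start + length], k)
                if result is not None:
                    outcomes.append(result)
                start += length
        if outcomes:
            estimates.append(sum(outcomes) / len(outcomes))
    return sum(estimates) / len(estimates)
```

On simulated fair-coin shooting the average should sit near ${0.5}$; the correlation between grids means it behaves like a smoothed single estimate rather than many independent ones.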

A larger lesson is that hidden bias might be rooted out by clarifying the algorithms by which data is collated. Researchers count as part of the public addressed in articles on how not to go astray such as this and this, both of which highlight pitfalls of conditional probability.

## Open Problems

Has our grid idea been tried? Does it show a “hot hand”?

Update (10/18): The idea is apparently new, though as we admit in the post it is “lossy.” Meanwhile this has been covered in the Review section of the Sunday 10/18 New York Times, which in turn references and gives a chart from this post by Steven Landsburg on his “Big Questions” blog.

49 Comments
1. October 13, 2015 10:28 am

I think some of the subtle bias problem stems from the “hot hand” focus and could be avoided by posing the question as whether hot and cold “streaks” actually exist. Then it’s a simple autocorrelation problem, e.g., a test of whether the sequence obtained from taking the product of consecutive +1/-1 pairs tends toward zero mean.

2. October 13, 2015 1:12 pm

Wow! I just heard about it today from Yosi Rinot (mentioned above), whom I met coincidentally in a Tel Aviv street, and planned to write about it before seeing this post…

3. October 14, 2015 1:19 am

This was already discussed in two blog entries by Andrew Gelman, here and there.

• October 14, 2015 9:43 am

The former post by Gelman is referenced prominently in our post already. I did not see the latter from Sept. 30—these words are notable:

P.S. Miller points out that, for real shooting data (as opposed to coin flips) there is no simple weighted averaging that would give you the correct hot-hand estimate, as such an average would not correct for differences between players. That’s why I think the ultimate way to go will be to fit a Bayesian analysis using Stan. We’ve done some steps toward this but our model is still in a simple and preliminary stage.

Nevertheless, I wonder whether the grid idea, which is provably right for coin flips, still helps here, or whether (perhaps along with re-sampling) it just has the same effect as the Bayesian procedure.

4. October 14, 2015 7:38 pm

I appreciate this work. Minutia: I believe “shortchanging the two heads-rich sequences” should say “shortchanging the two-heads-rich sequences” — i.e., there are not just two of them.

• October 15, 2015 9:51 am

We appreciate the scrutiny too. I (Ken) did mean the two sequences HHT and HHH in the table which got weighted 1/6; the sequences HTH and THH don’t “count” because the second H in the 3rd position does not affect the selection.

I had a similar thought recently about a newspaper article on the ABC conjecture that stated it “roughly” as “if a + b = c and (…) then c is divisible by only a few, large primes.” I first read it without the comma, but actually the comma is correct.

5. Udi Wieder
October 15, 2015 2:35 pm

Statistics aside, from a sports analysis point of view these studies should be taken with a huge grain of salt. In basketball, the belief in the hot hand streak affects both the shot selection of the player and the defensive schemes of her opponents. The incentive structure is complex and can run both ways. Personally I think elite athletes are exceptionally attuned to their bodies and their performance. If they believe the phenomenon exists, I believe it too.

6. October 15, 2015 9:31 pm

It is one thing to discuss these concepts with dice rolls and slot machines, which produce largely random outcomes (assuming the dice are rolled in a “thorough” manner and the slots are not fixed); it is very different, and probably inappropriate, to apply them to successive acts in a game of skill, where so many factors ARE in the control of the performer. A ball-player really CAN ‘get hot’: perhaps they are finally getting the sleep they need, or have a new practice regimen that is more effective, or have started thinking about the act of shooting in a different way… The point is, a skilled player shooting a basketball is a *radically* different class of activity than a random dice roll.

7. October 15, 2015 11:06 pm

There’s what is surely a related superstition in pinball. Many pinball leagues run by match play, where groups of four people take turns, each picking a table, and gaining league points based on who scores highest, second highest, et cetera.

Anyway, there’s a folkloric belief that the pinball table you pick will betray you, and you’ll do terribly on it.

Someone gathered the data for the Grand Rapids (Michigan) Pinball League, and found that the table a player picks gave them a first-place finish … nearly a quarter of the time. But they did finish third or fourth slightly more than half the time. It’s probably not statistically significant, but it does suggest the folklore is at least not perfectly unfounded. It just mostly is.

• October 15, 2015 11:08 pm

(I should clarify: this sort of league usually pits players against others of roughly equal ability. The points players score are used to move them up or down in the rankings, and groups are organized by that ranking. So you should expect to come in first about a quarter of the time, second a quarter of the time, et cetera.)

• October 15, 2015 11:46 pm

Replying to several above: JKU makes an excellent point—there are studies of streaks including a presentation by the venerable Jim Albert which I heard at the 2013 NeSSiS (New England Symposium on Statistics in Sports). Also there was the noted paper by Andrew Bocskocsky, John Ezekowitz, and Carolyn Stein of Harvard on how basketballers who are “hot” are observed to take more difficult shots. I left these on the cutting-room floor to keep the length down, also figuring I’d touched such sports-specific matters by my insert about the “game unit of form” into what Dick wrote. Also left out was a study of “hot hand” in chess by Alexander Matros and Irina Murtazashvili which I think is still being completed, maybe awaiting some belated input by me…?

8. October 18, 2015 12:50 pm

Nice work. I have long been skeptical of the standard “hot hand fallacy” consensus. I feel like it’s easier to see in the inverse: “Do cold hands exist?”. Certainly. When I came down with pneumonia in high school my basketball play was completely ruined, of course; an extreme example.

Another point of bias/filtering is that the coaches and other players on the court are assessing and adjusting for this at all times. A truly cold (sick?) player won’t get put in; a hot player will get the ball more often, and a cold player less often (by teammates; perhaps the reverse for opponents). So likely the apparent shooting rates are higher than they would be in a theoretical everyone-takes-a-shot-once-per-minute distribution.

I wonder if this has implications for the similar “no fund manager is really better than any other, it’s just misinterpretations of random sequences” consensus (particularly as a selling point for index funds). I’m likewise skeptical of that idea being completely true.

9. October 21, 2015 10:06 pm

Forget Colby Rasmus. After just now watching NYM 2B Daniel Murphy hit his 6th homer in 6 postseason games, the inverse of the hot-hand “fallacy” should be renamed “Murphy’s WAR.”

[Delta—yes, what you say goes under my umbrella intent of the “unit of form”.]

10. George Watson
January 13, 2016 5:51 pm

The problem with looking at the “Hot Hand” is that shooting a basketball is not a random event. A very tall basketball player could always and only “shoot” when he is near enough to the basket to make it almost a surety that the ball goes in. [No dunking]

When you flip a coin you either get heads or tails; what you don’t get is a near miss, which should count for something when a person has a “Hot Hand” in basketball.

By only counting success or failure, instead of the whole range from near success to massive failure, you discount the non-randomness of shooting a basketball and create a bias that mimics flipping a coin.