
The Entropy of Baseball

April 1, 2018

The most shocking existential fact about the universe?

Sports April Fools source

George Ruth Jr., the “Babe,” may have thought he had cosmic significance but no one knew it until now. He would have said it was all a joke anyway. He certainly loved pranks. As an April Fool’s joke during Florida spring training, he once let it be reported that he had slimmed down to 108 pounds and was beginning a new career as a jockey.

Today we report how the Babe—and every major-league player from David Aardsma and Henry Aaron to Edward Zwilling and Tony Zych—helped uncover a fact about the universe.

This came to light because our correspondents Faadosly and Lofa Polir found a scientific job that suits their talents. They both were hired to the blind injection team at the Laser Interferometer Gravitational-Wave Observatory (LIGO) installation near Livingston, Louisiana.

Blind injection is a protocol whereby a false signal is superposed on the data taken by the main apparatus to test how the main scientific team reacts to it. It is April Fools but with serious intent. As this article noted about the first such trial:

The envelope opened in March 2011 to reveal a fake. The good news was that the team correctly identified the signal. The better news was [that two discrepancies from] what the injection team had expected them to see … turned out to be mistakes by the blind injection team themselves, revealed by the sweat of the [main] LIGO team!

Further successes through 2016 led to implementing the opposite kind of trial where failure would count as success. But it succeeded. Let’s describe further what was involved.

Signals to Treat as Noise

LIGO makes inferences about cosmic events from fluctuations on the tiniest of scales, {10^{-18}} meters, a nano-nanometer. It is subject to random fluctuations of sources from quantum to cosmic. The game is to tell specific events apart from the random background.

Many specific kinds of events are known. Seismic activity is subtracted out by an isolation mechanism that compares absolute and relative motions. Cargo trains that rumble at known times twice daily 7km from the Livingston detector have blunted it enough to count as downtime.

Most local disturbances, however, have been simply identified and discounted by the fact of having two LIGO detectors, the other near Hanford in the state of Washington. Others will open worldwide. True cosmic events will register at both (or all) detectors at precisely known relative times. Signals that show up at only one can be subtracted out.

Of course the random events at the two LIGO detectors differ from each other. Originally it was not considered terrible to subtract out one detector’s random noise from the other’s, basically because

\displaystyle  \mathit{random~XOR~anything = random}.
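The slogan above can be checked exhaustively at the byte level. This is only a minimal sketch, assuming that "random" means a uniformly distributed byte and "anything" means a fixed byte independent of it; the function names are ours, for illustration. XOR with a fixed byte permutes the 256 byte values, so a uniform input stays uniform.

```python
from collections import Counter


def xor_distribution(fixed_byte: int) -> Counter:
    """Count the outputs of r ^ fixed_byte as r takes each byte value once."""
    return Counter(r ^ fixed_byte for r in range(256))


def is_uniform(counts: Counter) -> bool:
    """True when all 256 byte values occur exactly once."""
    return len(counts) == 256 and set(counts.values()) == {1}


# Check every possible fixed byte, not just a sample.
all_uniform = all(is_uniform(xor_distribution(a)) for a in range(256))
print(all_uniform)
```

The check says nothing about the case where the two inputs are correlated, which is exactly the boundary the blind injection was meant to probe.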

Yet with an A+ version of LIGO nearing deployment, it became exigent to test the boundary between systematic and random discrepancies.

The idea was to insert a signal of systematic origin that behaves like random noise—or so we believe. The outputs of strong pseudorandom generators were considered but rejected as artificial. This is when the Polirs suggested a source of hallowed significance in quantum physics.

The Inside Baseball

Stephen Hawking famously conceded his loss of a bet to John Preskill over the black hole information paradox. Preskill’s prize in 2005 was a copy of Total Baseball: The Ultimate Baseball Encyclopedia. As the Amazon blurb for the current edition states:

About half of the volume is made up of detailed statistics for every player ever to appear in a major league game. Other statistical sections, including records, awards, and MVP and Hall of Fame voting results, help round out this tribute to the statistical minutiae that fascinates many baseball fans.

The edition exists in digital form. As a tribute to Hawking, the blind-injection team agreed to use this as the signal.

Some initial processing was done on the data. Symbolic categories like hit, homer, strikeout, and walk were converted to digital form. Obvious redundancies such as each event’s contribution to pitching and hitting stats were subtracted out. Known biases such as Benford’s Law were removed by non-lossy transforms. Not just the 2,300 dense pages of statistics in the print edition, but also the acres of raw recorded in-game data from which they were compiled in later seasons, were refined into a 6.4-gigabyte stream {B} that was believed indistinguishable from white noise.

The final key was that what they fed as {L} to Livingston was not {B}, but rather the item-scale difference (essentially the XOR) of {B} with the signal {H} detected milliseconds earlier at Hanford. This was done on March 17. The blind team had control of the system clock so that the scientific teams attached to both places would not know of the delay. Hence the teams’ own differencing would give not random noise but rather {B} back again.
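The recovery step relies on XOR being its own inverse: if {L} is {B} XOR {H}, then {L} XOR {H} is {B} again. Here is a minimal sketch of that identity on byte strings, treating the "item-scale difference" as a plain bytewise XOR. The helper names `inject` and `difference` and the sample data are hypothetical stand-ins, not anything taken from the LIGO pipeline.

```python
import os


def xor_bytes(x: bytes, y: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    if len(x) != len(y):
        raise ValueError("streams must have the same length")
    return bytes(a ^ b for a, b in zip(x, y))


def inject(b_stream: bytes, h_stream: bytes) -> bytes:
    """What the blind team feeds to Livingston: L = B XOR H."""
    return xor_bytes(b_stream, h_stream)


def difference(l_stream: bytes, h_stream: bytes) -> bytes:
    """What the scientific teams compute: L XOR H."""
    return xor_bytes(l_stream, h_stream)


b_stream = b"Ruth 714 HR"             # hypothetical stand-in for B
h_stream = os.urandom(len(b_stream))  # stand-in for the Hanford noise H
l_stream = inject(b_stream, h_stream)
recovered = difference(l_stream, h_stream)
print(recovered == b_stream)
```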

What the ‘Sweat’ Revealed

It should be understood that much of the “sweat” quoted above came not from the human team members but from supercomputing of incredibly high bandwidth using a server farm in Quincy to the north of Hanford. The new massively parallel deep neural rules-based classification algorithm by Xiaowei Gu et al. was used to build generative models for each of the Livingston stream {L}, the Hanford stream {H}, and their item-scale difference, which was {B}. This was the shock:

The generative model for {B} filled only 300 megabytes.

Not only that, the model was stratified so that, for example, the records for 2017 were compressed under 4.5MB. The upshot is:

There is a file {B_0} smaller than a medium-quality JPEG photo from which the entire recorded events of the 2017 Major League season can be recovered verbatim with shallow post-processing (mainly applying the model rules that were found and then undoing the transforms mentioned above).
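For scale, the two sizes quoted for {B} and its generative model imply a ratio a little over 21 to 1. A quick arithmetic sketch, assuming decimal units (1 GB = 1000 MB) and taking the model size to be exactly 300 megabytes; the constant and function names are ours:

```python
STREAM_BYTES = 6.4e9  # size of the stream B quoted in the post
MODEL_BYTES = 300e6   # size of the generative model, assumed to be exactly 300 MB


def compression_ratio(original: float, compressed: float) -> float:
    """Ratio of original size to compressed size."""
    if compressed <= 0:
        raise ValueError("compressed size must be positive")
    return original / compressed


ratio = compression_ratio(STREAM_BYTES, MODEL_BYTES)
print(f"about {ratio:.1f} to 1")
```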

Interpretation and Speculation

We can compare this with the stunning main result of the recent AlphaZero paper as noted in comments to our post on it and summarized here:

There is a file {A_0} smaller than 300MB such that relatively shallow processing with access to {A_0} trounces the deep search of the world’s best chess programs, with results that appear to border on perfect play.

To focus our comparison, note that perfect play is known for all positions with 7 or fewer pieces in files that fill over 100 terabytes at the Lomonosov Moscow State University computer center. Those files {L} can be generated from a very tiny file {L_0}, one that merely specifies the rules of chess and the allowed contents of the board. However, the file {L_0} cannot be efficiently consulted—the very creation of {L} from {L_0} characterizes the notion of computational depth. The shock of AlphaZero is that the entire game of chess has been compressed into a small file that is also consultable.

What doubles the shock in baseball is that, unlike with chess, its outcomes are not rule-based. Once the batter hits the ball much of what happens is ascribed to luck. Per discussion in our previous post, no convincing evidence of a “hot hand” in baseball is extant, nothing to distinguish results of at-bats from rolling dice.

Yet the outcomes in the tables of each game over the entire season were revealed as the consequents of relatively simple rules. It is as if some cosmic power decreed:

The first pitch of the season shall be a home run. A light-hitting infielder shall hit home runs as the only scores in back-to-back 1-0 wins.

By the correspondence of information complexity to entropy, there is less entropy in specifying the first pitch than the 107,348-th as yielding a homer, or in making a team’s second game a Xerox of its first. This analogy is imperfect: of course homers will be hit on other pitches and there were other differences in the Giants’ two 1-0 wins. But it is enough to illustrate how rules can be simpler than listing random events. The point is that like the particular neural weights of AlphaZero after its training, the values and implied rules may be subtle, unexplainable, and felt only by their effects. But they are present and they are short.

Both AlphaZero’s success and the new discovery about baseball rest on the great broken syllogism of physics that flows from Hawking’s concession:

  1. Information can be neither created nor destroyed.

  2. The universe began in a state with initial conditions of low complexity.

  3. We currently observe high complexity pumped up by quantum noise.

  4. ???

The likeliest resolution may pin the discrepancy on a local/global difference. The Mandelbrot set is specified on the whole by a tiny equation, but individual vantage points on it can have high complexity. Quantum outcomes have with overwhelming likelihood positioned our local world at such a point.

Yet the efficiency of randomness generation by this process need not be 1:1. The new results from the LIGO supercomputers on the entropy of baseball suggest an upper bound of 1:20 or so. It is as if every Babe Ruth we see is really slimmed down to a jockey’s weight.

Open Problems

In all seriousness, how much entropy could we extract from the statistics of a baseball season?
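One hedged way to start on this question is to encode each plate appearance as a symbol and compare two rough estimates: the empirical Shannon entropy of the symbol frequencies, which treats outcomes as independent draws, and the bits per symbol left after a general-purpose compressor has had its turn, which gives a crude upper bound that can exploit repetition. The sketch below uses Python's standard `zlib`; the one-letter outcome codes are a hypothetical encoding, not a real data format.

```python
import math
import zlib
from collections import Counter


def shannon_entropy_bits(symbols: str) -> float:
    """Empirical entropy in bits per symbol, treating symbols as i.i.d."""
    if not symbols:
        raise ValueError("need at least one symbol")
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def compressed_bits_per_symbol(symbols: str) -> float:
    """Bits per symbol after zlib compression (a crude upper bound)."""
    if not symbols:
        raise ValueError("need at least one symbol")
    data = symbols.encode("ascii")
    return 8 * len(zlib.compress(data, 9)) / len(data)


# Hypothetical encoding: H = hit, O = out, one character per plate appearance.
sample = "HHOO"
print(shannon_entropy_bits(sample))
```

Neither number is the Kolmogorov complexity, which is uncomputable; each only bounds how much randomness a particular model fails to explain.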

We are grateful to the Polirs and their superiors for permission to report on these developments in advance of normal scientific review.

6 Comments
  1. April 1, 2018 11:54 pm

    One point I’m aware of is that a 1:20 randomness conversion in this context is not a 1-in-2^20 unlikelihood but a sicko-insane unlikelihood—not to mention probably a violation of special relativity. As for data points, while this was being composed there was a Xeroxing of wins by Notre Dame’s women’s basketball team, but this was likely due not to low entropy but rather high leprechaunality.

  2. April 3, 2018 4:13 am

    Has B ever been subjected to statistical randomness tests? It would be interesting to see the results. 🙂

    • Andrew Wilson
      April 4, 2018 1:44 am

      Prof. I. Lirpa reported a truly remarkable result regarding the Kolmogorov complexity of baseball stats; unfortunately, the margins in his copy of “Moneyball” were too small to contain the proof.

  3. jheisenberg
    April 3, 2018 8:02 am

    Currently you can follow a remake of AlphaZero at The program shows great positional play (for its current Elo of around 2000, 4/1/18) but very poor endgame skills, not surprisingly, since it started training at the beginning.

  4. Andrew Wilson
    April 5, 2018 2:40 am

    Reluctantly setting the April foolery aside for another year, is the answer to “the great broken syllogism of physics” resuscitating the Ed Fredkin/Stephen Wolfram concept of the universe as cellular automaton? Minimal information content per cell plus untold zillions of iterations produce the apparent complexity of our universe. {Mandelbrot-esque, yes.} Wolfram’s “New Kind of Science” took heavy fire when published, but, perhaps times have changed?

  5. April 7, 2018 2:18 pm

    [In reply to AW] This paper by Max Tegmark comes closest to expressing this from the CS vantage, IMHO. My own thoughts draw on analogy to converting an NFA N into a DFA M that is often exponentially larger. Under the analogy, Schrödinger’s equation describes M while what we experience is a “random” path through N. I’m not convinced that physical realization of N entails the same for M. But the path through N accrues randomness which like “manna” becomes disposable.

    [In reply to JH] Thanks. I have been following LCZero just as far as threads at
