Magnus and the Turkey Grinder
With part II of our “When Data Serves Turkey” post
Baku Olympiad source—note similarity to this |
Magnus Carlsen last week retained his title of World Chess Champion. His match against challenger Sergey Karjakin had finished 6–6 after twelve games at “Standard” time controls, but he prevailed 3–1 in a four-game tiebreak series at “Rapid” time controls. Each game took an hour or hour-plus under a budget of 25 minutes plus 10 extra seconds for each move played.
Today we congratulate Carlsen and give the second half of our post on large data being anomalous.
According to my “Intrinsic Performance Ratings” (IPRs), Carlsen played the tiebreak games as trenchantly as he played the standard games. I measure his IPR for them at 2835, though with wider two-sigma error bars +- 250 than the 2835 +- 135 which I measured for the twelve standard games. Karjakin, however, played the rapid games at a clip of 2315 +- 340, significantly below his mark of 2890 +- 125 for the regular match. The combined mark was 2575 +- 215, against 2865 +- 90 for the match. It must be said that of course faster chess should register lower IPR values. My preliminary study of the famous Melody Amber tournaments, whose Rapid sections had closely similar time controls, finds an overall dropoff slightly over 200 Elo points. Thus the combined mark was close to the expected 2610 based on the average of Carlsen’s 2853 rating and Karjakin’s 2772. That Carlsen beat his 2650 expectation, modulo the error bars, remains the story.
Carlsen finished the last rapid game in style. See if you can find White’s winning move—which is in fact the only move that avoids losing:
The win that mattered most, though, was on Thanksgiving Day when Carlsen tied up the standard match 5–5 with a 75-move war of attrition. The ChessGames.com site has named it the “Turkey Grinder” game. On this note we resume talking about some bones to pick over “Big Data”—via large data taken using the University at Buffalo Center for Computational Research (CCR).
The Logistic Relationship
If you viewed the match on the official Agon match website, you saw a slider bar giving the probability for one side or the other to win. Or rather—since draws were factored in—the stands for the points expectation, which is the probability of winning plus half the probability of drawing. This is computed as a function of the value of the position from the player’s side. The beautiful fact—which we have discussed before in connection with a 2012 paper by Amir Ban—is that is an almost perfect logistic curve. Here is the plot for all available (AA) games at standard time controls in the years 2006–2015 with both players within 10 Elo points of the Elo 2000 level:
The “SF7d00” means that the chess program Stockfish 7 was run in Multi-PV mode to a variable depth between 20 and 30 ply. My scripts now balance the total number of positions searched so that endgame positions with fewer pieces are searched deeper. “LREG2” means the generalized logistic curve with two parameters. Using Wikipedia’s notation, I start with
and fix to symmetrize. Then is basically the chance of throwing away a completely winning game—and by symmetry, of winning a desperately lost game.
Chess programs—commonly called engines—output values in discrete units of called centipawns (cp). Internally they may have higher precision but their outputs under the standard UCI protocol are always whole numbers of cp which are converted to decimal for display. They often used to regard or as the value for checkmate, but has become standard. I still use as cutoffs and divide the -axis into “slots”
Positions of value beyond the cutoff belong to the end slots. Under a symmetry option, a position of value goes into both the slot for the player to move and the slot for the opponent. This is used to counteract the “drift” phenomenon discovered in this paper with my students that the player to move has a 2–3% lower expectation across all values—evidently because that player has the first opportunity to commit a game-chilling blunder.
The “b100” means that adjacent slots with fewer than 100 moves are grouped together into one “bucket” whose value is the weighted average of those slots. Larger slots are single buckets rather than divided into buckets of 100. The end slots and zero (when included) are single buckets regardless of size. Finally, the number after “sk” for “skedasticity” determines how buckets are weighted in the regression as I discuss further on.
The -value of a bucket is the sum of wins plus half of draws by the player enjoying the value (whose turn it might or might not be to move) divided by the size of the bucket. This is regressed to find most closely giving . The slope at zero is . The quantity
gives the expectation when ahead—figuratively the handicap at odds of a pawn. Note how close this is to 70% for players rated 2000.
The Law and the Scales
The fit is amazingly good—even after allowing that the value, so astronomically close to , is benefiting from the correlation between positions from the same game, many having similar values. Not only does it give the logistic relationship the status of a natural law (along lines we have discussed), but also Ban argues that chess programs must conform to it in order to maximize the predictive power of the values they output, which transmutes into playing strength. The robustness of this law is shown by this figure from the above-linked paper—being rated higher or lower than one’s opponent simply shifts the curve left or right:
This is one of several reasons why my main training set controls by limiting to games between evenly-rated players. (The plots are asymmetric in the tail because they grouped buckets from up to rather than come in from both ends as the present ones do.)
Most narrowly to our goal, the value determines the scale by which increases in value translate into greater expectation, more directly than quantities like or . Put simplistically, if a program values a queen at 10 rather than 9, one might expect its “” to adjust by 9/10. Early versions of Stockfish were notorious for their inflated scale. The goal is to put all chess programs on a common scale by mapping all their values to points expectations—and Ban’s dictum says this should be possible. By putting sundry versions of Stockfish and Komodo and Houdini (which placed 2nd to Stockfish in the just-concluded ninth TCEC championship) on the same scale as my earlier base program Rybka 3, I should be able to carry over my model’s trained equations to them in a simple and direct manner. Here is the plot for Komodo 10’s evaluations of the same 100,000+ game positions:
The fit is just as fine. The values are small and equal to within so they can be dismissed. The values are for Komodo against for Stockfish, giving a ratio of about . The evaluations for 70% expectation, for Komodo and for Stockfish, have almost the same ratio to three decimal places. So we should be able to multiply Komodo’s values by 1.046 and plug them into statistical tests derived using Stockfish, right?
The error bars of on Komodo’s , which are two-sigma (a little north of “95% confidence”), give some pause because they have wiggle. This may seem small, but recall the also-great fit of the linear regression from (scaled) player error to Elo rating in the previous post. Under that correspondence, 2% error translates to 2 Elo points for every 100 below perfection—call that 3400. For Carlsen and Karjakin flanking 2800 that means only Elo but grows to for 2000-level players. Here is a footnote on how the “bootstrap” results corroborate these error bars and another data pitfall they helped avoid.
But wait a second. This error-bar caveat is treating Komodo’s as independent from Stockfish’s . Surely they are completely systematically related. Thus one should just be able to plug one into the other with the conversion factor and get the same proportions everywhere, right? The data is huge and both the logistic and ASD-to-Elo regressions this touches on have and the force of natural law. At least the “wiggle” can’t possibly be worse than these error bars say, can it?
Revenge of the Turkeys
Here are side-by-side comparison graphs with Stockfish and Komodo on the same set of positions played by players within 10 Elo points of 1750.
Now the Komodo is lower. Here is a plot of the -values for Komodo and Stockfish over all rating levels, together with the Komodo/Stockfish ratio:
The ratio waddles between 0.96 and 1.06 with a quick jag back to parity for the 2700+ elite players. Uncertainty speaks a gap of 5 Elo points for every 100 under perfection, which makes a considerable 70–point difference for Elo-2000 players.
Well, we can try clumping the data into huger piles. I threw out data below 1600 and the 2800 endpoint—which has lots of Carlsen but currently excludes Karjakin since his 2772 is below 2780. I combined blocks of four levels at 1600–1750, 1800–1950, up to 2600–2750, and quadrupled the bucket size to match. Here is the plot for 2200–2350, with a move-weighted average of 2268:
With over 500,000 data points, mirrored to over a million, can one imagine a more perfect fit to a logistic curve? With Stockfish the value even prints as unity. And yet, this is arguably the worst offender in the plot of over these six piles:
The point for 2600–2750 goes down. It is plotted at 2645 since there are far many more 2600s than 2700s players, and it must be said that the 2400–2550 pile has its center 2488 north of 2475 because 2550 included all years whereas the 2000–2500 range starts in the year 2006. But the data point for 2200–2350 is smack in the middle of this range. Why is it so askew that neither regression line comes anywhere near the error bars for the data taken with the respective engine?
A Check on Just Gobbling Data
Getting a fixed value for the ratio is vital to putting engines on a common scale that works for all players. The above is anything but—and I haven’t even told what happens when Rybka and Houdini enter the picture. It feels like the engines diverge not based on their evaluation scales alone but on the differences in their values for inferior moves that human players tend to make, differences that per the part-I post correspond almost perfectly to rating. Given Amir Ban’s stated imperative to conform any program’s values to a logistic scale in order to maximize its playing strength, and the incredible fit of such a scale at all individual rating levels, how can this be?
I get similar wonkiness when I try to tune the ratio internally in my model, for instance to equalize IPRs produced with Komodo and Stockfish versions to those based on Rybka 3. There is also an imperative to corroborate results obtained via one engine in my cheating tests by executing the same process with test data from a different engine. This has been analogized to the ‘A’ and ‘B’ samples in doping tests for cycling, though those are taken at the same time and processed with the same “lab engine.”
I had hoped—indeed expected—that a stable conversion factor would enable the desirable goal of using the same model equations for both tests. I’ve become convinced this year that instead it will need voluminous separate training on separate data for each engine and engine version. A hint of why comes from just looking at the last pair of Komodo and Stockfish plots. All runs skip the bucket for an exact 0.00 value, which by symmetry always maps to 0.50. Its absence leaves a gap in Komodo’s plot, meaning that Komodo’s neighboring nonzero values carry more weight of imbalance in the players’ prospects than do 0.01 or -0.02 etc. coming from Stockfish. The data has 48,693 values of 0.00 given by Komodo 10 to only 43,176 given by Stockfish 7. Whereas, Komodo has only 42,350 values in the adjacent ranges -0.10 to -0.01 and +0.01 to +0.10 (before symmetrizing) to 47,768 by Stockfish. The divergence in plot results may be amplified by the “firewall at zero” phenomenon I observed last January. The logistic curves are dandy but don’t show the cardinalities of buckets, nor other higher-moment effects.
In the meantime I’ve been using conservative ratios for the other engines relative to Rybka. For example, my IPRs computed in such manner with Komodo 10 are:
- Regular match: Carlsen 2765 +- 170, Karjakin 2815 +- 165, combined 2790 +- 120.
- Tiebreaks: Carlsen 2675 +- 365, Karjakin 2245 +- 360, combined 2460 +- 260.
These are all 70–100 and so lower than the values I gave using Rybka. Critics of the regular match games in particular might agree more with these than my higher official numbers, but this needs to be said: When I computed the Rybka-based IPR for the aggregate of moves in all world championship matches since FIDE’s adoption of Elo ratings in 1971, and compared it with the move-weighted average of the Elo ratings of the players at the time of each match, the two figures agreed within 2 Elo points. Similarly weighting the IPRs for each match in my compendium gives almost the same accuracy.
That buttresses my particular model, but the present trouble happens before the data even gets to my model. Not even the scaling stage discussed in the last post is involved here. This throws up a raw existential question.
Much of data analytics is about “extracting the signal form the noise” when there is initially a lot of noise. Multiple layers of standard filters are applied to isolate phenomena. But here we are talking about raw data—no filters. All we have observed are the smooth linear correspondence between chess rating and average loss of position value and the even more perfect logistic relation between position value and win/draw/loss frequency. All we did was combine these two relations. The question is:
How did I manage to extract so much noise from such nearly-perfect signals?
Open Problems
Can you see an explanation for this wonkiness in my large data? What caveats for big-data analytics does it speak?
The chess answer is that Carlsen played 50.Qh6+!! and Karjakin instantly resigned, seeing Kxh6 51. Rh8 mate, and that after 50…gxh6 the other Rook drives home with 51. Rxf7 mate.
Update 12/11/16: Here is a note showing what happens when all drawn games are removed. The data point for 2200–2350 is even more rogue…
[fixed point placement in last figure, added “Baku Olympiad” to first caption, some word changes, added update and acknowledgment]
Deepwater small-amplitude ocean waves satisfy a very simple dispersion relation (dispersion relation specifies the functional dependence of wavelength upon frequency). This dispersion relation predicts (for example) that the group velocity of deepwater small-amplitude waves is one-half of the phase velocity.
But as the ocean gets shallower, the functional form of the ocean-wave dispersion relation changes greatly, such that in limit of water depth much less than wavelength, the phase velocity and the group velocity become equal (Wikipedia provides all the details, and Willard R. Bascombe’s classic textbook Waves and Beaches is a wonderful introduction to the practical implications).
By analogy, perhaps chess-play at levels below ~2600 (say) exhibits “deep-water” dynamics, that is, the level of foresight is sufficiently short-range that the distant boundary of perfect-play (at Elo~3400) does not enter into the decision-tree dynamics.
In contrast, it is reasonable to expect that chess at levels of Elo ~2800 and higher (say), will increasingly exhibit “shallow-water” chess dynamics, that is, the boundary of perfect chess-play will increasingly dominate decision-tree dynamics.
Thus chess below Elo ~2800 is plausibly an entirely different game from chess above it, such that outcome-curves that are simply and universally logistic for deepwater chess (Elo ≲ 2800) may diverge increasingly from logistic form at higher Elo values, as high-Elo chess-play starts to “feel the bottom” of the chess-ocean at Elo ~3400.
In summary, at the highest levels of contemporary chess, both human and machine, perhaps it’s no longer a near-perfect approximation to regard the ocean of chess-understanding as infinitely deep. The next 100-200 points of Elo-gain are going to be mighty interesting! 🙂
Thanks very much—this analogy may indeed operate with a “continental shelf” in the neighborhood of 2400. I have other possible indicators for this…more later if/when I have time. In general this is what I’ve been “afraid of”—a “Turkeyzilla.”
Just out of interest (not having a chess engine installed here) what’s the reply to Rg5?
then … Ra1+ (with mate to follow)
John and all, I have added a second footnote with what happens when all drawn games are removed. The wobbling in the ratio and weirdness of the 2200–2350 data point grow.
Why must an engien’s eval be logistic for it to play well? I don’t immediately see this claim in Ban’s 2012 paper.
Logistic evals are apparently baked into some engines during self-play tuning, which may be why the fit is so good. See for example
https://chessprogramming.wikispaces.com/Texel%27s+Tuning+Method
Thanks for the source—very interesting to me. It seems mostly to date to 2014 except for a 2009 reference that doesn’t say “logistic”, whereas Ban’s paper is 2012 and my own observations from late 2013 on began with Rybka 3 which dates to 2008. As for the claim, connect together:
(i) Ban’s questions “What makes an evaluation right? In what sense can it be wrong? Between two evaluation functions, how to judge which is better?” on page 2, followed by boldfaced claims to develop a testable theory to answer them that connects to MLE and logistic regression;
(ii) his logistic equations, followed by saying they constitute an algorithm to grade the correctness of other evaluation functions, and
(iii) the demonstrated success of his approach from Deep Junior’s WCCC wins.
All this is however modulo a technical sense in which the claim is false: You can post-process the evaluations of a position’s moves by any monotone function whatever and not change their ranking. This can even be done “for show” at intermediate stages of the search. It is still the case that move comparisons will be done by an internal evaluation to which Ban’s theory applies, but the values actually transmitted for self-play games can deviate from the logistic curve at will. The Houdini 5 site used to vaunt the “calibrated evaluations” of versions 3 and 4, which I deduced constituted this kind of post-processing from the logistic non-fit observed by Tamal Biswas and myself. To test the new one on the same data I may need to see if Wine can be configured stably for batch-mode jobs.
However again, you’ve prompted the idea that this kind of baking-in (via fast-chess games) may go with one human skill level and be askew for others.
That’s what I was thinking. I believe all top engines now use this tuning method or something very similar. Are you in touch with Kaufman? You should e-mail him. He’s interested in this kind of stuff. (Or consider posting to talkchess, where he and Houdart and lead Stockfish devs hang out)