A Chess Firewall at Zero?
We halve our blunder rate when infinitesimally ahead, but why?
Crop from Seneca chess quote source |
Lucius Seneca was a Roman playwright, philosopher, and statesman of the first century. He is called “the Younger” because his father Marcus Seneca was also a famous writer. His elder brother Lucius Gallio appears as a judge in the Book of Acts. Beside many quotations from his work, Seneca is famous for one he is not known to have said:
“To err is human.”
Lest we cluck at human error in pinning down ancient quotations, the source for the following updated version is also unknown—even with our legions of computers by which to track it:
“To err is human, but to really screw things up requires a computer.”
Today I report a phenomenon about human error that is magnified by today’s computers’ deeper search, and that I believe arises from their interaction with complexity properties of chess.
I have previously reported some phenomena that my student Tamal Biswas and I believe owe primarily to human psychology. This one I believe is different—but I isolated it in full only a week ago so who knows. It is that the proportion of large errors by human players in positions where computers judge them to be a tiny fraction of a pawn ahead is under half the rate in positions where the player is judged ever so slightly behind.
The full version of what Seneca actually wrote—or perhaps didn’t write—is even more interesting in the original Latin:
Errare humanum est, perseverare autem diabolicum, et tertia non datur.
This means: “To err is human; to persevere in error is of the devil; and no third possibility is granted.” The phrase tertium non datur is used for the Law of Excluded Middle in logic. In logic the law says that either Seneca wrote the line or he didn’t, with no third possibility. We will say more about this law in an upcoming post. Amid disputes about whether human behavior and its measurement follow primarily “rational” or “psychological” lines, we open a third possibility: “complexitarian.”
State of Computer Chess
Here are some important things to know about chess and computer chess programs (called “engines”).
- Chess is hard. Finding a best move in a given position or even telling if you’re winning is a concrete case of a -hard problem.
- Computers can now slaughter the best human players on even terms using commodity hardware; the champion Komodo program recently gave substantial handicaps to US champion Hikaru Nakamura and still won.
- All leading programs work in progressively deeper rounds of search. Under the common UCI protocol they can be configured to compute a value for each possible move at each depth of search, but in the usual “Single-Line” playing mode they save time by only bounding inferior moves away from the value of their current best move.
- Engines often change their “mind” about the values of moves and which one to rank first, but their verdicts become more stable as the search deepens. The values are commonly measured in units of 0.01 called centipawns—figuratively hundredths of a pawn.
- Search depth is measured in plies meaning moves by White or Black, so depth 12 means looking ahead 6 moves for both. Programs have a basic search depth that notches up but can extend their search at any time when the critical nature of the moves warrants doing so.
- Basic depth 12 was once projected to subdue the human champion; I suspect it’s really depth 18 with today’s programs but no matter: they hit 18 in seconds and reach the high 20s and beyond in games at standard time controls. We recently discussed how closely they are knocking on perfection.
- Perfection in chess is probably a drawn outcome, which gets the value 0.00 in centipawns for both players. One way 0.00 values occur during a search is when both players must repeat the same sequence of moves—where any deviation gives that player a negative centipawn value according to the engine. The tendency of engines to give 0.00 at higher depths in positions that look crazy-complicated to human players has been remarked more and more in game commentary by grandmasters in magazines such as New in Chess.
One upshot is that depth of cogitation is solidly quantifiable in the chess setting. We have previously posted about our papers giving evidence of its connection to human thinking and error. The new phenomenon leans on this connection but we will argue that it has a different explanation.
The Phenomenon
My training sets include all recorded games in the years 2010–2014 between players rated within 10 points of the same century or half-century milepost of the Elo rating system. They range all the way from Elo 1050 to Elo 2800+, that is, from beginning adult tournament-level players to the human world champion and his closest challengers. This “milepost set” has over 1.15 million positions from 18,702 games, counting occurrences of the same position in different games but skipping the first 8 turns for both players. Each position was analyzed to depth at least 19 by both the newly-released version 7 of the number-two ranked Stockfish and last month’s version 9.3 of Komodo, using a multitude of single threads (for reproducibility) on the supercomputing cluster of the University at Buffalo Center for Computational Research.
The value of a position is the same as the value of the best move(s) in the position. In multi-line mode we can get the value of the played move directly, while in single-line mode we can if needed take the value of the next position. A value of +1.50 or more is commonly labeled a “decisive advantage” by chess software (though there are many exceptions), whereas values between -0.30 and +0.30 are considered grades of equality—at least by some chess software. Accordingly let’s call any move that drops the value by 1.50 or more a blunder. We will tabulate ranges in -0.40…+0.40 where a blunder most matters. Value exactly 0.00 gets a range to itself.
The final wrinkle is that we will use the engines’ highest-depth values to group the positions, but distinguish cases where the human player’s move was regarded as a blunder at depth 5, 10, 15, and 20. The last includes some depth-19 values. Doing so distinguishes immediate blunders like hanging your queen from subtle mistakes that require a lot of depth to expose. The data divides fairly neatly into thirds, one for “amateur” players under 2000 (387,024 positions from 6,187 games), then the 2000–2250 range (394,186 from 6,380), then 2300 and above (372,587 from 6,035). The depth-5 numbers for Stockfish 7 are “off” for reasons I demonstrated in a chess forum post; responses seem to confirm a recent bug that has lessened impact for depths 9 and higher. (See Update below.) The tables are for single-line mode (the multi-line mode results are similar including the Stockfish anomaly), and positions are analyzed sans game history so that only repetitions downstream in the search are detected.
The point is not that weaker players make more large mistakes, but rather the “Red Sea” effect at 0.00. It becomes more pronounced at the greatest depths and for the strongest players. Here are the figures for players rated 2600 and above, from 90,045 positions in 1,335 games. They are tabled in smaller intervals of 0.10 within the “equality” range -0.30 to +0.30.
Although Stockfish and Komodo have differences in their evaluation scales—happily less pronounced than they were 1 and 2 years ago—they agree that the world’s elite made six times more large errors when on the lower side of equality. This begs asking not only why but whether it is a real human phenomenon.
What Can Explain It?
There seems no possible rational explanation—indeed the sharp change at 0.00 counters the rational hypothesis discussed in this paper and toward the end of my 2014 TEDxBuffalo talk. But it is also hard to sustain a psychological explanation. If being slightly behind puts one off one’s game, why is the ratio accentuated for the top players? That this training set includes only games in which the players are evenly matched minimizes any variance from overconfidence or “fear factors.”
My proposed explanation leans on the above-mentioned engine behavior with repetitions and 0.00 values. Spot-checks affirm that the wide 0.00 bin includes many positions where the armies trade blows but fall into a repeating sequence from which neither can veer without cost. If the machine judges your position worth +0.01 or more then it places you discretely above this no-man’s zone. Any notionally big slip you make still has a good chance of being caught in this large attraction basin and hence being charged as only a small error by the computer. Whereas a slip from 0.00 or below has no safety net and falls right down.
This hypothesis comes with one large falsifiable prediction and perhaps a few others regarding games played by computers and computer-human teams as opposed to humans alone. My entire model uses no feature of chess apart from the move values supplied by engines at progressively increasing depths of search. It thus transfers to any alternating-decision game in which sufficiently authoritative move values can be computed. Some games like Go rule out repeating sequences, while others like Shogi allow them but rarely end in draws.
For any strategy game that disallows repetitions and/or has negligible draw frequency, there will be no similar “firewall at 0.00” effect on the rate of human mistakes.
Go and Shogi are just as hard as chess, so this hypothesis does not pertain to worst-case complexity. Rather it addresses heuristically complex “most-case” behavior. It also highlights a general risk factor when using computers as judges: notional human errors are necessarily being evaluated by machines but the machines are evidently not doing so with human-relevant accuracy.
How to Handle It?
This last issue is not theoretical or didactic for me. In applying my statistical model to allegations of cheating I need to project distributions of errors by unaided human players of all skill levels accurately. I have been aware of ramped-up error below and away from 0.00 since 2008, when I devised a pleasingly symmetric way to smooth it out: Take a differential dμ(x) that has value 1 at x = 0.00 but tapers off to 0 symmetrically. This says that the marginal value of a centipawn is the same whether you are ahead or behind the same amount but lessens when the (dis)advantage is large. When a player makes an error of raw magnitude e in a position of value v, charge not e but rather the integral of dμ(x) from x = v – e to x = v. The metric can be weighted to make 0.00 effectively as wide as neighboring intervals.
The idea is that if you blunder when, say, 0.50 ahead, the integral will go through the fattest part of the metric and so charge near face value. But when 0.50 behind it will go through the thinner region -0.50 to -0.50 – e and so record a lesser value for my analyzer. This “metric correction” balanced and sharpened my system in ways I could verify by trials such as those I described on this blog in 2011 here.
Privately I’ve indulged an analogy to how relativity “corrects” the raw figures from Newtonian physics. But with today’s chess programs cannonballing to greater depths, I can tell when carrying over my previous calibrations from the earlier Rybka and Houdini engines that this method is being strained. Having now many times the amount of data I previously took on commodity PCs has made the quantum jump at 0.00 so clear as to need special attention. Making an ad hoc change in my model’s coefficients for positions of non-positive value still feels theoretically uncomfortable, but this may be what the data is dictating.
Open Problems
How can you explain the phenomenon? How should regard for phenomenology—in both experimental and philosophical senses—influence the manner of handling it?
Update 2/6/16: I received a suggestion to modify one line of the Stockfish 7 code to avoid shallow-depth “move count based pruning” at nodes belonging to the current principal variation. This inserts a third conjunct && !PvNode into what is currently the second block of “step 13” in the Stockfish source file search.cpp. Re-running the data for the Elo 2600–2800 levels produced the following table:
Evidently the change greatly reduces the depth-5 “noise” but still leaves it distinguishably higher than the depth-5 figures for Komodo 9.3.
This is an interesting phenomenon and a neat question. I don’t know that I have a good hypothesis for explaining it.
Thanks, JN. For some footnotes:
Here’s another notable occasion in which Seneca the Younger referred to chess (at least how chess was in his day—and it may have been checkers) in a source I can pinpoint: “He [Julius Canus, a prisoner condemned by Julius Caesar] was playing chess when the centurion who was dragging off a whole company of victims to death ordered that he also be summoned. Having been called, he counted the pawns and said to his partner: ‘See that after my death you do not claim falsely that you won’; then nodding to the centurion, he said: ‘You will bear witness that I am one pawn ahead.’ Do you think that at that board Canus was playing a game? Nay, he was making game!” At least execution assured that Canus wouldn’t make a 1.50 blunder…
There’s also this: “It would be tedious to mention all the different men who have spent the whole of their life over chess or ball(-playing) or the practice of baking their bodies in the sun…” [He then segues to his main point bewailing academic time spent on “useless literary problems”.]
What about chess programs, do they also blunder more in a slight advantage? You could analyze their older games too to understand whether this phenomenon is really psychological or rather a feature of the game.
Inspired by Kolmogorov complexity and 1k demos: Are there computer chess competitions where the engines have access to the same hardware, but their source code size is limited to be small? Say, could a 2kbyte source code chess engine beat a human grandmaster on current hardware? Or are hundreds of domain-specific heuristics and optimizations still necessary for good results?
Domotorp—a good question; I have it in mind to pursue. Partial answers are charted in this paper which includes the rational hypothesis mentioned in the post. The low overall blunder rate may make it hard to get enough data, though.
Dániel, I don’t know exactly, but last September the Daily Dot’s “The Kernel” ran an article about goings-on there.
Thanks for the link! But it’s not exactly what I had in mind. I’m thinking, could 3 pages of clever pseudocode, when implemented, beat a human champion? Or are opening libraries, pruning heuristics, machine-learned scoring rules and such still necessary for machine supremacy?
Yet another way of looking at it: If you have a generic game-agnostic game-playing engine, and all you are allowed to do is to plug the rules of chess into it, not even allowing reinforcement learning, how good would it be on current hardware?
I know it’s not Ken’s speciality, so if anyone else could chime in… I haven’t even found links to game-agnostic game-playing engines, but they must exist, I’m sure.
The points that this Gödel’s Lost Letter raises are well-illustrated in chess analyst Daniel King’s account of yesterday’s game between Loek Van Wely (Elo 2640, playing white) and world #1 Magnus Carlsen (Elo 2844, playing black).
In essence, Carlsen sought to maximize the tension in the position — accepting an objective disadvantage, and even sacrificing a piece, to do so — in order to create opportunities for Van Wely to make mistakes. Sure enough, in the latter half of the game and in time-trouble, Van Wely made several inaccurate moves; his position crumbled, and Carlsen won.
Would Carlsen have played more conservatively, against a higher-rated opponent? Maybe! And to play in Carlsen’s human, swashbuckling, error-catalyzing style against an unerring modern-day computer program would be fatal.
The point Carlsen’s move-evaluations against human opponents like Van Wely may well be different from his move-evaluations against computers. Here are more opportunities for chess research!
Thank you for yet another thought-provoking GLL chess essay.
PS At this very minute (14:34 UTC/GM, Friday, Jan. 22, 2016) the game Carlsen-Tomashevsky at Wijk an Zee/Tata Steel is following the same pattern.
You beat me to saying it!🙂 Here is an example line if you click on White’s 17.Rf1 in the ChessBomb window here and then click on the response 17…fxe3 at lower left:
(17… fxe3 18. Rxf6 exd2 19. Rxd2 Qd8 20. Rf4 Qg5 21. Rg4 Qh6 22. dxc5 f6 23. Nd7 Rf7 24. Qxe6 Qxd2 25. Nxf6+ Kf8 26. Nh7+ Kg8 27. Nf6+)
Crazy but drawn by repeated checks.
Hi Ken–
Here’s a question for you: have you tried evaluating this factoring for trends? I’ve had a pretty fair number of games where I was worse, and when I managed to achieve a slight advantage my opponents fell apart. My opponents’ blunders have been fewer, it seems, when their slight disadvantage was of a more enduring sort. Along these lines, I vaguely recall Yermolinsky writing about the effects of evaluation trends and how people fare under them in his *The Road to Chess Improvement*.
It’s said that in bad positions there are no good moves. What’s the relationship between the number of moves which maintain the current evaluation within some bound and the evaluation? It might be that +0.1 is significantly more tolerant to e.g fourth best than -0.1, even if you factor out the 0.00 safety net (I guess you can restrict to positions in which the top 4 evals don’t contain 0.00). I mention fourth best because I think I remember Anand saying that Carlsen rarely does anything amazing but almost always plays one of the top 4 computer moves, so that would seem to be a reasonable model for understanding top level variation in error.
If it’s a psychological phenomenon, it could be that if your goal is to win with highest probability (rather than to “lose by the minimum possible amount”), then when losing, it’s rational to make unusual moves which it’s harder for you to evaluate, on the theory they’ll also be harder for your opponent to evaluate, thus introduce some randomness, thus perhaps get you out of losing. (But of course you can easily make mistakes when doing that.) Whereas if you’re ahead, you should play conservatively and stay ahead. This theory (which I only suggest, not claim) would predict the same phenomenon in all kinds of human-human games where the goal is to win (i.e. most of them), if they’re complex enough to admit moves with hard-to-predict effects.
Dear Professor Regan,
We pretend to show the answer of the P versus NP problem. We start assuming the hypothesis of P = NP that will lead us to a contradiction. The argument is supported by a Theorem that states if P = NP, then the problem SUCCINCT HAMILTON PATH would be in P.
In this Theorem, we take an arbitrary succinct representation C of a graph G_{C} with n nodes, where n = 2^{b} is a power of two and C will be a Boolean circuit of 2 * b input gates. Interestingly, if C is a “yes” instance of SUCCINCT HAMILTON PATH, then there will be a linear order Q on the nodes of G_{C}, that is, a binary relationship isomorphic to “<" on the nodes of G_{C}, such that consecutive nodes are connected in G_{C}.
This linear order Q must require several things:
* All distinct nodes of G_{C} are comparable by Q,
* next, Q must be transitive but not reflexive,
* and finally, any two consecutive nodes in Q must be adjacent in G_{C}.
Any binary relationship Q that has these properties must be a linear order, any two consecutive elements of which are adjacent in G_{C}, that is, it must be a Hamilton path.
On the other hand, the linear order Q can be actually represented as a graph G_{Q}. In this way, the succinct representation C_{Q} of the graph G_{Q} will represent the linear order Q too. We can define a polynomially balanced relation R_{Q}, where for all succinct representation C of a graph: There is another Boolean circuit C_{Q} that will represent a linear order Q on the nodes of G_{C} such that (C, C_{Q}) is in R_{Q} if and only if C is in SUCCINCT HAMILTON PATH.
We finally show R_{Q} would be polynomially decidable if P = NP. For this purpose, we use the existential second-order logic expressions, used to express this graph-theoretic property (the existence of a Hamilton path), that are described in Computational Complexity book of Papadimitriou: Chapter "5.7 SECOND-ORDER LOGIC" in Example "5.12" from page "115". Indeed, we just simply replace those expressions with Universal quantifiers into equivalent Boolean tautologies formulas.
Certainly, a language L is in class NP if there is a polynomially decidable and polynomially balanced relation R such that L = {x: (x, y) in R for some y}. In this way, we show that SUCCINCT HAMILTON PATH would be in NP if we assume our hypothesis, because the relation R_{Q} would be polynomially decidable and polynomially balanced when P = NP. Moreover, since P would be equal to NP, we obtain that SUCCINCT HAMILTON PATH would be in P too.
But, we already know if P = NP, then EXP = NEXP. Since SUCCINCT HAMILTON PATH is in NEXP–complete, then it would be in EXP–complete, because the completeness of both classes uses the polynomial-time reduction. But, if some EXP–complete problem is in P, then P should be equal to EXP, because P and EXP are closed under reductions and P is a subset of EXP. However, as result of the Hierarchy Theorem the class P cannot be equal to EXP. To sum up, we obtain a contradiction under the assumption that P = NP, and thus, we can claim that P is not equal to NP as a direct consequence of applying the Reductio ad absurdum rule.
You could see the details in:
https://hal.archives-ouvertes.fr/hal-01249826v3/document
Professor, I propose you investigate a “complex middlegame” position of PC vs human, and see, when there are lots of possible moves to be played (more than the average 30 or so legal moves per position) whether, as I suspect, the PC blunders less than the human. Perhaps in this study you can ‘normalize’ the games such that they are ranked by number of legal moves, and see if the human blunders arise out of the fact that there are “too many moves for the human to consider” (whether or not ahead or behind in evaluation). BTW complex middlegames arise out of certain openings, for example the Slav defense to 1.d4 d5 2. c4 c6 is notorious for complex middlegames, and I’m sure some opening experts can advise as to others.
That’s a good and workable idea. Indeed this week we discussed how to rigorize my spotcheck-assessment that a lot of the positions valued exactly 0.00 are really complicated.
PCs do indeed blunder less than humans—you can find some further ramifications of that in the 2014 paper. That may make it hard to gather enough data on them…
Yes, thank you, I saw that paper even before your post. I also misread slightly your original post, thinking it was man vs machine, but it doesn’t change my answer much.
So just to be clear, my proposal is testing ‘what causes the fear factor of players choking when slightly behind in evaluation’ that your paper is referring to. The hypothesis to be tested (null hypothesis) is that there is no difference in what the paper found (that players slightly negative in the evaluation function of a chess position tend to make more blunders than their opponent) when the position is “complex” (more than ~30 legal moves) as opposed to when the position is “simple”. If so, then the null hypothesis is correct. If however a difference is found, and the ‘complex’ positions show this ‘fear factor’ but simple positions don’t, then you might be onto something. An ‘explanation’ might be: in a complex position, the side ahead has many ways to win, while the side behind has to find ‘the key defensive move’ to thwart the ‘many ways to win’, which is much harder to do. Or, conversely, if the ‘simple’ positions show a ‘fear factor’ at play, but complex ones don’t, then the ‘explanation’ might be that the player behind in the evaluation, that plays poorly, is simply psychologically impaired, since logically in a simple position they should find the correct defense more easily. Again, if no difference is found between the two groups, then the hypothesis that simple vs complex matters is bogus. Just an idea. Good luck!
The phenomenology varies, but in other domains “cognitive friction” arises as one falls behind. Questions abound. What precipitated this turn of events? Was it a stupid error? Why did I not foresee this? Can I recover, or am I just not good enough? Domain-specific threats! More! The “friction” can cascade into error as fluency tends to decrease with the increasing cognitive load.
This phenomenon is more accentuated for better players. Perhaps it is because the better you are, the better you are at assessing the position (probably the better you are at assessing the position in accordance with computer valuations). Then the better you are, the more will you be able to see that you are in a worse position. Clumsier players tend to soldier on, thinking that the position is equal for a longer time than the position merits. So better players would be more influenced by psychological factors in (slightly) worse positions. Hard to tell if this would be a full explanation of the phenomenon.
Prof Regan is right to focus our attention on falsifiable hypotheses. His proposed explanation about the “attraction basin” seems compelling; to make a falsifiable hypothesis out of it, we might rate positions not using the engine’s own equity valuation, but instead the Expected Value in terms of expected points to be won or lost.
Thus, making a blunder that brings you from +1.0 to +0.0 may be a loss of only one point of equity, but what did it do to your expected number of points? Depending on the game situation and the level of the players, the effect may be massive.
If an error changes the game expectations from “75% win, 20% draw” to “1% win, 99% draw”, that is a loss of 0.345 game points. When positions are moved onto that scale, some or all of the observed “firewall” of zero may go away.
Of course this is only conjecture until we have a function that could reliably map game situation and player rating to expected win probability.
David, thanks for the perspicacity. The graph of y = game expectation versus x = position evaluation follows a beautiful logistic curve, which you can see plotted in last fall’s “Depth of Satisficing” post, including the effect of rating – opponent’s rating. The diagram comes from the RBZ14 paper which has other plots showing some but minimal noise at 0.00—nothing like the above.
So in that sense, there is almost no “expectation effect” at 0.00. But this really underscores the disconnect to what happens with large mistakes and the per-move error rate. More of the disconnect is that the curve retains much of its symmetry even when it is plotted only including values for the player to move (see the RBZ14 paper for that and a human disparity that is absent in computer play). Our efforts so far to use the curve as a metric by which to correct the issue haven’t worked as well as what the post describes.
I don’t, honestly, know a ton about chess, the quirks of how engines rate, and the other papers or data pointed to in the comments. So, may be nonsense ahead.
For 0.00 itself the draw explanation makes some sense to me. It seems plausible that computers would not infrequently spot draws that even bright humans miss; finding a draw seems like it’s fundamentally a search problem. (Kasparov missed a draw vs. Deep Blue, right?)
For the +/- difference, another factor that occurs to me is selection effects–the weaker player in a given game will tend to already be in the worse position when you evaluate, and also more likely to make mistakes. That’s true even if “weaker” is just “having a bad day” and not worse on average over their lifetime; as long as there’s some correlation between a player’s performance earlier in the game and later, you’d expect more blunders in negative positions.
“Chess is hard. Finding a best move in a given position or even telling if you’re winning is a concrete case of a {\mathsf{PSPACE}}-hard problem.”
Here Ken must be referring to some extrapolation of chess involving chessboards of arbitrarily large finite size.
However, the statement can make sense for actual chess. This involves a robust choice of a finite model of computation. But it seems daunting to actually be able to prove such finite hardness.
Thanks for the interesting post and study.
I have a potential explanation for the larger amount of blunders at 0.00 that I haven’t seen mentioned yet.
My suspicion is that the blunders in the 0.00 category are mostly made in positions where the computer can calculate until the end and has or thinks it has (almost) absolute knowledge, i.e. endgame positions. In this case, for a computer there are only really 3 evaluations: lost, draw (hence the 0.00 evaluation) or win. So any mistake is immediately qualified in your study as a blunder.
In middlegame positions the computer has no absolute knowledge yet, and mistakes (like the loss of a pawn) are not qualified as blunder per se, even at great depth, though they might well be losing in the absolute all-knowing sense.
This can be tested by checking if there is a correlation between amount of material left and blunder rate in this and the other categories, or amount of material versus evaluation (thesis: less material for 0.00 positions).
For the larger blunder rate when at a disadvantage, I agree with your observations of safety net.
I’ve thought about this, but my spot-checks have turned up lots of complicated middlegame positions. I haven’t yet applied a filter for certain kinds of endgame positions.