For chess and science: a cautionary tale about decision models

 Clarke Chronicler blog source

Marmaduke Wyvill was a British chess master and Member of Parliament in the 1800s. He was runner-up in what is considered the first major international chess tournament, London 1851, but never played in a comparable tournament again. He promoted chess and helped organize and sponsor the great London 1883 chess tournament. Here is a fount of information on the name and the man, including that he once proposed marriage to Florence Nightingale, who became a pioneer of statistics.

Today we use Wyvill’s London 1883 tournament to critique statistical models. Our critique extends to ask, how extensively are models cross-checked?

London is about to take center stage again in chess. The World Championship match between the current world champion, Magnus Carlsen of Norway, and his American challenger Fabiano Caruana will begin there on November 9. This is the first time since 1972 that an American will play for the title. The organizer is WorldChess (previously Agon Ltd.) in partnership with the World Chess Federation (FIDE).

The London 1883 tournament had two innovations. It was the first to use chess clocks. The second was that in the event of a game being drawn, the players had to play another game, twice if needed. Only after three draws would the point be considered halved, and this happened only seven times in 182 meetings. Chess clocks have been used in virtually every competition since, but the second experiment has never been repeated—the closest to it will come next year. Two scientific imports were that the time to decide on a move was regulated and games without critical action were set aside.

## Chess and Science

Chess has long been considered a (or “the”) “game of science.” It has been the focus of numerous scientific studies. Here we emphasize how it is a copious source of scientific data. Millions of games—every top-level game and nowadays many games at lower levels—have been preserved in databases. Except that the time taken by players to choose a move at each turn is recorded only sporadically, we have easy access to full information about each player’s choices of moves.

What we also have now is authoritative judgment on the true values of those choices via analysis by strong computer chess programs. Those programs, called “engines,” can beat even Carlsen and Caruana with regularity, so we humans have no standing to doubt their judgments. The programs’ values for moves are a robust quality metric, and correlate supremely with the Elo Rating, which provides a robust skill metric.

The move values are the only chess-specific input to my statistical choice model. I have covered it several times before, but not yet in the sense of going “back to square one” to say how it originated—where it fits among decision models.

This year I have overhauled the model’s 28,000+ lines of C++ code. More exactly I have “underhauled” it by chopping out stale features, removing assumptions, and simplifying operations. I widened the equations to accommodate multiple alternative models and fitting methods, besides the ones I’ve deployed to judge allegations of cheating on behalf of FIDE and other chess bodies. The main alternative discussed here is one I did already program and reject nine years ago, but having recently tried multiple other possibilities reinforces the points about models that I am making here. So let’s first see one general form of decision model and how the chess application fits the framework.

## The Multinomial Logit Model

The general goal is to project the probabilities ${p}$ of certain decision choices or event outcomes ${m}$ in terms of data ${X}$ about the ${m}$ and attributes ${S}$ of the decision makers. An index ${i}$ can refer to multiple actors and/or multiple situations; we will suppress it when intent is clear. The index ${j}$ refers to multiple alternatives ${m_j}$ (${j = 1,\dots,J}$) in any situation and their probabilities ${p_j}$. The goal for any ${i}$ is to infer ${(p_j) = \{p_{i,j}\}}$ as a function ${G}$ of ${X = \{X_{i,j}\}}$, ${S_i}$, and internal model parameters ${\vec{\beta}}$. The ${G}$ function is “the” model.

The models we consider all incorporate into ${G}$ a function that takes ${X}$ and ${S_i}$ and outputs quantities ${u_{i,j}}$—which we will speak of as single numbers but which could be vectors over a separate index ${k}$. In many settings, ${u_{i,j}}$ this represents the utility of outcome ${j}$ for the actor or situation ${i}$, which the actor wants to maximize or at least gain enough of to satisfy needs. Insofar as ${u_{i,j}}$ depends on ${S_i}$ it is distinct from a neutral notion of “objective value” ${v_{i,j}}$. Such a distinction was already observed in the early 1700s.

The multinomial logit model, and log-linear models in general, represent the logarithms of the probabilities as linear functions of the other elements. Using the utility function this means setting

$\displaystyle \log(p_j) = \alpha + \beta u_j, \ \ \ \ \ (1)$

where we have suppressed ${i}$ and ${\beta u_j}$ could be multiple linear terms ${\beta_1 u_{j,1} + \cdots + \beta_k u_{j,k}}$. This makes

$\displaystyle p_j = e^{\alpha + \beta u_j} \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$

for all ${j}$. Then ${\alpha}$ becomes a normalization constant to ensure that the probabilities sum to ${1}$, dropping out to give the final equations

$\displaystyle p_j = \frac{e^{\beta u_j}}{\sum_{\ell=1}^J e^{\beta u_\ell}}. \ \ \ \ \ (3)$

Fitting ${\beta}$ thus yields all the probabilities. Note that putting a difference of probabilities ${\log(p_1) - \log(p_i)}$ on the left-hand side of (1), which is the log of the ratio of the probabilities, leads to the same model and normalization (up to the sign of ${\beta}$). The function of normalizing exponentiated quantities is so common it has its own pet name, softmax.

These last three equations were known already in 1883 via the physicists Josiah Gibbs and Ludwig Boltzmann, with ${\beta}$ coming out in units of inverse temperature, the denominator of (3) representing the partition function of a physical system, and the numerator the Boltzmann factor. It seems curious that apart from some contemporary references by Charles Peirce they were not used in wider contexts until the World War II era. Equation (3) essentially appears as equation (1) in the 2000 Economics Nobel lecture by Daniel McFadden, who calls it “the” multinomial logit model (see also this) and traces it to work by Duncan Luce in 1959. Such pan-scientific heft makes its failure in chess all the more surprising.

## The Chess Case

In chess tournaments we have multiple players, but only one is involved in deciding each move. So we focus on one player but can treat multiple players as a group. Instead what we represent with the ${i}$ index is the player(s) facing multiple positions. To a large extent we can treat those decisions as independent. Even if the player executes a plan over a few moves the covariance is still sparse, and often players realize they have to revise their plans on the next turn. Thus we replace ${i}$ by an index ${t}$ signifying “turn” or “time” for each position faced by the player. Clearly we want to fit by regression over multiple turns ${t}$, so ${\beta}$ and any other fitted parameters will not depend on ${t}$, which again we sometimes suppress.

Each possible move ${m_j}$ at each turn ${t}$ is given a value ${v_{t,j}}$ by the chess engine(s) used to analyze the games. We order the moves by those values, so the engine’s first-listed move ${m_1}$ has the optimal value ${v_1}$. In just over 90% of positions there is a unique optimal move. There are four salient ways to define the utility ${u_j = u_{t,j}}$ from these values—prefatory to involving model parameters describing the player:

• (a) Equate ${u_j}$ with the move’s value ${v_j}$.

• (b) Equate it with the win expectation ${e_j}$ corresponding to ${v_j}$, which I described along with its “sliding-scale” issues in this recent post.

• (c) Use the loss in value ${v_1 - v_j}$ compared to an optimal move.

• (d) Use the loss in expectation ${e_1 - e_j}$ instead.

Option (d) automatically scales down differences in positions where one side is significantly ahead. The same small slip that would halve one’s chances in a balanced position might only reduce 90% to 85% in a strong position or be nearly irrelevant in a losing one. I remove the most extremely unbalanced positions from samples anyway. For (c) I use a “non-sliding” scale function ${\delta(v_1,v_j)}$ whose efficacy I detailed here but I can easily generate results without it. Note that if I were to cut the sample down only to balanced positions—${v_1}$ near ${0.00}$—of which there are a lot, then (a) and (b) become respectively equivalent to (c) and (d) anyway, up to signs which are handled by flipping the sign of ${\beta}$.

My primary model parameter, called ${s}$ for “sensitivity,” is just a divisor of the values and so gets absorbed by ${\beta}$. I have a second main parameter ${c}$ for “consistency” but more on it later. Having ${s}$ is enough to fill the dictates of the multinomial logit model in the simplest manners.

Criteria for fitting log-linear models are also a general issue. For linear regressions, least-squares fitting is distinguished by its being equivalent to maximum-likelihood estimation (MLE) under Gaussian error, but ${L_1}$ or other ${L_p}$ distance can be minimized instead. With log-linear regressions the flex is wider and MLE competes with criteria that minimize various discrepancies between quantities projected from the fitted probabilities ${p_{t,j}}$ and their actual values in the sample. Here are four of them—we will see more:

1. The frequency of playing (i.e., “matching”) the move ${m_1}$.

2. The frequency of playing ${m_1}$ or another move of equal-optimal value (for the 10% of positions that have them).

3. The total “loss of Pawns,” whether scaled as ${\delta(v_1,v_j)}$ or left unscaled as ${v_1 - v_j}$.

4. The total loss of expectation ${e_1 - e_\ell}$ between the optimal move and the played move ${m_\ell}$.

The reason for the first three in particular is that they create my three main tests for possible cheating, so I want to fit them on my training data (which now encompasses every rating level from Elo 1025 to Elo 2800 in steps of 25) to be unbiased estimators. Besides those and MLE—which here means maximizing the projected likelihood of the moves that were observed to be played (or alternately the likelihood of the observed ${m_1}$-match/non-match sequence and various others)—my code allows composing a loss function from myriad components and weighting them ad-lib. Components unused in the fitting become the cross-checks.

Ideally, we’d like all the fitting criteria to produce similar fits—that is, close sets of fitted values for ${\beta}$ and other parameters on the same data. Finally, the code implements other modeling equations besides multinomial logit—and we’d like their results to agree too. But let’s first see how multinomial logit performs.

I analyzed the 168 official played games of London 1883 (one competitor left just after the halfway point), and separately, the 76 rejected draws, using Stockfish 7 to high depth on single threads furnished by UB’s Center for Computational Research (CCR). The former give 10,289 analyzed game turns after applying the extreme-value cutoff and a few others. Using the simple unscaled version (c) of utility and fitting ${\beta \equiv 1/s}$ according to ${m_1}$ matching gives these results for the first three fitting criteria:

          Test Name       ProjVal   Actual   Proj%  Actual%  z-score
MoveMatch       4871.02  4871.00  47.34%  47.34%  z = -0.00
EqValueMatch    5228.95  5201.00  50.82%  50.55%  z = -0.73
ExpectationLoss  259.20   297.34  0.0252  0.0289  z = -10.58


This is actual output from my code, except that to avoid crowding I have elided some columns including the standard deviations on which the ${z}$-scores are based. The ${z}$-scores give a uniform way to judge goodness of fit. The first one is exactly zero because that was the criterion expressly fitted. The fitted model generates a projection for the second one that is higher than what actually happened at London 1883, but only slightly: the ${z}$-score is within one standard deviation. The third, however, is under-projected by more than 10 standard deviations. In absolute terms it doesn’t look so bad—259 is only 13% smaller than 297—but the large ${z}$-score reflects our having a lot of data. Well, there’s large and there’s huge:

           Test Name       ProjVal   Actual  Proj%  Actual%  z-score
AvgDifference   1493.46  2780.07  0.145  0.270  z = -60.9638


The projection is only half what it should be. The ${z}$-score is inconceivable.

For more cross-checks, there are the projected versus actual frequency of playing the ${k}$-th best move for ${k > 1}$. Here is the table for ranks 1–10:

           Rk  ProjVal  Actual   Proj%  Actual%  z-score
1  4871.02  4871.00  47.34%  47.34%  z = -0.00
2  1416.72  1729.00  13.80%  16.85%  z = +9.44
3  761.84    951.00   7.47%   9.32%  z = +7.33
4  523.25    593.00   5.19%   5.88%  z = +3.20
5  401.63    410.00   4.03%   4.11%  z = +0.43
6  325.30    295.00   3.29%   2.99%  z = -1.73
7  272.12    247.00   2.77%   2.51%  z = -1.57
8  232.05    197.00   2.37%   2.01%  z = -2.36
9  200.88    169.00   2.06%   1.73%  z = -2.30
10  175.95    104.00   1.81%   1.07%  z = -5.54


The first row was the one fitted. Then the projections are off by three percentage points for ${m_2}$ and almost two for ${m_3}$. For ranks 5–9 they are tantalizingly close but by ${m_{10}}$ they have clearly overshot—as they must for the probabilities to add to 1.

There are yet more cross-checks of even greater importance. They are the frequency with which players make errors of a given range of magnitude: a small slip, a mistake, a serious misstep, a blunder. Those results are too gruesome to show here. Fitting by MLE helps in some places but throws off the ${m_1}$ fit entirely.

The huge gaps in these and especially in the “AvgDifference” test (AD for short) rule out any patch to the log-linear model with one ${\beta}$. I have tried adding other linear ${\beta_k}$ terms representing features such as a move ${m_j}$ turning an advantage into a disadvantage (${v_1 > 0}$ but ${v_j < 0}$). They give haywire results unless the nonlinearity described next is introduced.

## Log-Linear With Revised Utility: Still Bad

This is to define the utility function using a new parameter ${c}$ as

$\displaystyle u_j = \delta(v_1,v_j)^c.$

Without scaling this is just ${(v_1 - v_j)^c}$; one can also use ${(e_1 - e_j)^c}$ or put the power on the values separately. In forming ${\beta u_j}$ it does not matter whether the fitted value is represented as ${\beta}$ or as ${\beta^c}$. Using the notation ${\beta = \frac{1}{s}}$, so that ${s}$ divides out the “pawn units” of ${v_1}$ and ${v_j}$, this means that without loss of generality we can write

$\displaystyle u_j = \left(\frac{\delta(v_1,v_j)}{s}\right)^c.$

This makes clear that the quantity being powered is dimensionless. The motivation for ${c}$ is that in any quantity of the form ${(\delta/s)^c}$, the marginal influence of ${c}$ becomes greater for large ${\delta}$ than that of ${s}$. Thus ${c}$ can be said to govern the propensity for making large mistakes while ${s}$ governs the perception of small differences ${\delta}$ in value. Higher ${c}$ and lower ${s}$ correspond to higher skill. The former connotes the ability to navigate tactical minefields, the latter strategic skill of amassing small advantages.

Thus I regard ${c}$ as natural in chess. In my results, ${c}$ usually fits with values going up from below ${0.45}$ to about ${0.55}$ as the Elo level increases. This is in the rough neighborhood of square-root and definitely apart from ${c=1}$. It also changes the calculus on a property called “independence from irrelevant alternatives,” which McFadden cites from Luce but has issues discussed e.g. here.

Since ${c}$ is part of the revised utility function, the model is still log-linear in the utility and the probabilities are still obtained via the procedure (1)–(3). The end-product is that having ${c}$ allows fitting two criteria exactly to yield them as unbiased estimators. Here are the results of fitting ${m_1}$ and AD in what is now a “log-radical(${c}$)” model:

           Test Name       ProjVal   Actual   Proj%  Actual%   z-score
MoveMatch       4870.99  4871.00  47.34%  47.34%  z = +0.00
AvgDifference   2780.10  2780.07  0.2702  0.2702  z = +0.00
EqValueMatch    5261.35  5201.00  51.14%  50.55%  z = -1.44
ExpectationLoss  413.42   297.35  0.0402  0.0289  z = +15.89


The equal-optimal projection remains OK. The expectation loss, however, flips from an under-projection to a vast over-projection. The cross-checks from the move ranks give further bad news:

           Rk  ProjVal  Actual   Proj% Actual%  z-score
1  4870.99  4871.00  47.34% 47.34% z =  +0.00
2  1123.22  1729.00  10.94% 16.85% z = +19.88
3  633.30   951.00   6.21%  9.32%  z = +13.27
4  459.83   593.00   4.56%  5.88%  z =  +6.44
5  370.58   410.00   3.72%  4.11%  z =  +2.11
6  311.98   295.00   3.16%  2.99%  z =  -0.99
7  270.56   247.00   2.75%  2.51%  z =  -1.46
8  239.36   197.00   2.44%  2.01%  z =  -2.79
9  214.30   169.00   2.19%  1.73%  z =  -3.15
10  193.93   104.00   1.99%  1.07%  z =  -6.57


The discrepancy in the second-best move has doubled to six percentage points while the third-best move is off by more than three.

Maximum-likelihood fitting makes the gaps even worse. No re-jiggering of fitting methods nor the formula for ${u_{t,j}}$ comes anywhere close to coherence. Inconsistency in the second-best move kills everything. The fault must be tied all the way to the log-linear model for the probabilities.

## Two Logs Good

As we have noted, taking the difference of logs, and inverting so that signs stay positive like so:

$\displaystyle \log(1/p_j) - \log(1/p_1) = \alpha + \beta u_j,$

does not change the model. The likelihoods ${e^{\alpha + \beta u_j}}$ are still normalized arithmetically as in the Gibbs equations. Taking a difference of double logarithms, however, yields something different:

$\displaystyle \begin{array}{rcl} \log\log(1/p_j) - \log\log(1/p_1) &=& \alpha + \beta u_j\\ \implies \frac{\log(1/p_j)}{\log(1/p_1)} &=& \exp(\alpha + \beta u_j)\\ \implies \log(p_j) &=& (\log(p_1))\exp(\alpha + \beta u_j)\\ \implies p_j &=& p_1^{\exp(\alpha + \beta u_j)}. \end{array}$

With the utility ${u_j}$ still defined as ${\delta(v_1,v_j)^c}$ this creates a triple stack of exponentials on the right-hand side. This all looks really unnatural, but see the results it gives, now also showing the interval and large-error tests that were “too gruesome” before:

           Test Name       ProjVal   Actual   Proj%  Actual%  z-score
MoveMatch       4871.02  4871.00  47.34%  47.34% z = -0.00
AvgScaledDiff   1142.61  1142.59   0.111   0.111 z = +0.00
EqValueMatch    5251.90  5201.00  51.04%  50.55% z = -1.10
ExpectationLoss  333.20   334.46  0.0324  0.0325 z = -0.19

Rk ProjVal  Sigma    Actual  Proj% Actual%  z-score
1  4871.02  47.02  4871.00 47.34% 47.34%  z = -0.00
2  1786.89  37.32  1729.00 17.41% 16.85%  z = -1.55
3   929.87  28.60   951.00  9.11%  9.32%  z = +0.74
4   589.93  23.29   593.00  5.85%  5.88%  z = +0.13
5   419.35  19.84   410.00  4.21%  4.11%  z = -0.47
6   315.24  17.32   295.00  3.19%  2.99%  z = -1.17
7   246.68  15.39   247.00  2.51%  2.51%  z = +0.02
8   198.71  13.85   197.00  2.03%  2.01%  z = -0.12
9   161.54  12.52   169.00  1.65%  1.73%  z = +0.60
10   134.18  11.43   104.00  1.38%  1.07%  z = -2.64
11   111.41  10.43    97.00  1.15%  1.00%  z = -1.38
12    93.90   9.59    99.00  0.97%  1.02%  z = +0.53
13    77.94   8.75    76.00  0.81%  0.79%  z = -0.22
14    65.40   8.02    78.00  0.68%  0.82%  z = +1.57
15    55.13   7.37    62.00  0.58%  0.65%  z = +0.93

Selec. Test ProjVal  Actual  Proj%  Actual%  z-score
Delta01-10   656.08  645.00  6.38%  6.27%  z = -0.56
Delta11-30   800.75  824.00  7.78%  8.01%  z = +1.02
Delta31-70   596.51  607.00  5.80%  5.90%  z = +0.50
Delt71-150   295.70  290.00  2.87%  2.82%  z = -0.38
Error>=050   709.46  675.00  6.90%  6.56%  z = -1.55
Error>=100   331.88  300.00  3.23%  2.92%  z = -2.02
Error>=200   141.56  114.00  1.38%  1.11%  z = -2.62
Error>=400    68.76   35.00  0.67%  0.34%  z = -4.61


Only the first two lines have been fitted. The other lines follow like obedient ducks—and this persists through all tournaments that I have run.

There are some wobbles that also persist: The second-best move is somewhat over-projected and the third-best move slightly under—but the remaining indices are off by small amounts whose signs seem random. So are the interval tests at the end, except that large errors are over-projected. The match to moves of equal-optimal worth tends to be over-projected regardless of the patch described here. Nevertheless, the overall fidelity under so much cross-validation is an amazing change from the log-linear cases.

## Open Problems

The most particular issue I see grants that the original log-linear formulation could be fine for a one-shot purpose, say if the match-to-${m_1}$ cheating test were the only thing cared about. The concern is that in the absence of validation beyond what is needed for that, “mission creep” could extend the usage unknowingly into flawed territory. It is important to me that a model should score well on a larger slate of pertinent phenomena. Do other models have as rich a field of data and cross-checks as in chess?

Is there extensive literature on modeling the double logarithms of probabilities—and on representing probabilities as powers rather than multiples of the “pivot” ${p_1}$? We have seen scant references. The term “log-log model” instead refers to having a logarithm on both sides, e.g., ${\log(p_j) = \alpha + \beta\log(x_j)}$. Alternatives to log-linear models need to be more conscious of error terms in the utility functions, so perhaps uncertainty needs a more express representation in my formulas.

The form where ${p_j = p_1^{L_j}}$ does have the general issue that when ${p_1}$ should be very close to 1—as for a completely obvious move in chess—there is strain on getting the exponents ${L_j}$ large enough to make ${p_j = p_1^{L_j}}$ tiny. The over-projection of large errors (${p_j}$ too high) is a symptom of this. Some of my past posts give my thinking on this, but the implementations have been hard to control, so I would be grateful to hear reader thoughts.

1. October 19, 2018 7:15 am

See my proof of P = NP in