An approach to solving chess endgames using fast sorts

Hector Garcia-Molina is one of the top experts in databases—taken in a wide sense. He has done seminal research in almost all aspects of systems that manage data: from classic databases to web-based information systems; from the semantics of data to the security of data. If you are interested in any aspect of data, you will probably have to read one of Hector’s papers; you will also probably have to reference him.

Today I want to recall a project Hector and I ran, in the mid-1980s, while we were both at Princeton University. The project was called ${M^{3}}$—the Massive Memory Machine Project. The main reason for discussing this old project is related to the recent post on chess endgames.

The goal of the ${M^{3}}$ project was to explore the power of large amounts of main memory. Our thesis was that even “slow” processors with huge amounts of main memory could beat parallel processors, provided the slower processors had much more memory than the parallel ones.

The project bought and constructed machines that had gigabytes of random access memory. In those days typical machines had at most a few megabytes of memory, so getting machines with gigabytes was non-trivial. A simple problem we encountered was that the standard machines did not even have enough address space to get into the gigabyte range. We got around these problems, but the machines cost us many millions of dollars to build.

It is a bit funny to recall that we paid millions of dollars for this amount of memory, when today you can buy it for tens of dollars—and then put it in your pocket. But I guess being over two orders of magnitude bigger than all other machines had its price. It would be like having a machine with petabytes today.

Hector is a pleasure to work with and I enjoyed our joint work very much. He is a tireless researcher. Two measures prove this: he has almost 400 published papers, and, even more impressive, he has an H-score near 90. The H-score is claimed to measure research output more accurately than total citations alone. You have an H-score of ${\ge h}$ provided you have written papers

$\displaystyle P_{1}, \dots, P_{h}$

and each of these papers has at least ${h}$ citations. The point of the H-score is to factor out the effect of one huge paper with a “zillion” citations.
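The definition above translates directly into a few lines of code; this is a generic sketch of the computation, with made-up citation counts for illustration:

```python
def h_index(citations):
    """Largest h such that at least h papers have >= h citations each."""
    # Sort citation counts in descending order, then find the last
    # position i where the i-th best paper still has >= i citations.
    cites = sorted(citations, reverse=True)
    h = 0
    for i, c in enumerate(cites, start=1):
        if c >= i:
            h = i
        else:
            break
    return h

# A researcher with papers cited [10, 8, 5, 4, 3] times has h-index 4:
# four papers each have at least 4 citations, but there are not five
# papers with at least 5.
print(h_index([10, 8, 5, 4, 3]))  # -> 4
```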

Hector was at Princeton when I was being recruited to go there, many years ago. I still recall what he told me about Princeton. He was then a junior faculty member who had just graduated from Stanford. He pointed out that when he was deciding where to go—he had several offers—he selected Princeton for many reasons. But one reason he explained this way to me:

I noticed the semester at Princeton was only one week longer than the quarter at Stanford.

Hector was a great teacher, but his true passion was research, so having a short teaching semester was pretty neat. I have to say I went to Princeton for many reasons, but I never minded the length of the semesters.

One last story I must tell about Hector—he was always cold. I mean physically cold. Perhaps this is why he moved back to the sunny west coast, because even during the Princeton summers he might wear a sweater.

One day a large company offered to give our department a number of their workstations. The machines were good at running Lisp, but did not support UNIX and so were not useful for our day-to-day tasks. The department asked who wanted one of these machines: I said no thank-you. But, Hector took one—to my surprise.

We were in his office one day chatting, when I noticed that there in the corner was one of the workstations. It looked unused and untouched, but it was clearly turned on and running. I asked Hector the obvious question: why had he taken one of the machines, since he clearly was not using it? Hector, always logical, said he never used the machine, but he knew they ran “hot”—they were known to have cooling problems. So he kept his on all the time, and used it as a space heater. He said he would have taken two if he could have.

Solving Chess Endgames

I believe solving the ${k}$-piece chess endgame problem can be reduced to performing many sorts on a very large file. The size of the file is a function of ${k}$: for six pieces it is about ten gigabytes, and for seven pieces it is less than a terabyte. Ken Regan suggests that it grows by a factor of about ${70}$ for each additional piece.

The connection with the ${M^{3}}$ project is that sorting is one of the problems large memory helps with tremendously. Again, more on this in the future—for now I want to explain the connection of the chess problem to sorting. In the near future, I plan to discuss the various methods and issues involved in performing extremely large sorts. Around 18 months back, Google did a record-sized sort of one petabyte of data in 6 hours and 2 minutes across 4,000 computers. See this for more details. An example of the issues that arise in such large sorts is disk failures: Google used so many disks because they had to plan for disk failures during the sort.

Let’s look at the chess problem abstractly. The problem is this: let ${G}$ be a bipartite graph where the left nodes are the positions where white has to make a move and the right nodes are the positions where black has to move. These positions are constrained to have at most the required number of pieces. For ease of presentation I will make one assumption: the graph models a game with only wins and losses, no draws. This is not fundamental; it will just help make the following clearer—I hope.

Each node of the graph is labeled from the set

$\displaystyle \{ \mathsf{win}, \mathsf{lose}, \mathsf{unknown} \},$

all from white’s point of view. Initially, all nodes are labeled ${\mathsf{unknown}}$. Then, a pass is made over all the positions and as many positions as possible are labeled ${\mathsf{win}}$ or ${\mathsf{lose}}$. This is done by looking at each position, since many will be immediate wins or losses.

The key is to propagate the information from the win/lose nodes to the unknown nodes. The rules are simple: Consider a white node ${w}$ that is currently unlabeled. If it has an edge to a black node with the label ${\mathsf{win}}$, then label the white node ${\mathsf{win}}$; if all its edges go to black nodes with the label ${\mathsf{lose}}$, then label the white node ${\mathsf{lose}}$.

Similarly, there are the “dual” rules for the black nodes. Well, not exactly: the issue is that cycles are possible. But to make our point about using sorting to solve endgames, we can safely put this subtle issue aside.

These rules are applied until all nodes become known or until resources are exhausted. Note, if there are still nodes with the label ${\mathsf{unknown}}$, these positions may be a win or a loss. However, if we have applied the rules many times, we can at least be sure that these positions cannot be resolved within a small number of moves.
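The propagation rules above can be sketched as a simple fixpoint loop. This is a minimal in-memory version on a toy bipartite graph; the position names are hypothetical, and no draw handling is included, matching the wins-and-losses-only assumption:

```python
WIN, LOSE, UNKNOWN = "win", "lose", "unknown"

def propagate(white_moves, black_moves, label):
    """Apply the propagation rules until no label changes.

    white_moves[w] = black positions reachable from white-to-move position w
    black_moves[b] = white positions reachable from black-to-move position b
    label maps every position to win/lose/unknown, from white's point of view.
    """
    changed = True
    while changed:
        changed = False
        for w, succs in white_moves.items():
            if label[w] != UNKNOWN:
                continue
            if any(label[b] == WIN for b in succs):
                label[w] = WIN    # white can pick a winning move
                changed = True
            elif succs and all(label[b] == LOSE for b in succs):
                label[w] = LOSE   # every white move leads to a loss
                changed = True
        for b, succs in black_moves.items():  # the "dual" rules
            if label[b] != UNKNOWN:
                continue
            if any(label[w] == LOSE for w in succs):
                label[b] = LOSE   # black can pick the move bad for white
                changed = True
            elif succs and all(label[w] == WIN for w in succs):
                label[b] = WIN    # black has no escape
                changed = True
    return label

# Tiny example: b1 is an immediate win; w1 can move to b1 or b2,
# and b2's only move goes back to w1.
labels = propagate({"w1": ["b1", "b2"]},
                   {"b2": ["w1"]},
                   {"b1": WIN, "w1": UNKNOWN, "b2": UNKNOWN})
```

Note this in-memory loop is exactly the random-access approach discussed next; the point of the post is to replace these scattered `label[...]` lookups with sorts.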

The central question is: what is the cost of propagating the win/lose information to all the nodes? The obvious way is to cycle through the nodes and keep applying the rules. The problem with this approach, I believe, is the random-access nature of the lookups. Suppose ${w}$ is a white node. It may have edges to black nodes that are in arbitrary places, so this could be very expensive. This applies even if the whole graph and labels are in main memory—in modern machines random access to memory can be expensive due to cache misses. A good description of these issues, and of efforts to cache the lookups, accompanies Harm Muller’s proposed 7-piece tablebase algorithm.

My suggestion, which Ken and I have just started to explore, is to avoid random-access lookups altogether by reduction to sorting. With a small number of sorts we can apply the propagation rules to all the nodes of the entire graph. This still has to be repeated again and again until all the nodes get labeled, but using sorting can be faster than random access.

Essentially, we are thinking of the application of the rules as a kind of database join. Joins are exactly what is needed to do the lookups all at once. Each ${w}$ node may wish to look up a ${b_{w}}$ node’s value. Here is a small example to explain the idea. Form the file:

 ${w_{1}}$ ${b_{4}}$
 ${w_{2}}$ ${b_{4}}$
 ${w_{3}}$ ${b_{3}}$
 ${w_{4}}$ ${b_{2}}$

The line ${w_{1}}$ and ${b_{4}}$ means that the white position ${1}$ needs to see the label on the black position ${4}$.

After being sorted on the second value, this becomes:

 ${w_{4}}$ ${b_{2}}$
 ${w_{3}}$ ${b_{3}}$
 ${w_{1}}$ ${b_{4}}$
 ${w_{2}}$ ${b_{4}}$

Note, the black values are in ascending order so the values can be looked up fast by a “merge” type operation with the black nodes and their labels. Let us say this yields,

 ${w_{4}}$ ${\mathsf{win}}$
 ${w_{3}}$ ${\mathsf{lose}}$
 ${w_{1}}$ ${\mathsf{win}}$
 ${w_{2}}$ ${\mathsf{win}}$

Then we can sort on the first value and get the white labels in the correct order:

 ${w_{1}}$ ${\mathsf{win}}$
 ${w_{2}}$ ${\mathsf{win}}$
 ${w_{3}}$ ${\mathsf{lose}}$
 ${w_{4}}$ ${\mathsf{win}}$

The whole point of this approach is that the expensive operation is a standard one of sorting. We believe that we may be able to use high performance systems to run this type of algorithm and perhaps solve larger endgames than previously possible.
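The three steps of the little example above—sort on the black key, resolve by a merge pass, sort back on the white key—can be sketched directly. The position names and labels are the hypothetical ones from the tables, and the in-memory `sort` stands in for an external-memory sort on a real file:

```python
# Lookup requests: (white position, black position whose label it needs).
requests = [("w1", "b4"), ("w2", "b4"), ("w3", "b3"), ("w4", "b2")]

# Black labels, stored sorted by position, as the tablebase file would be.
black_labels = [("b2", "win"), ("b3", "lose"), ("b4", "win")]

# Step 1: sort the requests on the second (black) key.
requests.sort(key=lambda pair: pair[1])

# Step 2: a single merge pass resolves every lookup sequentially --
# no random access into the black table.
resolved = []
i = 0
for w, b in requests:
    while black_labels[i][0] < b:
        i += 1
    resolved.append((w, black_labels[i][1]))

# Step 3: sort on the first (white) key to restore white order.
resolved.sort(key=lambda pair: pair[0])
print(resolved)
# -> [('w1', 'win'), ('w2', 'win'), ('w3', 'lose'), ('w4', 'win')]
```

Note that the merge in step 2 only ever advances through `black_labels`, which is what makes the whole pass sequential.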

The issue then becomes, how to govern the sorts to minimize the number of iterations? We have in mind a loop over legal moves in any position, and/or a loop over the depth to mate. The latter is avoided in the standard tablebase algorithm, so that the running time is basically independent of the maximum number of moves it takes to win from positions in a particular endgame. Introducing it here would mean we are trading off cache efficiency against simple sequential runtime. Whether this would yield a large practical gain is yet to be determined. The only thing we can say now is that rewriting the algorithm to use reduction to sorting would make it more immediately applicable on cloud-computing systems.

Open Problems

How far can we go with near term technology on chess endgames? And how far in the future? Also what about just the problem of doing huge sorts. How do we do them? The Google approach used triple redundancy on the disks, since failures were expected. Are there more efficient ways to sort data of this size?


May 20, 2010 9:58 am

i must be missing something — why was google using triple redundancy instead of an erasure-correcting code to account for disk failures?

s.

May 20, 2010 10:23 am

Steve,

My guess is the code is simpler. But your idea would be better, it seems.

Thanks

May 20, 2010 12:37 pm

The redundancy also helps with load balancing. Ideally you want to run each piece of the algorithm on the node that holds the data. If there are several such nodes, then you have more options. Also, wouldn’t reconstructing the lost disk from data scattered across nodes be expensive (IIRC, Google doesn’t use any fancy networking, just plain-old ethernet)?

May 20, 2010 3:47 pm

i probably shouldn’t second-guess someone else’s experiment without all of their parameters, but i can’t help but respond to this…

no, decoding doesn’t need to be expensive. as far as “scattered across disks” is concerned, if you’re doing a sort (and again, correct me here if i’m wrong), shouldn’t you expect to need random access to the disks in any case?

proof: sorting is like trying to invert an unknown permutation. there are n! permutations on n elements, so distinguishing them takes roughly lg(n!), or about n lg n, binary comparisons. assume that you could do the same number of comparisons in nonrandom order. then the sequence of permutations could be compressed. however, any n log n high-K-complexity bitstring describing the permutation both can’t be compressed and describes a unique permutation.

so either you have to do worse than n lg n comparisons, or you (without special knowledge of the unsortedness of the data) have to have random access to the data to sort it. so you’re doing random access anyhow.

aside from all of this, decoding doesn’t require the data to be any more “scattered” than it would be to just read it off of the disks. it actually “load balances” better, since a single read is spread across multiple disks (and they can be spread across consecutive disks, or disks chosen to be balanced in some other way, for instance). this, by way of example, is why striping in disk arrays works so well.

s.

May 20, 2010 4:54 pm

The issue here is not the number of comparisons; it’s the time required for each comparison. The RTT on a vanilla ethernet is about 90 us, so you really want to avoid it whenever possible. You’re assuming that all comparisons take the same amount of time, but they don’t.

Where the load balancing comes in is that if you want to compute on a particular chunk of data, you pretty much have to do that on a node that has a local copy of the data. The more nodes there are that host your data, the lower the chance that you have to queue because the node you need is busy (and the higher the overall processor utilization). If you are processing the entire dataset, then it doesn’t matter so much because every node can process all of the data elements that it holds, but Google’s system is designed to process subsets of the archive in response to queries.

In your last paragraph it sounds like what you’re suggesting is striping volumes across disks attached to different hosts on the LAN, which doesn’t work very well at all. In fact, it pretty much guarantees that no record can be read without eating that ethernet latency, which would be a disaster. If, on the other hand, you’re thinking that you’ll stripe across multiple disks on the same controller (the way conventional RAID works), then note that that doesn’t protect you from failure. If you have a kernel panic in a single host you lose all of the disks on that host. If you have a failure in an ethernet switch, then you lose all of the disks in the rack served by that switch. GoogleFS avoids putting its multiple copies on any two disks that share an upstream failure point. It’s robust against failure, but it would slow to a crawl if you replaced the copies with a single striped copy.

May 20, 2010 5:48 pm

You are assuming the machines are on an ethernet. We are thinking of much higher performance, state-of-the-art systems.

May 20, 2010 3:47 pm

I am so flattered by all the nice things Dick says about me!!
The article brings back lots of fond memories of
Princeton and working with Dick… Dick was in many ways my
mentor at Princeton and I’ll always appreciate that very much!!
hector

May 20, 2010 5:46 pm

hold on, your argument is that 3 copies of the data on different hosts will somehow make comparisons stay off of the ethernet a meaningful chunk of the time? you have to pound that ethernet if you’re going to sort across lots and lots of hosts. it’s just plain how it has to happen. we’re not talking about an arbitrary computation, we’re talking about a sort. right?

s.

May 20, 2010 8:09 pm

steve,

In fact, we are talking about arbitrary calculations. Google didn’t build the system just to do this one sort. It’s the same engine that services search queries, analytics for AdSense, and just about everything else they do. In every map-reduce calculation you win big whenever you can stay entirely on-node for the map phase. Striping would guarantee that that can’t happen, ever. Beyond the latency issue, the congestion would quickly get out of hand if you had to do several off-node reads for each block of data you wanted to load.

rjlipton,

I assumed ethernet because, to the best of my recollection, that is what they actually use. Am I remembering wrong?

May 20, 2010 11:49 pm

Thanks for all the technical discussion of data tradeoffs and load balancing, before I even got time to add a technical comment/footnote to the article. Here are the issues we see—and they’re more general than “just chess” insofar as this embodies alternating computation:

1. The standard basic tablebase generation algorithm has the virtue that its running time does not depend on the depth-to-mate/conversion/etc., i.e. does not depend on the alternating running time t. It also avoids generating many non-winning White-to-move positions at all—only those reached by the lex-least saving move from some Black-to-move position get polled. It works backwards from “accepting configurations”—under DTM this means checkmates.

2. Each round of the standard algorithm works backwards from Black-to-move positions r newly found to win for White in the previous round.
2a. Every White-to-move position q that can go to r is marked winning for White (q may have been so marked before)
2b. If q is newly marked, then every Black-to-move position p that can go to q is generated.
2c. For each such p, if p is known to lose for Black, it is skipped.
2d. Else every other q’ that p can go to is checked, and if all lose for Black, then p is marked a newly-found win for White.

Our idea is to have step 2a make a temporary list “HL” of such q, tolerating duplicate entries until a final sort removes them. Then replace steps 2b–2d by an iteration over all White-to-move positions q” that are unknown, rather than the newly-known ones. Thus we also maintain lists WU and BU for White Unknown and Black Unknown, respectively. In place of 2b–2d we have:

3. Iterate over q” in WU.
3a. For each p” that can go to q” in one Black move, add p” to a temporary list “GL” .
3b. Sort GL, remove duplicates, and then merge BU with GL to get BU \ GL. Those are the “Black to move positions newly found to win for White.”
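Steps 3a–3b above can be sketched as follows. This is a simplified in-memory version of an external-memory step: the position names and the `black_moves` adjacency are hypothetical, and the Python set difference at the end stands in for the merge of the two sorted lists:

```python
def step3(WU, BU, black_moves):
    """One application of step 3.

    WU: unknown White-to-move positions.
    BU: unknown Black-to-move positions.
    black_moves[p] = White positions reachable in one Black move from p.

    GL collects every Black position that can still reach an unknown
    White position; positions in BU but not GL have all moves going to
    known positions, so they are the newly-found wins for White.
    """
    unknown_white = set(WU)
    GL = []
    for p, succs in black_moves.items():
        for q in succs:
            if q in unknown_white:
                GL.append(p)          # duplicates tolerated for now
    GL.sort()
    deduped = []                      # remove duplicates after the sort
    for p in GL:
        if not deduped or deduped[-1] != p:
            deduped.append(p)
    # BU \ GL, done here with sets; on disk this is a merge of sorted runs.
    newly_won = sorted(set(BU) - set(deduped))
    return deduped, newly_won
```

For example, with `WU = {"q1"}`, `BU = {"p1", "p2"}`, and `black_moves = {"p1": ["q1", "q2"], "p2": ["q3"]}`, position p2 cannot reach any unknown White position, so it is the newly-found win.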

This works, and now step 3a. is symmetric with step 2a. But—

The first egregious problem from the standpoint of conventional sequential time is that each BU position gets “touched” a number of times proportional to t, besides the number of times it is generated on backtracks from White-to-move positions.

The second problem is that without the random-access checks, we add duplicates to the HL and GL lists; thus HL stands for “Huge List”. We observe that the loops over q and p” above allow new information to be propagated even if they haven’t terminated yet. Thus we can interrupt them whenever HL and/or GL balance the sizes of other lists they merge with.

Thus we are trading off factors of t and the size of BU against avoiding the random accesses. Is this worth it? Before we program experiments, we’d like to know if the ideas seem reasonable and have a chance of succeeding in “cloudland”…

May 21, 2010 12:10 pm

Where I wrote “We observe that…” in the 2nd-last paragraph, I meant, “We mitigate this by observing that…” It comes down to a different kind of load-balancing.

May 24, 2010 8:01 am

“Ken Regan suggests that it (the size of the table file) grows by a factor of about 70 for each additional piece.” Isn’t this factor always something <= 64? Why the 70?

May 24, 2010 9:37 am

Need to include square piece goes on and what the piece is?

May 24, 2010 2:05 pm

Oh, I thought this was talking about a single table. In my use of that term, that means the piece types are fixed. I generate each set of pieces by themselves.

August 13, 2010 8:37 pm

The “70” figure is hazy, but the reason it can be bigger than 64 is that adding a Pawn can break symmetries that were used to halve the encoding of the smaller endgame.

Estimating the total size of the files, say for the “Platonic” 8-piece set, is an interesting exercise. Even apart from symmetry-breaking, the way I see to avoid double-counting still has “slices” that go up by a factor of more than 64 (or 64 – 7 = 57). This is to recurse by removing the “most significant piece” from a position, say in order White QRBNP, then Black qrbnp. If we have a 7-piece position with KBNN vs. krr, it can be the parent of 8-piece positions with an extra White Bishop, Rook, or Queen. This is offset, however, by an extra Pawn being able to go on at most 48 new squares, so I expect the ratio to be in the 50’s as you say after all.

August 13, 2010 10:46 pm

Ahh, I hadn’t thought about the symmetry change. I generally think about pawn endgames completely separately from pawnless and n-piece endgames separately from (n+1)-piece.

But thinking about them together, let’s for a moment assume that you don’t remove squares that are already occupied (so we’re at 64*64*64… instead of 64*63*62…). In pawnless endgames you can always get 10/64 compression from symmetries by flipping the first piece into the a1-d1-d4 triangle. (The first piece has 10 possible squares instead of 64.) So a 3-piece pawnless table will be 10*64*64. If we add a pawn, we only have the vertical symmetry, so the new table size will be 32*64*64*48, which is a factor of just over 153 bigger than the 3-piece pawnless. But this jump only happens when you add the first pawn. After that, you’re back down to a factor of 48 for each new pawn.
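The factor computed in the paragraph above can be checked directly; the counts follow the simplified model in the comment, which does not remove already-occupied squares:

```python
# Pawnless 3-piece table: first piece flipped into the a1-d1-d4 triangle,
# so it has 10 possible squares instead of 64.
pawnless_3 = 10 * 64 * 64

# Add the first pawn: only vertical symmetry remains (first piece on 32
# squares), and the pawn itself has 48 legal squares.
with_pawn_4 = 32 * 64 * 64 * 48

print(with_pawn_4 / pawnless_3)  # -> 153.6, i.e. "just over 153"
```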

I guess it would take a while for this large factor to be averaged below 64, although I tend to think the precise value isn’t all that important. Sorry for nitpicking, and thanks for the explanation.

June 19, 2011 12:52 pm

The technique of sorting during tablebase construction was already used in a 2006 paper, “An External-Memory Retrograde Analysis Algorithm,” by Ping-hsun Wu, Ping-Yi Liu, and Tsan-sheng Hsu.