# Truth From Zero?

* How we might compare AlphaZero against standards of perfection *

YouTube 2015 lecture source |

David Silver is the lead author on the paper, “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm,” which was posted twelve days go on the ArXiv. It announces an algorithm called AlphaZero that, given the rules of any two-player game of strategy and copious hardware, trains a deep neural network to play the game at skill levels approaching perfection.

Today I review what is known about AlphaZero and discuss how to compare it with known instances of perfect play.

AlphaZero is a generalization of AlphaGo Zero, which was announced last October 18 on the Google DeepMind website under the heading “Learning From Scratch.” A paper in *Nature*, with Silver as lead author, followed the next day. Unlike the original AlphaGo, whose victory over the human champion Lee Sedol we covered, AlphaGo Zero had no input other than the rules of Go and some symmetry properties of the board. From round-the-clock self-play it soon acquired as tutor the world’s best player—itself.

The achievements in Go and Shogi—the Japanese game whose higher *depth* in relation to Western chess we discussed three years ago—strike us as greater than AlphaZero’s score of 28 wins, 72 draws, and no losses against the champion Stockfish chess program. One measure of greatness comes from the difference in Elo ratings between the machine and the best human players. AlphaGo Zero’s measured rating of 5185 is over 1,500 points higher than the best human players on the scale used in Go. In Shogi, the paper shows AlphaZero zooming toward 4500 whereas the top human rating shown here as of 11/26/17 is 2703, again a difference over 1,500. In chess, however, as shown in the graphs atop page 4 of the paper, AlphaZero stays under 3500, which is less than 700 ahead of human players.

## Bumping Against Perfection

Although AlphaZero’s 64-36 margin over Stockfish looks like a shellacking, it amounts to only 100 points difference on the Elo scale. The scale was built around the idea that a 200-point difference corresponds to about 75% expectation for the stronger player—and this applies to all games. Higher gains become multiplicatively harder to achieve and maintain. This makes the huge margins in Go and Shogi all the more remarkable.

There has been widespread criticism of the way Stockfish was configured for the match. Stockfish was given 1 minute per move regardless of whether it was an obvious recapture or a critical moment. It played without its customary opening book or endgame tables of perfect play with 6 or fewer pieces. The 64 core threads it was given were ample hardware but they communicated position evaluations via a hash table of only one gigabyte, a lack said to harm the accuracy of deeper searches. However hobbled, what stands out is that Stockfish still drew almost three-fourths of the games, including exactly half the games it played Black.

I have fielded numerous queries these past two weeks about how this affects my estimate that perfect play in chess is rated no higher than 3500 or 3600, which many others consider low. Although the “rating of God” moniker is played up for popular attention, it really is a vital component in my model: it is the -intercept of regressions of player skill versus model parameters and inputs. I’ve justified it intuitively by postulating that slightly randomized versions of today’s champion programs could score at least 10–15% against *any* strategy. I regard the ratings used for the TCEC championships as truer to the human scale than the CCRL ratings. TCEC currently rates the latest Stockfish version at 3226, then 3224 for Komodo and 3192 for the Houdini version that won the just-completed 10th TCEC championship. CCRL shows all of Houdini, Komodo, and an assembly-coded version of Stockfish above 3400. Using the TCEC ratings and the official Elo “*p*-table” implies that drawing 2 or 3 of every 20 games holds the stronger player to the 3500–3600 range.

Of course, the difference from Go or Shogi owes to the prevalence of draws in chess. One ramification of a post I made a year ago is that the difference is not merely felt at the high end of skill. The linear regressions of Elo versus player error shown there are so sharp () that the -intercept is already well determined *by the games of weaker players alone*.

Overall, I don’t know how the AlphaZero paper affects my estimates. The Dec. 5 paper is sketchy and only 10 of the 100 games against Stockfish have been released, all hand-picked wins. I share some general scientific caveats voiced by AI researcher and chess master Jose Camacho-Collados. I agree that two moves by AlphaZero (21. Bg5!! and 30.Bxg6!! followed by 32.f5!! as featured here) were ethereal. There are, however, several other possible ways to tell how close AlphaZero comes to perfection.

## Some Possible Experiments

One experiment is simply to give AlphaZero an old-fashioned examination on test positions for which the perfect answers are known. These could even be generated in a controlled fashion from chess endgames with 7 or fewer pieces on the board, for which perfect play was tabulated by Victor Zakharov and Vladimir Makhnichev using the Lomonosov supercomputer of Moscow State University. **Truth** in those tables is often incredibly deep—in some positions the win takes over 500 moves, many of which no current chess program (not equipped with the tables) let alone human player would find. Or one can set checkmate-in- problems that have stumped programs to varying degrees. The question is:

With what frequency can the trained neural network plus Monte Carlo Tree Search (MCTS) from the given position find the full truth in the position?

The trained neural network supplies original probabilities for each move in any given position . AlphaZero plays games against itself using those probabilities and samples the results. It then adjusts parameters to enhance the probabilities of moves having the highest expectation in the sample, in a continuous and recursive manner for positions encountered in the search from . The guiding principle can be simply stated as:

“Nothing succeeds like success.”

We must pause to reflect on how clarifying it is that this single heuristic suffices to master complex games—games that also represent a concrete face of asymptotic complexity insofar as their size *n*-by-*n* generalizations are polynomial-space hard. A famous couplet by Piet Hein goes:

Problems worthy of attack

prove their worth by hitting back.

It may be that we can heuristically solve some NP-type problems better by infusing an adversary—to make a PSPACE-type problem that hits back—and running AlphaZero.

As seekers of truth, however, we want to know how AlphaZero will serve as a guide to perfection. We can regard a statement of the form, “White can win in 15 moves” (that is, 29 moves counting both players) as a theorem for which we seek a proof. We can regard the standard *alpha-beta* search backbone as one proof principle and MCTS as another. Which ‘logic’ is more powerful and reliable in practice?

A second way to test perfection is to take strategy games that are just small enough to solve entirely, yet large enough that stand-alone programs cannot play perfectly on-the-fly. One candidate I offer is a game playable with chess pawns or checkers on a board with 5 rows and columns, where perhaps can be set to achieve the small-enough/large-enough balance. I conceived this 35 years ago at Oxford when seemed right for computers of the day. The starting position is:

Each player’s pawns move one square forward or may “hop” over an opposing piece straight forward or diagonally forward. If some hop move is legal then the player must make a hop move. The hopped-over piece remains on the board. If a pawn reaches the last row it becomes a king and thereupon moves or hops backwards. No piece is ever captured.

The goal is to make your opponent run out of legal moves. If a king reaches the player’s first row it can no longer move. This implies a fixed horizon on a player’s count of moves. The trickiest rules involve double-hops: If a single hop and double hop are available then the double hop is not mandatory, but if a pawn on the first row begins a double hop it must complete it. Upon becoming a king after a hop, however, making a subsequent return hop is optional, except that a king that makes the first leg of a returning double hop must make the second leg. A final optional rule is to allow a king to move one cell back diagonally as well as straight.

From the starting position, White can force Black to hop four times in four moves by moving to a3, a4, d3, and d4. Then White still has the initiative and can either make a king or force another hop; the latter, however, forces White to hop diagonally in return. This seems like a powerful beginning but the subsequent Black initiative also looked strong. My playing partners at Oxford and I found that positional considerations—making the opponent’s pieces block themselves—mattered as much as the move-count allowance. This made it challenging and fun enough for casual human play, but we knew that computers should make quick work of it.

The point of using this or some other solved game would be to compare the strategy produced by the AlphaZero procedure against perfection—and against programs that use traditional search methods.

## Open Problems

What do you think are the significances of the AlphaZero breakthrough?

### Trackbacks

- Machine Learning « Pink Iguana
- The great Ken Regan on AlphaZero - Marginal REVOLUTION
- The great Ken Regan on AlphaZero | Me Stock Broker
- The great Ken Regan on AlphaZero – Courtier en Bourse
- Follow up, links | The Tao of Gaming
- Curated Insights 2017.12.24 - Bursa Digest
- Predictions We Didn’t Make | Gödel's Lost Letter and P=NP
- secret/ blueprint/ path to AGI: novelty detection/ seeking | Turing Machine

Good post, thanks.

Vanilla checker would probably be a good instance of game where perfect play is possible, yet that possess a search space that is both large and deep enough to yield interesting strategies.

I have been suggesting that the new AI try its magical powers on NIM, where we know how to play perfectly, and rather simply. Scott Aaronson send a related link https://arxiv.org/pdf/1711.02301.pdf

Also there are a lot of interesting NIM variants where I think(?) we don’t know how to play perfectly. E.g.,

1. Players are allowed to take one or more stones from either one row

or two rows. As usual, the last person take any stones wins. (And the

variant where the last person take any stones loses).

2. Stones are originally placed on an n x n chessboard. Players are to

take one or more stones from either a row or a column.

3. Same as 2 except rows, columns, diagonals.

4. Multiple forms of 2,3, as in 1.

Lol, I’ve also asked about NIM on the below link! And that’s an interesting paper – surprising that neural networks couldn’t figure out the perfect play.

I wonder how AlphaZero could solve classic problems for which we know an algorithm, like Dijkstra: https://cstheory.stackexchange.com/questions/39715/can-neural-networks-be-used-to-devise-algorithms

I see that people complained about a fixed time limit of one minute per move, but how do they come to the conclusion that the limit was fixed? The paper just says “…, playing

100 game matches at tournament time controls of one minute per move.” I cannot even find the passage in the paper where it says that stockfish “played without its customary opening book or endgame tables”. The thing with the hash table of only one gigabyte for 64 threads is there and was probably a mistake (because stockfish is not tested well at that configuration and one gigabyte is apparently too small for 64 threads), but I doubt that it was intentional. I have rather the impression that it is a typical case of not asking the right people for how to correctly use a piece of software under extreme conditions (and not even being aware that one creates extreme conditions).

” I have rather the impression that it is a typical case of not asking the right people for how to correctly use a piece of software under extreme conditions”.

Gentzen, your sentence is also true with the modification “under normal conditions”, i.e.

” I have the very strong impression that it is a typical case of not asking the right people for how to correctly use a piece of software under even normal conditions”.

I think there’s a lot of evidence that Stockfish was playing well below it’s usual strength, just from the moves it made.

I’m convinced that “one minute per move” does mean a fixed one minute per move, and that they probably used this because they didn’t seem to realize how weakening this was. If they really meant something other than a fixed minute per move, then the correct terminology is NOT “one minute per move”, but rather e.g. “40 minutes for 40 moves”. If that’s what they meant, it would be more evidence that they really have no idea about computer chess.

It seems to me that the notion of perfect play in the game theoretic sense doesn’t really map usefully to the notion of ELO scores for the following reason.

To maximize ELO score one wouldn’t play in the game theoretically perfect fashion. Rather, one would play to maximize the chances of forcing any imperfect player to make a mistake. For instance, if one knew that one was playing against stockfish with certain config options one could simply simulate all possible games and look for the ones in which you win.

Now even if you don’t know the program or person one is playing against it still seems one would maximize ELO rating by simply playing in a way that maximizes the chance of victory given a certain probability distribution on the space of all possible programs (say equipped with a random oracle).

—

Even if one is trying to give an ELO score to play that is required to be game theoretically perfect it seems this will depend greatly on whether perfect play forces very particular moves at all times or if perfect play allows a great deal of options (i.e. many decision points which are all compatible with perfect play). In the later case those options give more freedom to both play perfectly and force the opponent into errors.

Thanks. Your second paragraph is why the Stockfish would be randomized and your third paragraph is what I am postulating about when I say “any strategy” under a randomized analysis. My core postulate is that the randomized Stockfish 8 (say on 4 CPUs) would hold 10-15%. It can be refuted by an “AlphaZero+” that scores over 90%. The Elo assertion is derivative to that. Referring also to your second comment, my formulation avoids making “perfect play” into a thing. [It also gives some compensation for the “rating of God” irreverence by my being able to say that “God” is a universal quantifier 🙂 ]

Thanks for the clarification!

I’d assumed when you meant randomized analysis you meant something like letting stockfish choose its next move with probability somehow based on the score it gave to that move but now I think I misunderstood you.

It would come out that way: If two moves have the same best value it might flip a coin; if the best move is ahead by 0.20, say, it would always play it. But this would be arbitrary (introducing “noise”) to avoid strategies homing in on positions where Stockfish (with yea-much time budget etc.) would be presciently fated to fail. This is in particular completely apart from the theories by which my model and AlphaZero respectively ascribe probabilities to each legal move in a position.

I am quite puzzled by these advancements in AI and machine learning.

These algorithms, like alphaZero, that teaches themselves, can they also be implemented not for strategic games, but for any algorithmic task? That is, for solving NP hard problems?

Or for solving mathematical problems?

Thanks. I tried to address exactly that point with the Piet Hein interlude in the third section.

To put the point differently consider what you are saying if you say that perfect play only has a rating of 3600 or so.

Suppose, as seems reasonable, that white wins under perfect play and perfect play is as narrow as possible, i.e., an every move there is only a single play for white that doesn’t hand the game to black. If stockfish scores a 3400 you are suggesting that 50% of the time (assuming some randomized version of stockfish with moves choosen randomly based on weights derived from move scoring) when playing black this version of stockfish plays EVERY LAST MOVE PERFECTLY. That seems highly unlikely given that chess engines haven’t converged to virtually unanimous agreement about the best move in almost every situation during a high level game.

Moreover, the perfect player (as black) has virtually unlimited discretion to push stockfish or any other engine into situations it will likely make an error. A perfect player could simulate all possible algorithms weighted by total instructions needed to execute per game to maximize the chance that the opponent will make a mistake.

—-

Seems to me that while 3600 might be a good estimate for the limiting value of POSSIBLE chess engine play things suddenly go discontinuous if we assume actual perfect play and unbounded resources.

A thought exercise: what if someone had developed this algorithm 15 years ago? Would the asymptotic patterns look fundamentally different? Eg would the elo of AlphaZero still converge (to a lower value, of course)? And it’s possible it may score similarly against the best engine of 2007 (Deep Fritz?). If these patterns could conceivably been the same on 2007 hardware, then I don’t think we can draw conclusions about the absolute strength of AlphaZero. On the other hand, if you think there would have been a a fundamental difference, what would it have been?

Ken, have you considered running the AlphaZero games through your cheating verification program? As I understand it, this will determine a likelihood for each move on a scale from human-like to computer-like. I’m curious where the non-trivial AlphaZero moves rate on this scale.

AlphaGoZero seems to have discovered some new stronger ways to play around certain patterns in Go. I wonder if some useful discoveries in chess will come from AlphaZero. (e.g., a sequence of questionable moves that lead to a strong position or material advantage)

I ran the 10 available games through the “screening test” with both the quick setting I use for human games (about 10-15 min. per game on one core thread) and a higher setting that guaranteed depth at least 25 (Komodo 11.2) or 27 (Stockfish 8) and at least 100 million nodes searched (but taking 5-7 hours per game). What it gives is a kind of inner product between AlphaZero and the 64-core Stockfish 8 on one hand, and the single-core Stockfish 8 and Komodo on the other, in which the former are really judging the latter. Both lower-depth engines had higher % agreement with the 64-core SF8 (naturally the two Stockfish configurations agreed the most), but interestingly, both said AlphaZero had a lower average error per move than the 64-core SF8 (maybe just because AlphaZero won all the games). The two higher-depth tests each gave 48 on my “Raw Outlier Index” where 50 is the expectation for one’s rating, where I used 3226 for the 64-core SF8 and 3326 for AlphaZero, while the lower-depth Komodo gave 47 but the lower-depth Stockfish only 38. Overall I considered the data from 10 games too paltry to mention in the post.

Ken, why did you consider the data from 10 games too paltry to mention in the post? Going by Michael’s description, I assume you use it to look at an individual’s games in a single tournament and that might be about 10 games, each probably shorter than many of the ones in the match.

And what was the % agreement between the higher-depth test Stockfish 8 & the Stockfish 8 that played against AlphaZero? How do you define the “% agreement”? (That’s already unclear to me, let alone the definition of the “Raw Outlier Index”. Is it detailed somewhere you can point to?)

My cheating tests do not regress on a player’s games—they take parameters corresponding to the rating that were obtained by regression on large data. However, my “Intrinsic Public Rating” values do regress on the small data of one player—and typically produce error bars of +- 200 or more over 9-10 games. The bars narrow toward the high end, but that’s offset by apparent loss of resolution above 3100 which is why I also needed higher depth. Thus 10 games are just too few to reach any conclusions within 100 Elo, let alone that their selection was biased.

“Agreement” means that both engines give the same move as their first line. Often two or more moves are tied for top value, but which one the program lists first is still highly significant—see this old post for some of why. Using Komodo 11.2 to depth at least 25 (at least 100 million nodes) produced agreement with AlphaZero of 294/ 494 = 59.5% and average error 0.019 per move, as opposed to 339/ 495 = 68.5% agreement with Stockfish but higher average error 0.027 per move. Using Stockfish 8 in my depth>=27 && nodes>=100M mode as the “referee” yielded agreement 265/ 463 = 57.2% and average error 0.023 vis-a-vis AlphaZero, and agreement 331/ 459 = 72.1% with average error 0.024 versus the more-powerful Stockfish 8. The lower denominators are because Stockfish 8 gave values outside [-3.00,+3.00] for more positions than Komodo 11.2 did. The ROI is a composite of the agreement and average-error measures; it is

ad hocto my screening tests.⭐ 😀 ❗ 💡 ❤ world class stuff from deepmind, ahead of anyone else by far. the idea of working with algorithms that are "less and less preconfigured/ supervised" seems to me dead on track and wonder if it will lead to general AI in "not so long". have a wild idea for the secret of strong AI but this comment is too small to contain it. if theres enough demand on my blog, will post it there. seems to be very low risk of having to carry out that. 😐

but anyway have been musing that there seem to be no technical consultants or architects in these companies steering a general strategy. they seem to be just hiring as many deep learning/ machine learning experts they can find and hope for the best, that one of them will be the next einstein of AI. suggest companies consider taking consulting advice/ ideas/ suggestions from nontraditional sources and even better give some possibility of remuneration for winning direction proposals. not sure exactly how to go about this myself… maybe an essay contest? 🙂

Hi, Few more experiments for Alpha Zero.

1. Knowing that it uses a deep network and MCTS for self play. Only one trajectory is published in the research paper, and the paper does not show that random restarts lead to the same Games against stockfish, and not the win ratio. This will provide evidence, that AlphaZero is not getting stuck in a local optima, and it really is playing perfectly ( Global Optimum ).

2. What happens if the training hyper parameters, like number of MCTS rollout are changed gradually.

3. Stockfish has been used for identifying the best player of all time. Ref : https://content.iospress.com/articles/icga-journal/icg0012 . The study should be repeated with the best optimum of AlphaZero.

4. what we also need to see is the asymptotics of improvement, how many self play games are required at each iteration, to win 55% of games with current version. It seems, Deepmind, has stopped the training for one set of parameters, without any evidence that it is not a local optimum and it can still improve, though it is exponentially harder. since they are not increasing the minibatch.

Some comments on the post and comments:

> This makes the huge margins in Go and Shogi all the more remarkable.

It’s more meaningful for Shogi than Go. Keep in mind that these are all trained with different amounts of compute, possibly *very*. Zero reaches roughly human pro level in ~16 hours wallclock, going by the first paper, but Zero is then further trained for almost a month and still hadn’t converged, IIRC, while the chess/Shogi agents are turned off almost as soon as they hit parity (I suspect they may well have been doing early stopping and the current short draft 2nd paper omits this). With that in mind, it seems like the intrinsic difficulty rating is something like shogi < chess With what frequency can the trained neural network plus Monte Carlo Tree Search (MCTS) from the given position find the full truth in the position?

Possibly not important but Zero isn’t actually using MCTS for regular playing. The CNN on its own is so strong that to play a full match, they just do a simple tree search down a few ply and that’s how they choose. (This is part of why the hardware requirements for playing drops so massively from AlphaGo Sedol to Zero.) The MCTS is for evaluating each position to provide improved value estimates of each possible move to retrain & finetune the CNN’s predictions.

> I have been suggesting that the new AI try its magical powers on NIM, where we know how to play perfectly

I don’t know if you’re one of the people asking this on Reddit too, but IIRC, one Redditor mentions implementing expert iteration and it does play Nim.

> These algorithms, like alphaZero, that teaches themselves, can they also be implemented not for strategic games, but for any algorithmic task? That is, for solving NP hard problems? Or for solving mathematical problems?

MCTS, and thus expert iteration using MCTS, is for any Markov decision process (MDP) with a model of the dynamics available. It is not limited to 2-player board games. If a problem has a state which determines the next state, well-defined rewards, and actions, then it can be considered as a MDP. Theorem proving, for example, is clearly a MDP: your state is all available axioms/theorems, your actions are rules of inference, your reward is 0 for intermediate theorems and 1 for the final goal theorem, and you can easily simulate it in a computer exactly and machine-check proofs. Jeff Dean’s NIPS talk explains how Google intends to turn all sorts of things like circuit design and datacenter optimization and network optimization and compiler stuff into MDPs which can be tackled using deep RL. People have also been trying (original) AlphaGo-like and other NN variants on theorem proving and it seems to work better than previous methods (nothing revolutionary yet, but maybe they haven’t figured out the best way to encode things for the NN or applied enough compute).

> Knowing that it uses a deep network and MCTS for self play. Only one trajectory is published in the research paper, and the paper does not show that random restarts lead to the same Games against stockfish, and not the win ratio.

As I said, Zero does not use MCTS for normal playing against Stockfish etc; the tree search is (or at least should be) deterministic as it does nothing like random rollouts or posterior sampling. There are no ‘random restarts’ for it unless you deliberately add noise and change its moves.

> What happens if the training hyper parameters, like number of MCTS rollout are changed gradually.

The papers don’t discuss the hyperparameter sensitivity and given the expense, DM might not have done much hyperparameter search. On the other hand, we can guess that it’s probably pretty stable: MCTS is not known as a finicky algorithm, Anthony et al 2017 don’t mention their expert iteration needing much hyperparameter tuning, Zero was simplified by simply throwing stuff out, DM papers usually mention it if extensive gridsearch was necessary, and the genius of expert iteration is scrapping policy gradients & deep RL in general in favor of effectively turning it into a simple CNN supervised learning problem so the stability *ought* to be much better.

An excellent summary of DM’s contributions. Your gist is just at the right level to incite interest and your references are valuable for further reading. Thank you gwern for sharing this.

While rechecking Silver et al 2017a/b for the chess raw performance, I noted the hyperparameter bits:

“The neural network architecture (see ‘Neural network architecture’) is based on the current state of the art in image recognition 4,18 , and hyperparameters for training were chosen accordingly (see ‘Self-play training pipeline’). MCTS search parameters were selected by Gaussian process optimization 68 , so as to optimize self-play performance of AlphaGo Zero using a neural network trained in a preliminary run. For the larger run (40 blocks, 40 days), MCTS search parameters were re-optimized using the neural network trained in the smaller run (20 blocks, 3 days). The training algorithm was executed autonomously without human intervention.”

“AlphaGo Zero tuned the hyper-parameter of its search by Bayesian optimisation. In Alpha Zero we reuse the same hyper-parameters for all games without game-specific tuning. The sole exception is the noise that is added to the prior policy to ensure exploration (29); this is scaled in proportion to the typical number of legal moves for that game type. Like AlphaGo Zero, the board state is encoded by spatial planes based only on the basic rules for each game. The actions are encoded by either spatial planes or a flat vector, again based only on the basic rules for each game (see Methods). We applied the AlphaZero algorithm to chess, shogi, and also Go. Unless otherwise specified, the same algorithm settings, network architecture, and hyper-parameters were used for all three games.”

In other words, they claim the CNN is as stable and insensitive to hyperparameters as any ol’ image classification CNN, which is very (at least compared to deep RL!), and they can reuse the same hyperparameters across multiple games as proof.

Gwern, our thanks too for helpful information. I was just writing this when your comment below referencing the “Search algorithm” section of the

Naturepaper came thru just now—the point having still been unclear to me even after reading that section myself. I’ll put the rest of this reply there.> Knowing that it uses a deep network and MCTS for self play. Only one trajectory is published in the research paper, and the paper does not show that random restarts lead to the same Games against stockfish, and not the win ratio.

As I said, Zero does not use MCTS for normal playing against Stockfish etc; the tree search is (or at least should be) deterministic as it does nothing like random rollouts or posterior sampling. There are no ‘random restarts’ for it unless you deliberately add noise and change its moves.

> I think, the tree search becomes deterministic only in test time. table S3 under configuration section, mentions 800 MCTS sims, for thinking during training. Since the network is initialized with random weights. I was wondering, if running the algorithms multiple times will give different trajectories, due to 800 MCTS sims ( rollout ).

As a deep learning noob, I don’t understand why Alpha Zero is limited to two-player games like go, shogi, and chess. Is it just that these are the low hanging fruit for the AI engineers to tackler first? Or is there a theoretical reason why multiplayer games like Texas hold ’em or chinese checkers will be significantly harder to tackler using GANs than two-player games?

It’s not limited to two-player perfect-information games, those are just the Drosophilia of RL research (and Google doesn’t limit itself like that – optimizing datacenter cooling is not a two-player game unless you define Nature as a player). It’s for MDPs, see my comment.

As for your examples: Texas hold’em is not a MDP because there is hidden information (players’ cards and the deck), which makes it a POMDP; you could turn it into a MDP by telling everyone all the cards but people might not be too impressed if you came up with an algorithm which could learn to play that well. 🙂 I don’t see any reason why Chinese checkers wouldn’t work fine with expert iteration aside from the fact that no one cares about it. (There are not so many GPUs and deep RL experts out there that every new research method will be applied to every game, after all.)

How does AlphaZero compare to the evolved neural networks for checkers and chess that David Fogel was doing in the late ’90s / early 2000s? He had some decent successes with relatively small scale computing resources, but I haven been able to figure out if and how AlphaZero relates to David’s stuff.

Also, if I understand correctly, the neural network (as essentially a sub-routine, if you will, of MCTS) basically acted like a engine in that it would try to pick the best move from a given position, but much faster than a typical chess engine because it just needed to basically do some tensor arithmetic.

So here’s a question. How good is the neural network at chess itself, without being embedded in the MCTS?

I don’t believe the chess/shogi paper reports it in detail like that. They do discuss this for the Go player, though. See the Silver paper on Zero, pg 5 or so. Figure 6 shows a graph of ELOs. The raw network without any tree search is ~3000 ELO run on a single machine, equivalent to AlphaGo Fan (with tree search) run on 176 GPUs:

“We evaluated the fully trained AlphaGo Zero using an internal tournament against AlphaGo Fan, AlphaGo Lee and several previous Go programs. We also played games against the strongest existing program, AlphaGo Master—a program based on the algorithm and architecture presented in this paper but using human data and features (see Methods)—which defeated the strongest human professional players 60–0 in online games in January 2017 34 . In our evaluation, all programs were allowed 5 s of thinking time per move; AlphaGo Zero and AlphaGo Master each played on a single machine with 4 TPUs; AlphaGo Fan and AlphaGo Lee were distributed over 176 GPUs and 48 TPUs, respectively. We also included a player based solely on the raw neural network of AlphaGo Zero; this player simply selected the move with maximum probability.

Figure 6b shows the performance of each program on an Elo scale. The raw neural network, without using any lookahead, achieved an Elo rating of 3,055. AlphaGo Zero achieved a rating of 5,185, compared to 4,858 for AlphaGo Master, 3,739 for AlphaGo Lee and 3,144 for AlphaGo Fan.”

Also, again, Zero does not use MCTS when playing, it uses a simpler tree search without any random rollouts – just the NN’s evaluation as the terminal estimate because it’s so good. Pg8, “search algorithm”:

“Compared to the MCTS in AlphaGo Fan and AlphaGo Lee, the principal differences are that AlphaGo Zero does not use any rollouts; it uses a single neural network instead of separate policy and value networks; leaf nodes are always expanded, rather than using dynamic expansion; each search thread simply waits for the neural network evaluation, rather than performing evaluation and backup asynchronously; and there is no tree policy. A transposition table was also used in the large (40 blocks, 40 days) instance of AlphaGo Zero.”

(This is of the first Zero Go agent but presumably it applies as well to the other later ones.)

One thing I remain confused by in the

Naturepaper is the use of “sample” in the statement near the beginning, “Finally, it uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte Carlo rollouts.” In the “Search algorithm” section, Figure 2 and the discussion of “temperature” still make me ask whether there is some randomization. I do understand that what is being propagated back from the NN evaluation are vectors of points expectations (“probabilities”) rather than position values under minimax.It is certainly significant that a raw network with no search—figuratively a “potential field” that can be evaluated for any game position—can be so strong already. So the final step is a relatively shallow/narrow search (80K x 60 = only 4.8 million nodes in one minute of playing) over the potential field that is completely deterministic? I wonder what size the encoded NN has…

It is a little confusing: why does the playtime tree search need any sampling since it is using a UCT which typically is deterministic rather than, say, a posterior sampling approach like Thompson sampling where one of the advantages is that it is stochastic (which is good for parallelized search or when feedback is delayed)? The evaluator section says that the MCTS temperature is set to 0 to “give the strongest possible play”, which makes sense, and the self-play section says that for training it initially uses a high temperature but then sets it back to 0; so you would think that for regular play, the temperature would also be set to 0. But then why does the play section talk about having a tau temperature parameter that ‘controls the level of exploration’?

I think there might be two different temperatures: one temperature is using in playing out training games and picking moves after MCTS evaluation is finished, to explore new board states; the second temperature is a (constant?) temperature used *inside* the simple tree search for deciding where to evaluate deeper, but the move itself is selected simply by a max().

> I wonder what size the encoded NN has…

Combining pg8’s description with http://mohanadkaleia.com/convcalc/ and https://cs231n.github.io/convolutional-networks/#layersizepat (19×19 input with 17 channels, stride 1, 256 filters, 3×3 spatial extent, times 19 layers, times 4 bytes for each parameter), it looks like the first 19-layer Zero is 124MB+ and then the full 39-layer might be 255MB+? Omitting the final layer and batchnorms, not sure how they add. (I don’t think resnets have any additional learnable parameters for the residual connections, it’s just addition.)

On the second point, wow. I’ve mused about the idea of “compressing” the huge endgame tables into a relatively small set of evaluated positions P such that today’s chess programs could play perfectly with ordinary search using just P for lookup. I thought maybe P could be 1 TB in place of the Lomonosov 100+ TB. But to get a trained oracle for all of chess into less than a gigabyte giving superhuman play at zero depth—?—an incredible existential fact.

Viewing also the first point, I’d be keen to try it on positions that are 10 or so ply upstream of a classic case where a Bxh7+ combination works with White’s a-pawn on a2 but not on a3. I believe there are sharper examples than the one I found discussed here by IM Jeremy Silman.

Somewhere on the order 300MB is just my best guess (although consistent with ImageNet model sizes, at least); but whatever the current NN, one can probably cut the size down by ~10x using model distillation/compression techniques. NNs are always heavily overparameterized because for some reason the small NNs can’t be trained directly, but you can train a small NN to imitate a large one highly accurately. Decreases of 10x or more are far from unheard of. Aside from being easier to store & transmit, and being able to run more instances on a single GPU, the small NNs also benefit from being faster (fewer parameters to compute) and can have lower latency (if you use a wider shallow net to imitate a deep one like 39-layers, replacing the inherently serial layer-by-layer computation by fewer but more parallel layers).

DM just gave us a nice demonstration of this with their voice synthesizer WaveNet: the original WaveNet has a hard time doing generation in realtime because there are just too many layers to compute serially through, so DM trained a wider shallower WaveNet to imitate the original WaveNet which comfortably works in realtime at similar or better quality and is computationally cheap enough to deploy to millions of people using Google Assistant. The paper: “Parallel WaveNet: Fast High-Fidelity Speech Synthesis” https://deepmind.com/documents/131/Distilling_WaveNet.pdf , van den Oord et al 2017 https://deepmind.com/blog/high-fidelity-speech-synthesis-wavenet/ . This would be a natural direction for further optimizing Zero models if DM had wanted to continue the project.

Actually, the sentence in the paper that most led me to infer that AlphaZero used MCTS while playing the games is in the middle of page 5 of the paper: “AlphaZero’s MCTS scaled more effectively with thinking time than either Stockfish or Elmo, calling into question the widely held belief (4, 11) that alpha-beta search is inherently superior in these domains.” Thus I—and some commenters on the TalkChess forum—thought the 80K nodes per second meant that MCTS was being started from these nodes. It would be more accurate for the paper to say, “AlphaZero’s MTCS created a neural engine that scales more effectively…”

Here’s a game on which I’d like to see AlphaZero tried: “guess the higher prime number”. Only one round per game, both players submit numbers at the same time. The higher prime number wins; a player automatically loses if they actually gave a composite number (unless both players did in which case it’s a draw). Since primes become more scarce as numbers become greater, the chances of hitting one purely by chance diminish as the opening plays become stronger. How far from the lower / easier wins would AlphaZero venture? Would it ever discover the underlying principle?

Tell me why this is wrong:

Reading “When Data Serves Turkey”, it looks like you measure player quality by comparing their moves to the best moves, *as determined by a chess engine”. But this isn’t really measuring how good their moves are, it’s measuring how similar they are to the engine used in the analysis–which among humans happens to be much the same thing, hence the rock-solid observed correlation with Elo. But this should break down with players better than (or equal to and different than) the analysis engine–wouldn’t AlphaZero’s !! moves be assigned a high raw error by analysis, despite being better moves than those considered optimal by the engine?

And it shouldn’t surprise us if the y-intercept suggests that a maximally-engine-like player will have the same Elo score as an engine!

Indeed, see my comment here above which says similar things and gives whatever meager results the 10 available games allow. Regarding your second query, there’s my point in the post that the y-intercept is already pretty well fixed by human players under 2000 (say) and at least the two strong engines with Linux versions I can script agree fairly closely on it.

“I regard the ratings used for the TCEC championships as truer to the human scale than the CCRL ratings.”

Ken, what is the basis for this belief? I actually wonder how to compare the ELO ratings for computers and humans, because they no longer play each other, so it’s basically two mutually exclusive pools of players.

Another question: does the initial, arbitrary assignment of an ELO rating before players have played a game have any effect at all on final ratings?

My basis is several indicators pointing to perfection being under Elo 3600, some mentioned in the post and my comments above. Updates to my work with stronger engines and deeper search have tended to push it under 3500. Now CCRL shows the assembler version of Stockfish at 3424 on a 4-CPU machine, so this is verging on paradox of machines being rated higher than perfection. I’ve had my attention drawn privately to how the top TCEC ratings have stuck around 3200 for unrealistically long, and it must be said that CCRL conducts so many games as to get a well-balanced distribution (avoiding effects of possible A > B > C > A type circularities in individual engine matchups). One way to resolve this experimentally would be first to make the “intrinsic” ratings of programs from around 2008 (in the CEGT matches in particular) on the human scale rock-solid, and then play matches between those programs and today’s. This article by Albert Silver is an instance of such a match, but only 6 games.

To answer the second question, the Elo scale is linear and relative, so the initial assignment does matter and you don’t get convergence like with Bayesian priors. The sports Elo ratings at FiveThirtyEight.com are centered on 1500. FIDE’s Elo ratings for Rapid and Blitz seem not to be anchored to anything, which is a present issue for me.

Thanks. So I guess you think the CCRL ratings are inflated by about 200 elo compared with the humans. I glanced at the SSDF list and the ratings seem maybe similar to the CCRL ones. They include old programs that may be like the ones that have played matches against humans.e.g. Deep Junior 13.3 x64 2GB Q6600 2.4GHz is rated 3115, Deep Fritz 13 rated 3097, Deep Fritz 11 3059. These numbers do seem high. But how much should one add back to the numbers to take into account more powerful hardware now than a decade ago?

My head says that the CCRL ratings are about 100 points too high, that is, the true Elos on 16-core 3.5Ghz machines are low/mid 3300s. My various indicators have until now kept pointing to 3200s. But just this week I’ve run a test with versions 4-5-6 of Houdini, 9-10-11 of Komodo, and 6-7-8 of Stockfish that seems to quantify real progress in a different way—what it is quantifying against is unknown and confidential, however. [All this is anyway modulo a combination of various faults of using lower-quality settings to do the measurements—and even if the arguments made by others here are effective, I still think my settings lose resolving power above 3100.] So I may have grounds to say the respective improvements all approach 100 Elo, which would put the later versions (on 16-core hardware) all above 3300.

I was thinking about “It may be that we can heuristically solve some NP-type problems better by infusing an adversary—to make a PSPACE-type problem that hits back—and running AlphaZero.” and found : https://arxiv.org/abs/cs/0210020, and wondering, if we can encode SAT to Tetris and run AlphaZero similar to Deep Q Learning on Atari Games.