Why it may take a ‘miracle’ to catch some cheaters

 source, where he’s right after Charles Dickens

John Littlewood really existed. He appeared as second author on so many papers with Godfrey Hardy that some believed him to be a fictional appendage. He kept on writing papers for a quarter century after Hardy’s death in 1947 and lived into his 90’s, passing away in 1977.

Today I wish to discuss “Littlewood’s Law” and its relevance to judging the incidence of cheating—at chess and in general.

Littlewood’s law has the informal statement,

Everyone witnesses a miracle every month.

Littlewood defined an event as a ‘miracle’ when one could reasonably say the odds against it were a million to one or higher. The logic of the “Law” works as follows:

• Each one of us tends to notice something distinctive about once per second.

• There are just under 2.6 million seconds in a 30-day month.

• If we are alert for 15 hours every day, that’s 1.62 million seconds per month.

• Hence we notice over 1.6 million distinctive events every month.
• Each event has a notion of a “normal” outcome, and hence a notion of deviation from normal.
• The deviation need not be numeric—it can be something like getting a photo caption tangled up in a quotation box as above—but it can be given numeric odds.

• Deviations in the million-to-one scale happen on average once every million times.

• With 1.62 million opportunities, the chance of seeing at least one miraculous event in a given month is about ${1 - (1/e)^{1.62}}$, which is over 80%. The occasional staid month is more than compensated by multiple miracles in others.

I have elaborated—really belabored—this more than Littlewood himself did. Littlewood’s point was just to argue that it is not unusual to see some unusual happenings. This conclusion does not require rigor. For chess cheating, however, I need to show where and when this reasoning carries through rigorously. My point will still be the same as Littlewood’s.

## Chess Cheating

Chess is a game of complete information. There are no cards to hide that might be palmed, switched, or played illegally, no dice that could be loaded. So how is it possible to cheat at chess? Alas the complete information can be conveyed to a computer, and thanks to the exponential increase in computer power and smarter chess-playing algorithms, consumer hardware can play better than any human. Hence cheating in chess in possible, and unfortunately this year it has seemed to become common.

Some players have been caught looking at PDA’s, or with a hearing device, or receiving signals, or with a computer-in-a-shoe. Recently the top-rated player in a tournament in Dortmund, Germany, was disqualified after his cellphone was found to emit code-like vibrations even when seemingly switched off. But in other widely-suspected cases, there is no solid physical or observational evidence at all. What can one do?

One can ramp up prevention, trying to make it impossible for a player to have direct or even indirect access to a computer. The trouble is that truly effective measures are too draconian or expensive to be used in large tournaments, while lighter ones can be evaded by clever schemers. Tournament officials have become more watchful, and yet cheating seems to be growing.

I have been working for years on a method whose beauty is that it requires no direct observation of the player, no monitoring during play, no draconian measures. As a negative statement its goal is simple:

Test whether the accused player’s moves correlate “too closely” with the moves of strong chess programs.

It comes, however, from positive research questions:

How closely do the moves of players of various rating levels correlate with those of our computer superiors? And how can we thereby measure skill based directly on the move decisions they make, instead of the sometimes-capricious results of games?

I have developed a skill-rating model and program that does statistical cheating detection as a by-product. It gives plausible results on even the strongest human players and computers themselves. The one person who confounds it, however, is John Littlewood—not the late well-known British player of that name, but our mathematician friend.

## Having to Break the ‘Law’

Testing for cheating from the game record is non-invasive, inexpensive, and can be used after the fact. It doesn’t involve body searches or EF jamming, both of which might break local laws. However, it meets a serious issue, one directly connected to Littlewood’s insight:

What if the accused player’s correlation with the program is simply a Littlewood miracle?

This is the central question to discuss in the remainder of this note.

The issue would be unavoidable even if I were not part of a ten-person committee created by the World Chess Federation (FIDE) and the Association of Chess Professionals (ACP) to combat cheating. Opposing players test games with computers all the time, in a scattershot manner, and they have not been shy about voicing machinations from their observations. I have dealt with even more cases that my data say clearly are crying-wolf; here are three in the public record. The problem occurs when the numbers are not so clear.

My work has been criticized for its putative inability to detect players who might cheat on only a few ‘critical moves’ per game, keeping purposeful deviation within the bounds ascribed to chance. If the numbers gave a reading of “insignificant” in such a case, well that would be the end of it. The problem comes in, and Littlewood’s Law strikes back, when the reading is some ways beyond the standard “two-sigma” threshold of significance.

I will take some sections for details and analogies, but here’s the bottom line for understanding. Not just for my program, but for many uses of evidentiary statistics to compute odds:

The computed odds are not the odds of cheating, but rather best-estimates of the frequency with which the policy of applying a sanction based on statistical results must expect to be erroneous.

That’s the frequency Littlewood is talking about.

## Some Details

Based on the rating of a player ${P}$, and on whether the positions faced tended to have few or many reasonable choices, my program generates projected probabilities for each move. Given any sequence ${C}$ of distinguished moves in each of those positions, my program thus computes both an expected number ${m_C}$ of times that ${P}$ will choose the move in ${C}$, and a standard deviation ${\sigma_C}$ for that statistic. The latter is based on viewing each turn as an independent Bernouilli trial over the probabilities generated for each possible move in that position, but includes an empirically-tested adjustment for dependence between consecutive moves that might be parts of a single strategy—this is a kind of “sparse dependence.”

Once ${m_C}$ and ${\sigma_C}$ are in hand, the actual number ${p_C}$ of moves in ${C}$ that ${P}$ played generates a z-score:

$\displaystyle z = \frac{p_C - m_C}{\sigma_C},$

which is measured in units of standard deviations or “sigmas.” For any value of ${z}$, the theory of normal distribution yields (one-sided) odds of a deviation of ${z}$ or greater, which one may readily look up in a table or get from an applet. For example, the value

$\displaystyle z = 2.326$

corresponds to odds of 1-in-100. What does this mean?

• When you have a single event with this deviation, it is said to be the chance of the event occurring “by chance.”

• Put another way, you are supposed to be able to say you are 99% confident that the event did not occur “by chance.”

• But what it really means is that if you have ${100X}$ such events, then you should expect about ${X}$ of them to show this much deviation or greater, just by the run of nature with no machinations.

And if you have ${1,000,000X}$ events, then expect ${X}$-many ‘miracles.’ How can we interpret this? Eighteen months ago I wrote a page to explain this in terms of golf, in which a hole-in-one is a ‘minor miracle.’ But now I’ll reference a simpler game for analogy: Marbles.

## Let’s Not Lose Our Marbles

Suppose you have a bag of 100 marbles of different grades of brilliance. If you reach into the bag and pick one, and get one that is shinier than the mean by 2.326 sigmas, you can count yourself lucky.

But if you spill them onto a white carpet, and pick one quickly, chances are you’ll notice the shiniest one. By Littlewood’s Law it’s better-than-even that it will be shinier than average by 2.326 sigmas. With 160 marbles, you should count yourself unlucky not to find such a shiny one.

If the marbles come from a big factory that fills its bags in a highly uniform manner, you can be pretty sure that your expectations apply to any bag. Or maybe the company makes bags with different “ratings” for brightness—then you can suppose a consistent mean for bags of a given rating.

In my analogy, the marbles are players—of various chess Elo ratings—and the bags are tournaments. Round-robin tournaments rarely have more than a dozen players nowadays (sixteen used to be the norm), but Opens have 50, 100, sometimes even 1,000 players. There are also perhaps 100 sizable tournaments going on worldwide in any given week or weekend, especially in the northern summer. About 20–40 of them are prominent enough to make a roundup called The Week In Chess, which England’s Mark Crowther has offered single-handedly for free for coming on 20 years. But before we deal with combinations of numbers, let’s try a simpler illustration.

## A Quiz on the Law

About 25–50 or so of the world’s top players are regularly invited to lucrative round-robin “super-tournaments.” The rest frequent Open tournaments, usually receiving modest appearance fees as well as the chance to win prize money, and the next several hundred players can also make a decent living by supplementing these competitions with coaching and writing and various other activities. Many do more playing, others more writing or teaching or officiating.

Say you are one of those several hundred, with a career going back 25 years since becoming an internationally ranked player. Now let’s run my cheating test on every one of your performances.

Suppose we do not find a single performance on which we would be 99% confident in isolation that you were cheating. Which of the following conclusions is supported materially by this result?

1. You didn’t cheat.

2. Chess programs 10–20 years ago weren’t strong enough to register on the statistical tests.

3. You derive most of your chess-related income from teaching and/or writing.

4. You have an “anti-computer” style of play.

We won’t keep you in suspense. The answer is 3. Those earning livable money from Open prizes pretty much have to play at least once a month. Playing once every two months won’t cut it, but even that makes 150 tournaments. By Littlewood’s Law, from 150 performances we have fairly high likelihood of seeing a 1-in-100 deviation on the plus side—and one on the down side.

## Players and Floods

Now suppose we have a tournament of 100 players, and we select some to test for cheating. We might select the first prize winner, or the top three or all those in a tie for first, but chances are they were among the highest rated players to begin with. Instead, as we glance over the tournament crosstable we may notice a remarkably high Plus Figure, like +250. This means the player had enough wins and draws to meet the expectation of a player rated 250 points higher. The player may still have finished in the bottom half of the tournament with more losses than wins, but if the player was among the lowest rated that could still be a highly “plus” performance and even win a so-called rating class prize. Some players are even alleged to lose on purpose in one tournament so that their rating will be below the cutoff for a class prize in the next, a practice called “sandbagging.”

So we test our Plus Player, and we get a z-score of 2.50, meaning odds of 160-to-1 against it happening “by chance,” and certainly a significant statistical outcome by civil convention. Then what do we do? The answer is: absolutely nothing. Chances are that player was just the shiny marble.

In any one weekly file of games from TWIC, there are typically 1,000–1,500 players, more in summer. Since many tournaments are split across two weeks of TWIC, this translates to about 1,000 player-performances per week. Now if we hear someone did “suspiciously well,” and the test obtains a z-score of 3.10, say, what then? That’s 1,000-1 odds, well-nigh unlikely in isolation. However, by Littlewood’s Law, one of those 1,000 players was bound to have an unusually fine weekend by the test’s measures. Since the measures correlate with the results of games, this is most likely to be the player who caused the buzz.

I’ve proposed thinking about odds in units of “weeks of TWIC,” multiplying them by 1,000. Thus if the test obtains a z-score of 4.00, for 31,600–1 odds, that’s about 7 months of TWIC. A score of 4.50, for just under 300,000–1, is about 6 years of TWIC. And 4.75, the closest round z-score to million-to-one odds, is about 20 years of TWIC. A performance with that high a deviation can be regarded the way insurance companies think of a “twenty-year flood.” And 5.00 sigmas, which physicists use as their threshold of confidence in discovery, is a 60-year flood in chess.

## It Takes a Miracle

Vetting the accuracy of my model is a separate issue. I have conducted empirical tests of the kind sketched here to calibrate its internal error bars, and I have comparison data from most of the entire top-level history of chess by which to sanity-check its conclusions. It is also possible that someone will devise a better model. So let’s take the solidity the statistical evidence as given.

I am proposing a 5.00-sigma standard in chess for statistical evidence to rise above the caveats of selection that go with Littlewood’s Law. This might be shaded to 4.75, giving exactly Littlewood’s quantification of “miracle,” something we’d expect to see in chess only once in two decades. Or perhaps to 4.50, which my post “Thirteen Sigma” noted is the end-product criterion for “Six Sigma” in industry.

In the case that occasioned my open letter to ACP, my test gave results well north of 5.00 for a single tournament. I was actually shocked to see such numbers tumble out, because in every previous case the z-scores were around 3.00 or somewhat beyond. I have also contrasted this case with three other performances considered ‘miraculous’ in chess terms that don’t even register significant deviations on my quick filtering test.

Littlewood’s Law says a single result of the lower 3.00-ish kind must be ignored, unless there is something completely distinct from game results or move-matching percentages that determines the selection—such as physical or observational evidence of cheating. But when the result is 5.00, from one or possibly a combination of tournaments in a short time span, I think it is beyond the caveats of the ‘Law’ and some action needs to be taken.

The action from a world body need not include a formal finding of cheating—that can be left as a consideration in local-authority action for recovery of prize money, for instance. One contention of my “Thirteen Sigma” post is that somewhere between z = 4.50 and z = 13.00 there needs to be a threshold on which the society of chessplayers can agree, as a burden of the privilege of competing, that exceeding it brings a sanction. Perhaps this can align with the developing understanding of the rules and procedures of blood-level readings in the sport of cycling.

## Open Problems

What does it take to be able to regard statistical evidence as primary in possible cheating cases, rather than the supporting role reserved when the z-score is under 4.00? Where should the threshold be?

1. September 18, 2013 10:50 am

Thank you, Ken, for this vital and well-begun effort to curb chess-cheating. We can all see that a tremendous body of high-level algorithmic reasoning and skilled computer programming is involved. The international chess community is immensely fortunate to benefit from your labors.

Is chess all about making the the best moves? Is mathematics all about proving the strongest theorems? That the answer to the first question is “no” — as your work vividly shows us — reminds us that the answer to the second question is “no” also.

That is why (as it seems to me) the evolving sensibilities of the chess community in regard to the crucial questions “What is chess, and what cheating in chess?” strongly overlap with the sensibilities of Bill Thurston’s much-admired essay On Proof and Progress in Mathematics.”

Perhaps in coming years you will write a summary essay “On Good Moves and Better Playing in Chess.” Any such essay would be read with great appreciation by many (including me).

Thank you again for fostering good chess, Ken Regan!

2. September 18, 2013 11:48 am

this is a great exercise in statistics, but a poor solution to the problem of cheating in chess, due to the issue of false positives, which you seem to address indirectly, but not head on. no matter how ingenious you are about your statistical methods, no matter how much you squeeze them, research them, finetune/polish them, *false positives/negatives are absolutely guaranteed*. at heart its close to a philosophical problem, not a statistical one. certainty is impossible. period.

am reminded of drug doping in sports, the baseball & cycling [lance armstrong] scandals. it seems better ethics in sports is a key part of the answer. could it be that the real issue is a society-wide decline in ethics? [possibly interrelated to improvements in technology?] such a scenario is very hard to measure/detect, but quite real in its implications.

changing the way the game is played seems to be one solution/alternative. chess will never be the same again after technology has permanently changed it, is a lot of this line of thinking actually disguised wishful thinking that this would not be the case? that is the nature of the human symbiotic relationship with technology…. it permanently alters the status quo, sometimes dramatically…. [seen in milestone events eg kasparov vs ibm computer…. still rippling outward….]

maybe you could find some other area to apply your formidable statistical prowess to. seriously.

3. September 18, 2013 11:59 am

a few more thoughts. the word “quixotic quest” is coming to mind. is it ultimately a futile, ultimately maddening pursuit of trying to derive black and white in shades of gray? there are places where statistics is of immeasurable value & leading to breakthrough new insights, such as finding causes of diseases. but in this area of chess cheating, a worthwhile pursuit [to some degree!], it is starting to remind me of the House vs gamblers in a casino. which sounds like this article. ripping from the headlines…. casino cheat uses infrared contact lenses marked with invisible ink…. did you ever see that movie 21? based loosely on a true story. there is also the similar story of edward thorp….

4. September 18, 2013 1:29 pm

& ps do you realize you are basically attempting to create a foolproof Turing test, except with Chess as the communication device?

September 18, 2013 5:29 pm

I find it fairly reassuring that there’s at least one math teacher who’s more involved in the detection of fraud in chess games than in math tests… 🙂

6. September 19, 2013 4:03 pm

further thoughts, some other links with some relevance/substantiation about the idea that maybe cheating is becoming more widespread in the culture. anyone who reads the headlines may remember this…
Half of university students willing to cheat, study finds
character disorder book, by simon. simon is a professional psychologist who argues that key aspects of freudian theory, which underpins modern psychology and treatment, are incorrect wrt “character disturbance” and that its counterproductive or even harmful to treat them under the existing paradigm. he believes there is a wider proliferation of character disturbance in the widespread culture.
Studies Find More Students Cheating, With High Achievers No Exception

• September 21, 2013 10:37 pm

VZ, thanks for your remarks, which I hear loud and clear. I don’t know of any more direct way to address false positives, than what I did in the post of actually defining my odds to mean their frequency in the quote box. I mentioned the Turing test angle in my talk a year ago in the University of Bergen’s Turing series. I agree with the character opinion here.

• September 24, 2013 3:30 pm

this is starting to remind me of mathematician jokes eg “An engineer thinks that his equations are an approximation to reality. A physicist thinks reality is an approximation to his equations. A mathematician doesn’t care.” … what about the issue of “diminishing returns”, doesnt this eventually apply somehow in the theoretical real? also, remember there is another modern/advanced technology for detecting cheating, possibly no more or less reliable…. re Simon I forgot to include a relevant blurb from his book, one of the bullet points: “The habitual behaviors the disturbed use to avoid responsibility and to manipulate, deceive, and exploit others.”

7. September 20, 2013 1:15 am

Here’s one way a player (call him bob) can use a smart (AI based) chess program to cheat and grow his rating level, while making it harder to detect the cheating:
The chess program is trained to play like bob (by analyzing all games bob played so far) and able to adjust the level of play (up or down) under bob’s control.
Then bob orchestrates a slow rise in his level of play all the while using the chess program.

Are such customized chess programs hard to develop in near future? I wouldn’t bet on it.

• September 21, 2013 10:38 pm

Clever thought—indeed, a facet of my model which I haven’t developed yet can facilitate exactly that, Randomized Bob+200.

• September 29, 2013 3:29 pm

And in general, if your algorithm (or someone else’s) and the limit “4.5 sigma” were accepted as the one fixed official method to detect cheating, then cheaters will look for (and probably find) a way to cheat that stays under the critical limit.

Very interesting article though, loved reading it.

8. September 20, 2013 11:34 am

Hi Ken, two questions/remarks

1) “How closely do the moves of players of various rating levels correlate with those of our computer superiors?”

Is it possible that even a low ranking player will adapt strategies with high than random correlation with a chess computer program by practicing with that program?

(I suppose that the answer is no, as the player cannot imitate the computational process used by the computer, but still in simple scenarios a human may perhaps adopt the “computer way of thinking”.)

2) It is a big question of how to use statistical evidence to prove guilt/misconduct etc.
Suppose for example you suspect cheating in a certain scientific experiment . It looks that the level of certainty required to disqualify the outcomes of the experiment is smaller than the level of confidence needed to prove that cheating took part. But it is very hard to do one without the other.

Here is a possible suggestion to disqualify potential cheaters while minimizing the “shaming” aspect. In every tournament you decide after the tournament on 5% (say) of participants which “do not count” for the official outcomes (winners/ranking etc.) These 5% include players with statistical signs for “cheating” as well as participants chosen at random. This way the “shaming effect” is minimized but still there is a good chance that systematic cheaters will not gain from their cheating (and if persisted will also be shamed eventually).

• September 21, 2013 10:45 pm

Gil, thanks. Regarding (1), what is considered valuable for players under 2000 to do is use online sources like chess.com and computers to practice their tactics. That is to say, pattern-recognition training is known to help amateur players for tactics, but not so much for overall strategy. Your (2) alas would conflict with the expectation everyone has of having one’s games rated—that is actually an important positive feedback which contributed to expanding chess in popularity in the late 1960’s even before Fischer-Spassky.

• September 22, 2013 12:26 am

Dear Ken, If 5% of your games (for all players) are chosen to be excluded from the ratings at random this will have, overall, very small effect on the rating. But I agree that (2) is not really practical, and I doubt if any statistically-based method can be implemented.

• September 23, 2013 6:15 pm

Dear Ken, looking at correlations you are testing “hit rate”, on the other hand, looking alive at Anand-Gelfand games, they did moves that were not suggested by the first two lines of the computer, still later with almost the same performance. The “correct rejections” take the second part of more objective measure of deviation from the computer program (in statistics it is Type I and Type II errors, although I have no idea how to implement it, due to I guess the diversity of lines), and although it is usually not recorded, the time to move may also be important in the cheating analyses, which is technically easy to implement using web-cameras and object recognition software.

September 21, 2013 7:22 pm

Ken,
Why not just require that each competing player, especially ones that do suspiciously well (z>3, even) play a few games of blitz? 5|2 should suffice.

Blitz is far too fast to have the moves broadcast, have an accomplice run it through the engine, and have the ideal move transmitted back to the player. It’s impossible to cheat at blitz.

In fact, the cheating player need not even win the blitz match. But, if he wins even 3/10 blitz games against a strong opponent which he was able to beat, it’s pretty clear he wasn’t cheating. If he loses them all, though, it should be fairly clear that he WAS cheating.

I guess the next difficult question we should ask is, out of 10 blitz games, how many must the challenger win to reaffirm his status as a non-cheater? I guess this goes back to the standard binomial distribution.

Good post, and I’d love to see a reply to this comment.

• September 21, 2013 10:52 pm

Josh, things like this have been talked about in various ways. In top-level events there is a post-game press conference that doubles as a debriefing, but no substantive cheating cases to my knowledge and supported by my data have come from there. In Open events, the universal custom of analyzing the game with one’s opponent plays this role—and one such testimony saying the opponent was not like Rosie Ruiz (who missed details of the Boston Marathon she’d just “won”—I’m there now) but knew details of their game was cited by me in a difficult report two weeks ago. The trouble with using blitz is high variance and its being about 600 points under standard-time Elos for top players according to my results. Plus requiring it would be intrusive in a way that is already seen as a barrier for expanding the postgame interview/debriefing idea.