Framing a controversial conversation piece as a conservation law

 Snip from Closer to Truth video on DA

J. Richard Gott III is an emeritus professor of astrophysical sciences at Princeton. He was one of several independent inventors of the controversial Doomsday Argument (DA). He may have been the first to think of it but the last to expound it in a paper or presentation.

Today we expound DA as a defense against thought experiments that require unreasonable lengths of time.

Gott thought of the argument when he saw the Berlin Wall as a 22-year-old touring Berlin in 1969. He reasoned that his visit was a uniformly random event in the lifetime ${L}$ of the wall. That assumption gave him a 75% likelihood that he was not observing the wall in the first quarter of its lifetime. Since the wall was then 8 years old, its total lifetime would with 75% likelihood be at most four times that, i.e., 32 years from 1961. Hence the wall would not last beyond 1993. It came down in late 1989.
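Gott's reasoning can be checked in a few lines. This is a minimal sketch; the function name is my own, and the numbers are those from the Berlin Wall anecdote above:

```python
def gott_upper_bound(age, confidence):
    """Gott's delta-t argument: with the given confidence, we are not
    observing within the first (1 - confidence) fraction of the span,
    so the total lifetime L satisfies L <= age / (1 - confidence)."""
    return age / (1 - confidence)

# Berlin Wall in 1969: 8 years old, 75% confidence.
total = gott_upper_bound(8, 0.75)   # at most 32 years in all
remaining = total - 8               # at most 24 more years, i.e., by 1993
print(total, remaining)
```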

The “Doomsday” name arises when one’s birthdate ${x}$ is regarded as a uniformly random sample from the sequence of all human births. If you are my age, ${x}$ is probably closer to ordinal 60 billion than 70 or 100 billion. We can then say we are 95% confident that we are not in the initial 5% of this sequence, which entails the sequence stopping before 1.2 trillion births. If our population levels off at 10 billion with 80 years’ life expectancy, that makes the lifetime ${L}$ of humanity extend no further than the year 12,000 AD. The upshot is that a longer ${L}$ entails asserting that our random sample gave a point unusually early in the span. The purer form of DA also argues that ${x}$ is not unusually late, giving this picture:

 Modified from Michael Stock source

This doubles the span of ${L}$ allowed with 95% confidence while giving reason—at the time ${x}$ occurs—to believe that the end is not imminent: at least about 1.75 billion more births will come after ${x}$. For ${x =}$ my birth, however, this is already a given.
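The arithmetic of the birth-ordinal version can be sketched as follows; the function name is my own, and the figures (60 billion births so far, 10 billion steady-state population, 80-year life expectancy) come from the text:

```python
def doomsday_bound(ordinal, tail=0.05):
    """With confidence 1 - tail that ordinal x is past the first `tail`
    fraction of all births, total births N satisfy N <= x / tail."""
    return ordinal / tail

births_so_far = 60e9
max_births = doomsday_bound(births_so_far)        # 1.2 trillion
# The purer two-sided form uses a 2.5% tail on each end, doubling the bound:
max_births_two_sided = doomsday_bound(births_so_far, tail=0.025)

# A population levelling off at 10 billion with 80 years' life expectancy
# gives 10e9 / 80 = 125 million births per year.
births_per_year = 10e9 / 80
years_left = (max_births - births_so_far) / births_per_year   # ~9,120 years
print(max_births, years_left)
```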

## Debating DA

The dependence on which observer is taken as the reference point ${x}$ is one shiftable parameter of the DA. If you are a preteen reader, then your own birth may be closer to ordinal 70 billion in the sequence, which becomes your reference point. You can then tack on another 2,000 years to ${L}$. The earliest human cave painters may have been among the first 3 billion Homo sapiens. With regard to their reference point, ${L}$ has already gone past their 95% limit.

A more fundamental rebuff to DA comes from the equal reasonableness of an alternate uniformity assumption: that you are a uniformly random element of the set ${H}$ of all possible human beings. Only a subset ${B}$ of ${H}$ will ever be born. The longer ${L}$ is, the higher was your prior probability of belonging to ${B}$. Thus the fact of your birth can be construed as weighting the odds toward longer ${L}$ in a way that cancels out the short-${L}$ reasoning of DA.

Even when an instance of DA passes these objections, the inference remains controversial. We wrote about DA last year in connection with estimating the lifespan of open problems remaining open. A clear non-instance is trying to apply DA to estimate the lifespan of the Covid-19 pandemic. We have all been going through the span together and now is not a uniformly random sample.

The DA assumptions would however hold if an alien tourist with no prior knowledge of events dropped in on Earth today. The delicacy of the assumptions makes it significant to seek scenarios where DA firmly applies—and better, where the inference may be deemed necessary to preserve the validity of established modes of inference against extreme skeptical hypotheses. This is what we will try to argue in regard to inferences of cheating at chess.

## The 1-in-100,000 Question

We have posted numerous times about my statistical chess model, its giving judgments of odds against null hypotheses of fair play in the form of z-scores, and my means of validating them. We will take as granted for this argument that the modeling is true in the sense that the distribution of z-scores from testing honest players conforms to the standard normal distribution.

Now let us talk about chess in the years B.C.—before Covid—when the game was played over-the-board (OTB) in-person across a table. Suppose I obtained a z-score of 4.265 from a test of one player in one tournament. I have chosen this number for all of the following reasons:

• It corresponds to what I call “face-value odds” of 100,000-to-1 against the null hypothesis, as one can see from this or any similar calculator.

• It is close to my number from an actual case in the year 1 B.C., that is, last year.

• It is also typical of z-scores I have been obtaining these past three months since chess went online, at the point where certain online platforms have made their own decisions to impose sanctions. Here I must add that the platforms’ cheating-detection systems avail themselves of information about the manner of play through the platform GUI, which often furnishes much greater statistical evidence, whereas my minimalist model uses only the record of the moves played in the games.

Suppose there were no other relevant information about the case. How would one assess the significance of the z-score of 4.265? Here are two different ways of reasoning that—in the case of OTB chess—arrive at similar answers:

1. The Bayesian prior probability of cheating in OTB chess has been estimated between 1-in-10,000 and 1-in-5,000. Suppose the former, and consider a thought experiment in which 100,000 players are tested. For simplicity, let’s suppose all true instances—that is, cheating players—give above 4.265. We expect there to be ten of them, plus one natural occurrence of 4.265 or more. Thus the odds that our score represents a true positive are only 10-to-1. This is well short of the odds range usually needed to meet the standard of comfortable satisfaction used for example by the Court of Arbitration for Sport. Thus the 4.265 datum alone should not be sufficient grounds for sanction.

2. Suppose there were a policy of sanction above a threshold of 4.265. The sum of playing fields in events held under the auspices of the International Chess Federation (FIDE) each year exceeds 100,000. Thus we would expect to find at least one z-score over 4.265 per year by natural chance, and sanctioning that player would be a serious human-rights error. FIDE cannot afford a rate of one such error per year. Thus the score alone is insufficient for sanction.
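The expected-counts arithmetic of point 1 can be checked in a few lines. This is a sketch under the simplifying assumption stated there (every cheater scores above threshold); the function name is my own:

```python
def true_positive_odds(prior, face_value_odds, n=100_000):
    """Expected true positives vs. natural false positives among n tested
    players, assuming every cheating player scores above the threshold."""
    true_pos = n * prior                 # expected cheaters caught
    false_pos = n / face_value_odds      # expected natural occurrences
    return true_pos / false_pos

# 1-in-10,000 prior, 100,000-to-1 face-value odds at z = 4.265:
odds = true_positive_odds(prior=1/10_000, face_value_odds=100_000)
print(odds)   # ~10, i.e., only 10-to-1 that the score is a true positive
```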

A 5.0 standard, however, gives a natural frequency of just over 1-in-3.5 million. The resulting error rate of once in 20-to-30 years might be acceptable in prospect. And the Bayesian argument based on a 0.0001 prior leaves about 350-to-1 odds against the null hypothesis, which is comfortably within the comfortable-satisfaction range as it has been applied.
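The face-value odds quoted for the 4.265 and 5.0 thresholds come from the standard normal upper tail, which can be computed from the complementary error function; the helper name here is my own:

```python
from math import erfc, sqrt

def upper_tail(z):
    """P(Z > z) for a standard normal Z, via the complementary error function."""
    return 0.5 * erfc(z / sqrt(2))

print(1 / upper_tail(4.265))   # ~100,000: the "face-value odds" in the text
print(1 / upper_tail(5.0))     # ~3.5 million: the natural frequency at 5.0
```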

FIDE nevertheless has maintained a policy that statistical evidence must be accompanied by some other kind of evidence. If a player is caught looking at a chess position in a bathroom, or found to have a buzzing device or wires on his or her person, or signaling behavior is observed, then in fact much lower z-scores (down to a threshold of 2.50, about 160-to-1 odds, in current FIDE regulations) are deemed to lend strong support to such evidence.

 2015 Peter Doggers/Chess.com source

I posted a similar rationale on my own website in early 2012, where causal evidence is likened to the “black spot” in the novel Treasure Island.

## One More Datum

Now, however, suppose we have the 4.265 and one more piece of “evidence” that is pertinent but not as clearly causal. It could be:

• The player wore a hat that covers the ears, or

• An unusually bulky sweater (worn on a hot day), or

• Unusual gestures or movements during the games.

Say a search of the player turned up nothing, but this occurred after the sequence of games giving the 4.265, a day after the player had been put on notice of suspicion. So the extra information is not a black spot but instead a “grey spot.” What can we conclude now?

The Bayesian argument seems to depend on judging how this information affects the prior probability of cheating. Does it make cheating a more likely hypothesis? We don’t actually know. Whereas the 1-in-10,000 global prior estimate was based on knowing dozens of cases over the past decade, only a handful conformed to this level of indication—short of more obvious things like making frequent visits to the restroom or being seen with an ear adornment. The most we can say is that the datum is not irrelevant. An example of an irrelevant datum would be if the player were wearing neon green sneakers—not bulky, no wires, just a weird green.

I would like, however, to argue that the player’s membership in a smaller sample ${B}$ that is pertinent enhances the significance of the z-score. ${B}$ must be defined by criteria that are not only independent of my statistical analysis of the games but also pertinent so as to avoid selection bias. What is needed to quantify this enhancement is:

(a) to collect all (other) kinds of items on a par with the above—say ostentatious bracelets that could camouflage electronic indicators—and

(b) to establish that the frequency of players having any such accoutrement over the global mass of tournaments is at most, say, 1-in-100.

Now there are several equivalent ways to continue the reasoning. One is to say that since ${B}$ is “at worst” independent, the face-value odds are amplified by a factor of at least 100. The Bayesian mitigation then still leaves about 1,000-to-1 odds against the null hypothesis. Another is to say that in any given year, the natural chance of seeing the conjunction of ${B}$ and the z-score is at most 1-in-100. Thus aside from the frequency of true positives, a policy of sanctioning in such cases would have a prospective error rate of once in 100 years. The conjoined error rate of that and sanctioning on 5.0 in isolation would be acceptable.
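The two equivalent framings just given reduce to a few lines of arithmetic, with all numbers taken from the text:

```python
# Way 1: B is "at worst" independent, so the 100,000-to-1 face-value odds
# are amplified by at least 100, then mitigated by the 1-in-10,000 prior.
amplified_odds = 100_000 * 100 / 10_000       # ~1,000-to-1 against the null

# Way 2: ~1 natural z > 4.265 per year across ~100,000 playing fields;
# requiring membership in B (frequency at most 1-in-100) as well cuts the
# expected false-positive rate to about once per century.
false_pos_per_year = 1 * (1 / 100)
years_between_errors = 1 / false_pos_per_year  # ~100 years
print(amplified_odds, years_between_errors)
```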

A Bayesian defense attorney might still counter: Consider a thought experiment in which we test 100,000 such “bulky” players. We don’t have any new information on the prior rate of cheating by players in ${B}$. For all we know, it is still 1-in-10,000. Thus the same terms as before will apply: our experiment will expect to have 10 cheaters in ${B}$ plus the one natural false positive, leaving the odds only 10-to-1 as before. Put another way: without knowing the import of specializing to ${B}$ on the likelihood of cheating, you can’t reach any further conclusion.

## Doomsday to the Rescue

The nub of rejecting this counter-argument is that:

Because there are only about 1,000 players in ${B}$ per year, the thought experiment of testing 100,000 players in ${B}$ now takes 100 years.

Moreover, the defense attorney is asserting that the mistaken false positive has occurred unusually early in this span. If this is the first year under consideration, then it is a uniformly random event in the first 1% of ${L}$. By the same reasoning as DA, the odds of this are only 1-in-100. Compounded with the 10-to-1 odds against this particular score in the thought experiment being the false positive, we recover something near the 1,000-to-1 odds of the original reasoning.

We might allow that we are not in the first year of “the cheating era” in chess. The thicket of high-profile cases with solid grounds for judgment goes back a little over 10 years. The factor from DA then goes down to 1-in-10. But this still leaves the overall odds about 100-to-1 against the null hypothesis, and that is commonly taken as an anchor point for the standard of comfortable satisfaction after all mitigating factors have been addressed.
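The compounding in the last two paragraphs can be sketched as follows; the function name and parameter names are my own:

```python
def da_adjusted_odds(base_odds, years_elapsed, span_years=100):
    """The defense's thought experiment spans span_years; asserting that
    the false positive fell within the first years_elapsed of that span
    carries a DA penalty factor of years_elapsed / span_years."""
    return base_odds * span_years / years_elapsed

print(da_adjusted_odds(10, 1))    # first year of the span: ~1,000-to-1
print(da_adjusted_odds(10, 10))   # ten years into the span: ~100-to-1
```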

Thus I am casting the Doomsday interval argument as a defense against unreasonably long thought experiments. It restores a dimension of time that is ignored by the Bayesian objection. This dimension of time is correctly preserved in the analysis of the expected error rate from a policy of imposing sanctions under this combination of ${B}$ and the z-score.

Is my line of reasoning valid? You can be the judge. If so, then it is a class of instances where DA is applied merely to conserve an inference of unlikelihood that was originally made by other means. This supports the validity of DA-type inference in general.

## Online Chess and the Time Warp

We are now in the third month of “the online era” in chess. Even though online platforms can process many more kinds of information than is available to me from OTB play, my work has proved highly relevant for global early indications, second opinions, and transparent explanations. Alas, the sanction rate at the new featured tournaments has been well in excess of 1%. We hope this will come down as the playing pool—which has been greatly democratized in massive online events—wises up to the reality of getting caught.

What I want to discuss here is how this brave new world flips the Bayesian reasoning in a way that may come on too strong for the prosecution, again by its indifference to the element of time.

Take the 4.265 z-score with a ${1\%}$ prior. The face-value odds from the z-score are now mitigated only to about 1,000-to-1, an error chance of 1-in-1,000. This gives 99.9% confidence in imposing a sanction. However, the rate of errors would be higher than once per year because more players in total are involved per tournament. The tournaments are played at faster Rapid and Blitz paces allowing eight or more games per day, whereas classic OTB tournaments feature one game per day, sometimes two, over a span of a week to ten days.
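The flip in the Bayesian reasoning can be seen by plugging the online prior into the same posterior-odds arithmetic as before; the function name is my own:

```python
def sanction_confidence(face_value_odds, prior):
    """Posterior probability that a flagged score is a true positive,
    approximating the posterior odds as face-value odds times the prior."""
    odds = face_value_odds * prior
    return odds / (1 + odds)

# z = 4.265 with a 1% online prior: odds ~1,000-to-1, ~99.9% confidence.
print(sanction_confidence(100_000, 0.01))
# The same score with the 1-in-10,000 OTB prior gave only ~10-to-1.
print(sanction_confidence(100_000, 1/10_000))
```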

This is also set against a vastly higher global sample size. Whereas the entire historical record of OTB chess represented by the ChessBase Mega database has yet to hit the 10 million games mark, the online platform Lichess has now hit 75 million games played per month. Adding in ICC and chess.com and ChessBase’s and FIDE’s own servers yields an equation that recalls Ps 90:4 and 2 Peter 3:8:

A thousand years of OTB are but a day that passes online.

For online platforms in isolation, absent anything to distinguish one player’s set of games from any other’s (such as their belonging to a highest-profile tournament), this means that even a 5.0 standard is inadequate for sure judgment. At their volume, online sites can see deviations of 5.0 arise by natural chance more than once per day. Thus they must either tolerate a higher rate of errors or adopt a standard so high as to let many more guilty parties through the sieve.

Such volume means all the more that one should hold a score of 4.265 as insufficient for judgment. This is despite the vastly higher Bayesian likelihood that a sanction based on that score is correct. The greater frequency of actual cheating does mean that the rate of error per positive reading declines, but the rate per absolute time, with regard to the fixed population of honest players, may matter more. This has accompanied deliberations of whether sanctions for online cheating must be given less permanent consequences in order to allow setting thresholds so that a high percentage of actual cheaters are flagged and the error rate can be tolerated.

## Open Problems

Does this analysis square with you? Does it help in understanding controversies over the original Doomsday Argument’s paradigm?

For another pass over the argument, suppose I get a z-score of 4.265 in a narrowly-defined event such as one country’s championship league. Does that limit the sample size, so that the score is more dispositive? The kind of reasoning in point (b) above, where we had to gather all possible indicators that would lead us to constrain the sample, would however mandate widening it at least to include other countries’ leagues. This is an aspect of the “look elsewhere” effect where the space of potential tests is widened even before actual tests are considered. Possibly it should be widened to include all tournaments with similar levels of players, in which case we are back to the “square 1” of the 1-in-100,000 section of this post. The point of the analysis of the extra datum about the player is that the sample expansion has an effective pre-defined limit.

[Added note about online cheating detection to third bullet in section 3. Clarified: “… the conjunction of these two factors” –> “… the conjunction of B and the z-score” and changed the succeeding sentence.]

1. June 7, 2020 8:29 pm

This is a wonderful post, but speaking of sports, name me a champion or record holder (e.g. Lance Armstrong, Mark McGuire, Ben Johnson, Tom Brady, etc.) and I will name you a cheater. Why should chess be any different?

June 7, 2020 10:13 pm

Are you literally claiming there is no champion or world record holder at all who isn’t a cheater? So you’re convinced that the Swedish curling team (damn, they didn’t win the last Olympics, did they? Now I can’t remember who did… but whoever it is) was juicing? I presume not, and I admit cheating isn’t as rare as many assume, but it’s a weird way to state it.

Imo they need to just allow the performance-enhancing drugs. Contrary to popular opinion, in the doses taken by athletes (not bodybuilders) they tend not to be particularly dangerous (many athletes undergo far greater health harms merely from competing at an elite level), are less dangerous than the less-detectable alternatives, and, like F1 racing with cars, would inspire biomedical innovation.

But of course some people will try and cheat at everything where money or status is on the line and the whole subject of the post presumes that insofar as it is discussing effective anti-cheating measures for chess.

June 8, 2020 8:14 am

Dear F. E. Guerra-Pujol:

Thanks for your comment. The post is Ken’s but I thought I would add my thoughts. You are right about players in sports and cheating. Some are really bad cheaters, others have pushed the limits. Well they did cheat too. Perhaps the money and status are too powerful for mortals.

Best and stay safe

Dick

• June 8, 2020 11:07 am

Thanks. Same to you as well!

June 7, 2020 10:06 pm

Two comments. First, often what we are doing with these thought experiments is trying to bring out aspects of our priors that aren’t immediately apparent to us (we aren’t really logically omniscient as a probability function must be).

Here, what I take the defense attorney to be saying is really, “Rather than considering a bulky sweater on someone you already suspect of cheating (prone to confirmation bias), it’s better to estimate your answer to this thought experiment (i.e., how frequently is a sweater just a sweater) and then use that to condition on.”

Second, in general I feel there is a big problem when you start to mix statistical arguments and common sense reasoning to identify cheaters. The problem is that in terms of statistical reasoning you have to settle on what random variable outcomes you will use before examining the case. While when we use our common sense suspicion detector we do the exact opposite.

This risks creating a whole new kind of mischief in that it encourages people to identify someone as statistically suspicious then go out and look for things like a bulky sweater or some other evidence for their guilt and plug that into the calculation as if that was ok.

But that’s fallacious. If some player wore a slightly bulky sweater but never went to the restroom the temptation is to respond to the sweater because our common sense reasoning responds to anomalies but it’s fallacious to only condition on the things that stand out and not on the lack of suspicious behavior on other measures.

I fear though that this is a natural human temptation.

• June 7, 2020 10:26 pm

Peter, thanks. I think I’ve anticipated the point in your last paragraph by carefully defining point (b) to be a union over “any” such factor: sweater U behaviors U …, hypothetically adding up to 1/100. Maybe the “any” isn’t so clear there. The sequence in the case in question was that suspicion of the item in question (it was not a sweater per se) came first.

The argument is actually not trying to mix modes of reasoning but keeping to quantitative elements where the relation of time to sample size (and the effect on “look elsewhere” type bias) is accounted for.

3. June 7, 2020 11:51 pm

Using your original model, what is the percentage of z-scores higher than 4.265 in online games? What about professional level? Does your model predict increased cheating?

• June 8, 2020 7:34 am

Thanks. The frequency is far in excess of 0.00001 alas. Among several things left on the cutting-room floor to keep the post’s length down was the implication of having a couple dozen scores in the range 3.00 to 4.00. That may not be high enough to sanction any individual player, but out of a field of 1,000 players, one can say with astronomical confidence that over half of those scores come from not playing fair. I call this “disjunctivitis.” My model does not “predict” this kind of phenomenon—it does not have any components specific to online behaviors—but observes it.