# Predicating Predictivity

*Plus predicaments of error modeling*

Cropped from Bacon Sandwich source

Sir David Spiegelhalter is a British statistician. He is a strong voice for the public understanding of statistics. His work extends to all walks of life, including risk, coincidences, murder, and sex.

Today we talk about extending one of his inventions.

His invention has to do with grading the performance of people and models that make predictions. A **scoring rule** grades how often predictions are right. But it may not tell how difficult the situations are. It is easy to look good with predictions when they start with a high chance of success. A weather forecaster predicting sunny-versus-rainy will be right more often in Las Vegas than in Boston. Quoting this FiveThirtyEight item:

> If you want to have an easy life as a weather forecaster, you should get a job in Las Vegas, Phoenix or Los Angeles. Predict that it won’t rain in one of those cities, and you’ll be right about 90 percent of the time.

In a 1986 paper, for a particular scoring rule defined by Glenn Brier in 1950, Spiegelhalter worked out how to equalize the forecaster grading. He applied his **Z-test** not to weather, which was Brier’s concern, but to medical prognoses and clinical trials.

What I am doing with a small group of graduate students in Buffalo is trying to turn Spiegelhalter’s kind of Z-test around once more. If a forecaster fares poorly, we will try to flag not the model but the behavior of the subjects being modeled. In weather we would want to tell when Mother Nature, not the models, has gone off the rails. Well, we are actually looking for ways to tell when a human being has left the bounds of human predictability for reasons that are inhuman—such as cheating with a computer at chess. And maybe it can shed more light on whether our computers can possibly “cheat” with quantum mechanics.

## Prediction Scores

Let’s consider situations in which the number $\ell$ of possible outcomes is usually more than $2$, that is, usually more than “rain” or “no rain.” The forecaster lays down projections $q_1,\dots,q_\ell$ for the chance of each outcome. If outcome $r$ happens, then the *Brier score* for that forecast is

$$B = (1 - q_r)^2 + \sum_{j \neq r} q_j^2. \qquad (1)$$

If the forecaster was certain that $r$ would happen and so put $q_r = 1$, all other $q_j = 0$, then the score would be zero. Thus lower is better for the Brier score.

If you put probability $q_r < 1$ on the outcome that happened, then you get penalized both for the difference $1 - q_r$ and for the remaining probability which you put on outcomes that did not happen. It is possible to *decompose* the score in another way that changes the emphasis:

$$B = 1 - 2q_r + \sum_{j=1}^{\ell} q_j^2.$$

Then $\sum_j q_j^2$ is a fixed measure of how you spread your forecasts around, while all the variability in your score comes from how much stock you placed in the outcome that happened. The worst case is having put $q_j = 1$ on some outcome $j \neq r$ that did not happen, whereupon your Brier penalty is $2$.

We would like our forecasts always to be perfect, but reality gives us situations that are inherently nondeterministic, with unknown “true probabilities” $p_1,\dots,p_\ell$. The vital point is that the forecaster should not try to hit $r$ on the nose at every time $t$ but rather to match the true probabilities. Once we postulate $p_1,\dots,p_\ell$, the *expected Brier score* is

$$\mathsf{E}[B] = 1 - 2\sum_j p_j q_j + \sum_j q_j^2 = \sum_j p_j(1 - p_j) + \sum_j (q_j - p_j)^2.$$

This is uniquely minimized by setting $q_j = p_j$ for each $j$, which defines $B$ as a **strictly proper** scoring rule. Without the second term in (1) the rule would not be proper for $\ell > 2$. When $q_j = p_j$ for all $j$, $\mathsf{E}[B]$ becomes equal to $\sum_j p_j(1 - p_j)$. Thus $\sum_j p_j(1 - p_j)$ represents an unavoidable prediction penalty from the intrinsic variance. If all $p_j$ are equal, $p_j = 1/\ell$, then the expected score cannot be less than $1 - 1/\ell$.
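As a minimal sketch of these definitions, assuming the standard multi-outcome Brier score $(1 - q_r)^2 + \sum_{j \neq r} q_j^2$ for a forecast vector `q` and realized outcome index `r`, the following Python compares the expected score of three forecasts against one set of postulated true probabilities. The function names and the probability vector `p` are illustrative choices, not taken from the model discussed in this post.

```python
def brier_score(q, r):
    """Brier score of forecast q (one probability per outcome) when outcome r occurs."""
    return (1.0 - q[r]) ** 2 + sum(qj ** 2 for j, qj in enumerate(q) if j != r)


def brier_decomposed(q, r):
    """The same score rewritten as 1 - 2*q_r + sum_j q_j^2."""
    return 1.0 - 2.0 * q[r] + sum(qj ** 2 for qj in q)


def expected_brier(q, p):
    """Expected Brier score of forecast q when outcomes follow probabilities p."""
    return sum(p[r] * brier_score(q, r) for r in range(len(p)))


p = [0.1, 0.2, 0.3, 0.4]                        # illustrative "true" probabilities
honest = expected_brier(p, p)                   # forecast equal to the truth
hedged = expected_brier([0.25] * 4, p)          # uniform forecast
bold = expected_brier([0.0, 0.0, 0.0, 1.0], p)  # all-in on the likeliest outcome
```

With these numbers the three expected scores come out to about 0.70, 0.75 and 1.20, so the honest forecast does best, and its 0.70 is exactly the unavoidable penalty $\sum_j p_j(1 - p_j)$.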

A second example, the log-likelihood prediction scoring rule, is in the original longer draft of this post.

## Spiegelhalter’s Z

Spiegelhalter’s $Z$-score neatly drops out the unavoidable penalty term by taking the difference of the score with the expectation. Schematically it is defined as

$$Z = \frac{B - \mathsf{E}[B]}{\sqrt{\mathsf{Var}[B]}},$$

where $\mathsf{Var}[B]$ means the projected variance $\mathsf{E}[B^2] - \mathsf{E}[B]^2$. However, here is where it is important to notate the whole series of forecasting situations $t = 1,\dots,T$ with outcomes $r_t$ for each $t$. The actual statistic is

$$Z = \frac{\sum_{t=1}^{T} \left(B_t - \mathsf{E}[B_t]\right)}{\sqrt{\sum_{t=1}^{T} \mathsf{Var}[B_t]}}. \qquad (2)$$

The denominator presumes that the forecast situations are independent so that the variances add. The numerator expands to be

$$\sum_{t=1}^{T} \left(2\sum_j p_{t,j}\, q_{t,j} - 2 q_{t,r_t}\right).$$

The original application is a confidence test of the “null hypothesis” that the projections are good. Thus we plug in $p_{t,j} = q_{t,j}$ for all $t$ and $j$ so that we test

$$Z = \frac{2\sum_{t=1}^{T} \left(\sum_j q_{t,j}^2 - q_{t,r_t}\right)}{\sqrt{\sum_{t=1}^{T} \mathsf{Var}[B_t]}}.$$

To illustrate, suppose we do ten independent trials of an event with four outcomes whose true probabilities are $(0.1, 0.2, 0.3, 0.4)$. The sum $\sum_j q_{t,j}^2$ in parentheses is $0.01 + 0.04 + 0.09 + 0.16 = 0.3$ in each trial. If the outcomes conform exactly to these probabilities then $q_{t,r_t}$ equals $0.1$ once, $0.2$ twice, $0.3$ three times, and $0.4$ four times, for a total of $0.1 + 0.4 + 0.9 + 1.6 = 3$. This exactly cancels the $10 \times 0.3 = 3$, and so makes $Z = 0$, as expected. Most trials will give a nonzero numerator, but in the long run, the numerator divided by $T$ tends toward zero and the denominator scales to match it, thus keeping the $Z$-statistic normally distributed.

A high $|Z|$, on the other hand (highly positive or highly negative), indicates that the forecasting is way off. That (2) is an aggregate statistic over independent trials justifies treating the $Z$-values as standard scores. This applies also to $Z$-tests made similarly from other scoring rules besides the Brier score. The test thus becomes a verdict on the model. High $Z$-values on certain subsets of the data may reveal biases.

Our idea is the opposite. Suppose we know that the forecasts are true, or suppose they have biases that are known and correctable over moderately large data sets. We may then be able to fit $Z$ as an unbiased estimator (of zero) over large training sets. Then it can become a judgment of whether the data has become unnatural.
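The ten-trial illustration can be checked with a short Python sketch. It assumes the Brier-score form of the statistic under the null hypothesis that the forecasts equal the true probabilities. The per-trial variance used here, $4\left(\sum_j q_j^3 - (\sum_j q_j^2)^2\right)$, is derived from the decomposition $B = 1 - 2q_r + \sum_j q_j^2$ and is an assumption of this sketch rather than a formula quoted from Spiegelhalter's paper. The function and variable names are hypothetical.

```python
import math


def spiegelhalter_z(forecasts, outcomes):
    """Z-statistic for the Brier score under the null hypothesis p = q.

    forecasts: list of probability vectors q_t, one per trial
    outcomes:  list of indices r_t of the outcome that occurred in each trial
    """
    numerator = 0.0
    variance = 0.0
    for q, r in zip(forecasts, outcomes):
        sum_sq = sum(qj ** 2 for qj in q)
        # B_t - E[B_t] = 2 * (sum_j q_j^2 - q_r) when p = q
        numerator += 2.0 * (sum_sq - q[r])
        # Var[B_t] = Var[2 * q_r] = 4 * (sum_j q_j^3 - (sum_j q_j^2)^2) when p = q
        variance += 4.0 * (sum(qj ** 3 for qj in q) - sum_sq ** 2)
    # Assumes at least one forecast is not deterministic, so variance > 0.
    return numerator / math.sqrt(variance)


q = [0.1, 0.2, 0.3, 0.4]
forecasts = [q] * 10
conforming = [0] * 1 + [1] * 2 + [2] * 3 + [3] * 4   # outcomes in exact proportion
z_conforming = spiegelhalter_z(forecasts, conforming)
z_all_unlikely = spiegelhalter_z(forecasts, [0] * 10)  # least likely outcome every time
```

Here `z_conforming` is zero up to rounding, while ten occurrences of the 10% outcome give a $Z$ of about 6.3, far outside the range expected of a standard score.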

## Why This Z?

As I have detailed in numerous posts on this blog, my system for detecting cheating with computers at chess already provides several statistical $z$-scores. Why would I want another one?

The motive involves the presence of multiple strong chess-playing programs, each with its own quirks and distribution of values for moves. They are used in two different ways:

1. As inputs telling the relative values of moves $m_j$, which my model converts into its probability projections $q_j$.
2. As output predicates telling how often the player chose the move recommended by a specific program and/or quantifying the magnitude of error for different played moves.

Having multiple engines helps point 1. My intent to blend the *values* from different engines has been blunted by issues I discussed in an earlier post. Thus I now have to train my model separately (and expensively) for each (new version of each) program. I can then blend the $q_j$, but point 2 still remains at issue: my tests measure concordance with a specific program. Originally the program Rybka 3 was primary and Houdini 4B secondary. Now Stockfish 7 is primary and Komodo 10.0 secondary, until I update to their latest versions. The second engine is supposed to confirm a positive result from the first one. This already means that my model is not trying to detect exactly which program was used.

Nevertheless, my results often vary between testing engines. The engines compete against each other and may be crafted to disagree on certain kinds of moves. They agree with each other barely 75–80% of the time in my tests. I would like to factor these differences out.

The Spiegelhalter $Z$-test appeals because its reference is not to a particular chess program, but to the prediction quality of my model itself, which per point 1 can be informed by many programs in concert. It gives a way to *predicate predictivity*. A high $Z$ value will attest that the sequence of played moves falls outside the range of predictability for human players of the same rated skill level.

## The Method

To harness $Z$ for some scoring rule, we need to quantify the nature of my model’s projections $q_j$. In fact, my model has a clear bias toward conservatism in judging the frequency of particular non-optimal moves. This is discussed in my August post on my model upgrade and shown graphically in an appended note on why the conservative setting of a “gradient” parameter is needed to preserve dynamical stability. The fitting offsets this in a way that creates an opposite bias elsewhere. I hope to correct both biases at the same stroke by a specific means of modeling how the $q_j$ err with respect to the postulated true probabilities $p_j$.

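We postulate an original source of error terms $\epsilon_j$ all i.i.d. as $\mathcal{N}(0, \sigma^2)$, where $\sigma$ governs the magnitude of Gaussian noise. This noise can be *transformed* and related in various ways, e.g.:

1. $q_j = p_j + \epsilon_j$,
2. $q_j = p_j(1 + \epsilon_j)$,
3. *[formula not recoverable]*,
4. $\log q_j = \log p_j + \epsilon_j$,
5. *[formula not recoverable]*,
6. *[formula not recoverable]*.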

There are further forms to consider and it is not yet clear from data within my model which one most applies. We would be interested in examples where these representations have been employed and in observations about their natures.

Given the error terms, we can write each $p_j$ as a function of $q_j$ and $\epsilon_j$. One issue is having at most $\ell - 1$ degrees of freedom among $\epsilon_1,\dots,\epsilon_\ell$, owing to the constraint that the $p_j$ as well as the $q_j$ sum to $1$. We handle this by choosing some fixed index $k$ as the “pivot” and using the constraints to eliminate $p_k$ and $\epsilon_k$, leaving the other error terms free. In all cases, the proposed method of defining what we notate as $Z_\sigma$ is:

1. Substitute the terms with $q_j$ and $\epsilon_j$ for each free $p_j$ into $Z$.
2. Compute the expectation over the $\epsilon_j$ for the numerator and denominator of (2), separately.
3. Holding the other previously-fitted model parameters in place, fit $\sigma$ so that $Z_\sigma$ is zero over the training set (or sets, for each level of Elo rating $R$, so $\sigma$ becomes a function of $R$).

If the resulting $Z_\sigma$-scores parameterized by $\sigma$ make sense, the last step will be adjusting them to conform to normal distribution, via the resampling process described in earlier posts. We are not there yet. But observations from Spiegelhalter tests with the unmodified $Z$ (equivalently, with $\sigma$ fixed to zero) suggest that the resulting single, authoritative, “pure” predictivity test may rival the sharpness of my current tests involving specific chess programs.

## Error Quirks and Queries

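To see a key wrinkle, consider the first error form, $q_j = p_j + \epsilon_j$. It is symmetrical: $-\epsilon_j$ has the same distribution as $\epsilon_j$. When we substitute $q_j - \epsilon_j$ for $p_j$ and take expectations, the symmetry of $\epsilon_j$ around $0$ makes it drop out of the numerator of (2), and out of everything in the denominator except one place where $\epsilon_j^2$ becomes $\sigma^2$. There is hence nothing for $\sigma$ to fit and we are basically left with the original Spiegelhalter $Z$.

In the second form, $q_j = p_j(1 + \epsilon_j)$, however, we get $p_j = q_j/(1 + \epsilon_j)$. If we presume $\sigma$ small enough to make the distribution of $\epsilon_j$ outside $(-1, 1)$ negligible, then we can use the series expansion to approximate

$$p_j = \frac{q_j}{1 + \epsilon_j} \approx q_j\left(1 - \epsilon_j + \epsilon_j^2 - \epsilon_j^3 + \epsilon_j^4 - \cdots\right).$$

Under normal expectation, the odd-power terms drop out (so their signs don’t matter) and we get

$$\mathsf{E}[p_j] \approx q_j\left(1 + \sigma^2 + 3\sigma^4 + \cdots\right).$$

This credits $p_j$ as being greater than $q_j$ in expectation. Provided the projections for the substituted indices were generally slightly conservative, this has hope of correcting them.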
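The size of this correction can be checked numerically. The sketch below assumes the multiplicative form $q_j = p_j(1 + \epsilon_j)$ with $\epsilon_j \sim \mathcal{N}(0, \sigma^2)$ and integrates $1/(1 + \epsilon)$ against the normal density only over $(-0.9, 0.9)$, which is one way of making concrete the presumption that mass near and beyond $\pm 1$ is negligible. The cutoff, the value $\sigma = 0.1$, and the function names are illustrative assumptions.

```python
import math


def normal_pdf(x, sigma):
    """Density of N(0, sigma^2) at x."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def truncated_mean_inverse(sigma, cutoff=0.9, steps=200_000):
    """Midpoint-rule integral of pdf(x) / (1 + x) over (-cutoff, cutoff)."""
    h = 2.0 * cutoff / steps
    total = 0.0
    for i in range(steps):
        x = -cutoff + (i + 0.5) * h
        total += normal_pdf(x, sigma) / (1.0 + x)
    return total * h


sigma = 0.1
numeric = truncated_mean_inverse(sigma)        # truncated E[1 / (1 + eps)]
series = 1.0 + sigma ** 2 + 3.0 * sigma ** 4   # leading terms of the expansion
```

For $\sigma = 0.1$ the two values agree to within about $2 \times 10^{-5}$, which is the size of the next series term $15\sigma^6$, and both exceed $1$.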

Already, however, we have traipsed over some pitfalls of methodology. One is that the normal expectation

$$\mathsf{E}\!\left[\frac{1}{1 + \epsilon_j}\right] \text{ does not converge}$$

regardless of how small $\sigma$ is. For any $\sigma > 0$, regions around the pole $\epsilon_j = -1$ get some fixed positive probability. Another is the simple paradox of our second form saying:

$q_j$ is an unbiased estimator of $p_j$, but $1/q_j$ is not an unbiased (or even finite) estimator of $1/p_j$.

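A third curiosity comes from the fourth error form, $\log q_j = \log p_j + \epsilon_j$. It gives $q_j = p_j e^{\epsilon_j}$, so $p_j = q_j e^{-\epsilon_j}$. We have

$$\mathsf{E}\left[e^{-\epsilon_j}\right] = e^{\sigma^2/2}$$

exactly, without approximation. Again the sign of $\epsilon_j$ does not matter. So we get

$$\mathsf{E}[p_j] = q_j\, e^{\sigma^2/2} > q_j.$$

But by the original fourth equation we get

$$\mathsf{E}[q_j] = p_j\, e^{\sigma^2/2} > p_j.$$

So we have $\mathsf{E}[p_j] > q_j$ and $\mathsf{E}[q_j] > p_j$, with both expectations being over the same noise terms. This is like the famous Lake Wobegon syndrome. What it indicates is the need for care in where and how to apply these error representations.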
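The two-sided inflation is easy to confirm numerically. This sketch assumes a multiplicative log-normal error, $q_j = p_j e^{\epsilon_j}$ with $\epsilon_j \sim \mathcal{N}(0, \sigma^2)$, and uses plain midpoint-rule integration. The value $\sigma = 0.3$ and the helper names are illustrative assumptions.

```python
import math


def normal_pdf(x, sigma):
    """Density of N(0, sigma^2) at x."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def expect(f, sigma, width=12.0, steps=200_000):
    """Midpoint-rule approximation of E[f(eps)] for eps ~ N(0, sigma^2)."""
    lo = -width * sigma
    h = 2.0 * width * sigma / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += f(x) * normal_pdf(x, sigma)
    return total * h


sigma = 0.3
exact = math.exp(sigma ** 2 / 2.0)                      # closed form e^{sigma^2 / 2}
mean_exp_minus = expect(lambda e: math.exp(-e), sigma)  # multiplier in E[p_j] = q_j * E[e^{-eps}]
mean_exp_plus = expect(lambda e: math.exp(e), sigma)    # multiplier in E[q_j] = p_j * E[e^{+eps}]
```

Both multipliers match $e^{\sigma^2/2} \approx 1.046$, so each side appears, in expectation, to exceed the other.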

## Open Problems

Have you seen this idea of directly testing (un)predictability in the literature? Might it improve the currently much-debated statistical tests for quantum supremacy?

Which error model seems most likely to apply? Where have the paradoxes in our last section been noted? **Update 1/20:** The answer appears to be error model 5 with the Brier score but zeroing the weight of the best move, after runs in the past month on data obtained via UB’s Center for Computational Research (CCR).

[some wording tweaks, added update]
