A riff on writing style and rating systems

 Cropped from source

Mark Glickman is a statistician at Harvard University. With Jason Brown of Dalhousie University and Ryan Song also of Harvard—we’ll call them GBS—he has used musical stylometry to resolve questions about which Beatle wrote which parts of which songs. He is also a nonpareil designer of rating systems for chess and other games and sports.

Today we discuss wider issues and challenges arising from this kind of work.

In fact, we’ll pose a challenge right away. Let’s call it The GLL Challenge. Many posts on this blog have both our names. In most of them the writing is split quite evenly. Others like this are by just one of us. Can you find regularities in the style of the single-author ones and match them up to parts of the joint ones?

Most Beatles songs have single authors, but some were joint. Almost all the joint ones were between John Lennon and Paul McCartney, and in a number of those there are different accounts of who wrote what and how much. Here are examples of how GBS weighed in:

• Although the 1962 song, “Do You Want to Know a Secret?” was credited as “Lennon/McCartney” and even as “McCartney/Lennon” by a band who covered it in 1963, it has long been agreed as mostly by Lennon, as labeled on this authorship list. GBS confirm this.

• The two composers differed, however, in their accounts of “In My Life” and it has taken GBS to credit it all to Lennon with over 98% confidence.

• The song “And I Love Her” is mainly by McCartney, but GBS support Lennon’s claim to have written the 16-syllable bridge verse.

• Lennon said “The Word” was mainly his, but GBS found McCartney’s tracks all over it.

## Tell Me Why Baby It’s You

To convey how it works, let’s go back to the GLL Challenge. I tend to use longer words and sentences, often chaining further thoughts within a sentence when I could have stopped it at the comma. The simplest approach is just to treat my sole posts as “bags of words” and average their length. Do the same for Dick’s, and then compare blocks of the joint posts. The wider the gap you find in our sole writings, the more confidently you can ascribe blocks of our joint posts that approach one of our word-length means or the other.

For greater sophistication, you might count cases of two consecutive multisyllabic words, especially when a simple word like “long” could have replaced the second one. Then you are bagging the pairs of words while discarding information about sentence structure and sequencing. An opposite approach would be to model the probability of a word of length ${n}$ following a whole sequence of words of lengths ${n_1,n_2,\dots,n_r}$. This retains sequencing information even if ${r}$ is small because one sequence is chained to the previous one.

GBS counted pairs—that is, transitions from one note or chord to another—but did not analyze whole musical phrases. The foremost factor, highlighted in lots of popular coverage this past month, is that McCartney’s transitions jump around whereas Lennon’s stay closer to medieval chant. Although GBS covered songs from 1962–1966 only, the contrast survives in post-1970 songs such as Lennon’s “Imagine” and “Woman” versus McCartney’s “Live and Let Die” and the refrain of “Band on the Run.”

To my ears, the verses of the last creep like Lennon, whereas Lennon’s “Watching the Wheels” has swoops like McCartney. Back when they collaborated they may have taken leaves from each other, as I sometimes channel Dick. The NPR segment ended with a query by Scott Simon about collaborative imitation to Keith Devlin, who replied:

For sure. And that’s why it’s hard for the human ear to tell the thing apart. It’s also hard for them to realize who did it and this is why actually the only reliable answer is the mathematics because no matter how much people collaborate, they’re still the same people, and they have their preferences without realizing it. [Lennon’s and McCartney’s] things come together—that works—but they were still separate little bits. The mathematics isolates those little bits that are unique to the two people.

GBS isolated 149 bits that built a confident distinguisher of Lennon versus McCartney. This raises the specter of AI revealing more about us than we ourselves can plumb, let alone already know. It leads to the wider matter of models for personnel evaluation—rating the quality of performance—and keeping them explainable.

Glickman created the rating system Glicko and partnered in the design of URS, the Universal Rating System. Rather than present them in detail we will talk about the problems they intend to solve.

The purpose is to predict the how a player ${P}$ will do against an opponent ${O}$ from the difference in their ratings ${R_P}$ and ${R_O}$:

$\displaystyle y = f(x) \qquad\text{where}\qquad x = R_P - R_O.$

Here ${0 \leq y \leq 1}$ giving the probability for ${P}$ to win, or more generally the percentage score expectation over a series of games. The function ${f(x)}$ should obey the following axioms:

$\displaystyle \begin{array}{rcl} &&f(-x) = 1 - f(x);\\ &&f(x) \rightarrow 1 \text{ as } x \rightarrow \infty, \text{ so } f(x) \rightarrow 0 \text{ as } x \rightarrow -\infty;\\ &&f'(x) \text{ is defined and maximum at } x = 0. \end{array}$

The last says that the marginal value of extra skill tails off the more one is already superior to one’s opponent. Together these say ${f(x)}$ is some kind of sigmoidal curve, like the red or green curve in this graphic from the “Elo Win Probability Calculator” page:

To use the calculator, pop in the difference as ${x}$, choose the red curve (for US ratings) or green curve (for international ratings), and out pops the expectation ${y}$. What could be simpler? Such simplicity and elegance go together. But the paradox—a kind of “Murphy’s Law”—is:

Unless the players are equally rated, the projection is certainly wrong. It overestimates the chances of the stronger player. Moreover, every projection system that obeys the above axioms has the same defect.

Here’s why: We do not know each rating exactly. Hence their difference ${x}$ likewise comes with a ${\pm \epsilon}$ component. Thus our projection really needs to average ${f(x+\epsilon)}$ and ${f(x-\epsilon)}$ over a range of ${\epsilon}$ values. However, because ${f}$ is concave for ${x > 0}$, all such averages will be below ${f(x)}$.

We might think we can evade this issue by using the curves

$\displaystyle f_{\epsilon}(x) = \frac{1}{2}(f(x + \epsilon) + f(x - \epsilon)).$

This shifts the original ${f(x)}$ curve left and right and averages them. Provided ${\epsilon}$ is not too big, ${f_{\epsilon}}$ is another sigmoid curve. Now define ${f_*}$ by aggregating the functions ${f_{\epsilon}}$, say over ${\epsilon}$ normally distributed around ${0}$. Have we solved the problem? No: ${f_*}$ still needs to obey the axioms. It still has sigmoid shape concave above ${x = 0}$. Thus ${f_*(x)}$ will still be too high for ${x > 0}$ and too low for ${x < 0}$. The following "Law"—whom to name it for?—tries not to be hyperbolic:

All simple and elegant prediction models are overconfident.

Indeed, Glickman’s own explanation on page 11 of his survey paper, “A Comprehensive Guide to Chess Ratings,” is philosophically general:

At first, this consistent overestimation of the expected score formula may seem surprising [but] it is actually a statistical property of the expected score formula.

To paraphrase what he says next: In a world with total ignorance of playing skill, we would have to put ${f_0(x) = 0.5}$ for every game. Any curve ${y = f(x)}$ comes from a model purporting pinpoint knowledge of playing skill. Our real world is somewhere between such knowledge and ignorance. Hence we always get some interpolation of ${f(x)}$ and the flat line ${y = 0.5}$. In chess this is really an issue: although both the red and green curve project a difference ${x = 200}$ to give almost 76% expectation to the stronger player, observed results are about 72% (see Figure 6 in the survey).

## Newtonian Ratings and Grothendieck Nulls

The Glicko system solves this problem by giving every player ${P}$ a rating ${R_P}$ and an uncertainty parameter ${\epsilon_P}$. Instead of creating ${f_{\epsilon}}$‘s and ${f_*}$ (or etc.) it keeps ${\epsilon}$ a separate parameter. This solves the problem by making the prediction ${y}$ a function of ${(\epsilon_P,\epsilon_O)}$ as well as ${x = R_P - R_O}$, with optional further dependence on how the ${(R_P,\epsilon_P)}$ “glob” may skew as ${R_P}$ grows into the tail of high outliers and on other dynamics of the population of rated players.

However, Newton’s laws behave as though bodies have pinpoint mass values at their centers of gravity, no matter how the mass may “glob” around it. Trying to capture an inverse-square law for chess ratings leads to a curious calculation. Put

$\displaystyle f(x) = 1/(Ax^2 + Bx + C)$

for ${x \geq 0}$. Taking ${C = 2}$ gives ${f(0) = 0.5}$ and allows gluing ${f(-x) = 1 - f(x)}$. Simplifying ${\frac{1}{2}(f(x+\epsilon) + f(x-\epsilon)) - f(x)}$ gives a fraction with denominator ${f(x)f(x+\epsilon)f(x-\epsilon)}$ and numerator ${N(x)}$ given by

$\displaystyle N(x) = 3A^2 \epsilon^2 x^2 + 3AB\epsilon^2 x + (B^2 \epsilon^2 - 2A\epsilon^2 - A^2 \epsilon^4).$

Then taking ${B = \sqrt{2A}}$ cancels out the two bigger terms in the constant part, leaving the numerator as

$\displaystyle N(x) = 3A^2 \epsilon^2 x^2 + 3\sqrt{2}A^{3/2}\epsilon^2 x - A^2 \epsilon^4.$

David Mumford and John Tate, in their 2015 obituary for Alexander Grothendieck, motivated Grothendieck’s use of nilpotent elements via situations where one can consider ${\epsilon^2}$ to be truly negligible—that is, to put ${\epsilon^2 = 0}$.

Here we have an ostensibly better situation: In our original expression for ${f(x)}$, the coefficient ${A}$ of ${x^2}$ has to stay pretty small. The linear term for ${N(x)}$ has coefficient ${A^{3/2}\epsilon^2}$ and the ${x^2}$ term has ${A^2 \epsilon^2}$. Thus if we could work in an algebra where

$\displaystyle A^{3/2}\epsilon^2 = 0,$

then the pinpoint value ${f(x)}$ and all averages ${\frac{1}{2}(f(x+\epsilon) + f(x-\epsilon))}$ for uncertainty would exactly agree. No separate parameter ${\epsilon_P}$ would be needed.

Alas, insofar as the real world runs on real algebra rather than Grothendieck algebra, we have to keep the numerator ${N(x)}$ and the denominator ${D(x)}$. One can choose ${A}$ to approximate the above green or red chess rating curves in various ways, and then compare the discrepancy for various combinations of ${x}$ and ${\epsilon}$. The discrepancies for my “Newtonian” ${f(x)}$ tend about twice as great as for the standard curves. That is too bad. But I still wonder whether the above calculation of the prediction discrepancy ${N(x)/D(x)}$—and its curious ${A^{3/2}}$ feature—has further uses.

## Open Problems

What will AI be able to tell from our “track records” that we cannot?

Several theories of test-taking postulate a sigmoid relationship between a student’s ability ${x}$ and his/her likelihood ${f(x)}$ of getting a given exam question right. Changing the difficulty of the question shifts the curve left or right. For a multiple-choice question with ${m}$ choices the floor might be ${1/m}$ rather than ${0}$ to allow for “guessing” but otherwise, similar axioms hold. Inverting the various ${f(x)}$ gives a grading rubric for the exam. Do outcomes tend to be bunched toward the middle more than predicted? Are exam “ratings” (that is, grades) robust enough—as chess ratings are—to tell?

Aggregating the ${f(x)}$ curves for various questions on an exam involves computing weighted averages of logistic curves. Is there literature on mathematical properties of the space of such averaged curves? Is there a theory of handling discrepancy terms like my ${N(x)}$ above?

[some word tweaks and typo fixes]

August 26, 2018 4:45 am

Ken, do you still think God’s chess rating should be around 3600, given recent developments (several engines above 3400, Leela Chess Zero rated almost as high with a very different playing style, etc.)? Have you compared Leela moves to Stockfish moves in your cheating studies? Thanks.

• August 26, 2018 10:52 am

Exactly what I think is that a slightly randomized version of AlphaZero can score 25% against any strategy, i.e., hold a draw every other game. Then all strategies would have Elo no more than 200 points higher from games against it. (Appropriate technical quip: “God” is a universal quantifier.) 🙂

August 27, 2018 4:04 am

Thanks! Do you think the rating model breaks down at such high strength, e.g. could there be a strategy S (say Stockfish 20 on futuristic hardware) that also holds A0 to 25%, but that in turn scores less than 50% against “God”?

Re the technical quip, my first thought was that (generalized) chess is PSPACE-complete, so “God” would be nested quantifiers of arbitrary depth. But of course, a single universal quantifier over the natural numbers (i.e. a halting oracle) could also solve chess.

• August 27, 2018 11:01 am

More generally, I wonder if there are well-documented cases where Engine A beats Engine B, B beats Engine C, yet C beats A in matches, owing to strategic (stylistic) differences. I don’t know whether CCRL or anyone has measured such “triangles” in the rating system. One point, however, is that the randomness in my hypothetical version (call is RalphaZero?) offsets the effectiveness of particular strategies against others. One could in theory apply the randomness to whole strategies (i.e., mappings from positions to moves) rather than just the best yea-many moves in a given position—this would bring in how matrix games achieve a guarantee regardless of opposing strategy.

The universal quantifier I have in mind is over strategies—one exponential higher than the quantifiers over moves and game histories involved in the PSPACE arguments. What I mean is that there is no single “perfect” player—rather one defines a quantity such as an opposing rating ceiling by quantifying over all possible opposing strategies.

August 27, 2018 4:07 am

Additional comment: I don’t know if you’re following it or are interested, but LC0 (Leela Chess Zero) now seems to be approaching Alpha Zero’s strength, or has possibly even reached it (I’m not convinced of the latter yet). See recent forum posts at lczero.org for more info.

• August 27, 2018 11:20 am

In early June my first indications didn’t point to more than 2500 for LC0, but by mid-July and the WCCC it was at least 2950 by my IPR measure. That’s also where my measure seems to start losing distinguishing power—I think my values up to 3100 are OK but past that it’s mush. Indeed, if they make an easier-to-install non-GPU Linux/UNIX version of LC0 just for analysis not training, then I will add it to my panel of reference engines for large-scale data-taking.

I meant to emphasize, too, one aspect of the near-perfect linear fits in my “When Data Serves Turkey” post. You could get the same fit just from analyzing the games in the Elo 1800–2200 range, say, where the engine depths I use are far superior enough to be clearly authoritative. The charts in that post demonstrate than extrapolating this out to 2800 is perfect. Thus one cannot simply say that my 3400 intercept is because of my system losing mojo when analyzing games at 2800+. (There are of course other issues shown in that post, and my ironic upshot is that although IMHO either the “3200” or “3400” answer could pass peer review, both are wrong. Indeed if one used the first-line matching figures alone then 100% would be the perfection point and that would be above 4000. I believe a “true” intrinsic measure needs to blend those two with a notion of “challenge created” based on the depth-of-thinking and playing-on-impulse ideas set out here and here, but I am right now having to resort to a very heavy-handed approach to taming the resulting model.)

September 7, 2018 12:58 am

Ken, I wonder if you’ve tried play.lczero.org in easy mode: that means you’re playing directly against the neural network policy evaluation with no search at all. In other words pure positional judgment, yet it easily beats masters if they’re not careful. Sort of the opposite of what we expect from an engine, fantastic positional sense but lousy tactics instead of the other way around. Someone a while back said its likely true strength is in the 2000 range. I’d be interested in an IPR measurement if you’re up for that sometime. It might be a good “impulse vs thinking” test. It beats me easily but I’ve never been anywhere near 2000.

I was able to get CPU Leela running with some effort on an i5 linux box some months back, but got only around 30 nps, way too slow to play really well. Can you get hold of a GPU machine for your chess stuff? Leela is important, I think. It is already influencing high level human play from what I hear.

3. August 28, 2018 1:14 am

The GLL challenge would be quite an easy task for a careful human reader. Having read several posts by both, I can say that the writing styles of Dick and Ken are vastly different. Here is another fun observation (in addition to observation on the sentence lengths made by Ken): All the “Oh well”s have been written by Dick, and all the “ostensibly”s have been written by Ken 🙂

• August 28, 2018 7:17 pm

I can claim every “ostensibly” but a few of the “Oh well”s have been mine as well, often to finish a thought expressed by phone.

4. August 28, 2018 6:30 am

Re weighted averages of logistic curves, there’s a large literature on Bayesian approaches to logistic regression that might be relevant. That includes both finite averaging over model choices, and simulation-based integration over the continuous posterior for a single model. I don’t off hand know of something that talks about the properties of the space of averaged models though.

• August 28, 2018 6:34 am

And to continue the theme, Bayesian approaches to measurement error might be something to look at re the overconfidence problem.

5. August 29, 2018 4:47 pm

I appreciate the simplifying you do, but surely you’ve oversimplified here. The fact that ratings are uncertain doesn’t mean they are the average of c+eps and c-eps. The distribution is obviously more complicated, and this just shows that it mustn’t be symmetric around its mean.

• August 29, 2018 9:18 pm

Fair enough. I did write in the last main section about “optional further dependence on how the (R_P,\epsilon_P) “glob” may skew as R_P grows into the tail of high outliers and on other dynamics of the population of rated players.” Added: The handling of epsilon resembles how concepts like border rank are defined.