Data science of many things including citations

 Amazon India source

Paul Allison is an emeritus professor of sociology at the University of Pennsylvania and the founder and president of the company Statistical Horizons. They provides short courses and seminars for statistical training.

Today we have a short seminar on statistics and horizons of effectiveness.

Our first topic is about citations. Did we say citations? What are we in research more interested in than citations? Allison co-wrote a paper on a “law” claimed by Alfred Lotka about how the number of citations behaves. Full details in a moment, but two upshots are:

• Over half of the papers are contributed by a few highly prolific authors.

• One-shot authors are roughly ${61\%}$ of the population but account for only a tiny proportion of the literature.

Allison’s Paper

Allison co-wrote his paper, “Lotka’s Law: A Problem in Its Interpretation and Application,” with Derek de Solla Price, Belver Griffith, Michael Moravcsik, and John Stewart in 1976. Lotka’s law, which is related to George Zipf’s famous law, alleges that over any time period in any scientific or literary field, the number ${a(n)}$ of authors with ${n}$ contributions obeys

$\displaystyle a(n) = \frac{C}{n^2},$

where ${C}$ is independent of ${n}$. This suggests a maximum of ${n_{max} = \sqrt{C}}$ on the range of ${n}$, since higher ${n}$ give ${a(n) < 1}$, but there is also a probabilistic interpretation: The law says that the total number of papers at ${n}$ is ${C/n}$, and that gives a positive constant expectation even when ${n_{max}}$ varies as ${C}$. Both cases yield that out of the total number ${T}$ of papers, which has ${T = \Theta(C\log(C))}$, over half of them are contributed by a vanishing percentage of highly prolific authors. Meanwhile, one-shot authors are roughly ${6/\pi^2 \approx 61\%}$ of the population but account for only a ${1/\log(T)}$ proportion of the literature.

However, the paper also remarks on a third case, namely making ${n_{max}}$ a fixed constant—since human time is finite in any field. This puts a sharper horizon on Lotka’s Law and changes the inferences made as the horizon is approached. The paper shows how the ${n_{max}}$ factor intrudes on other inferences they would like to draw, even between the former two cases. And never mind morerecent evidence of breakdowns in Lotka’s law.

Horizons

We have mentioned de Solla Price before in regard to his founding scientometrics. In practice this is mainly concerned with citation analysis and other productivity metrics, but its widely-quoted definition, “the science of measuring and analyzing science,” strikes us as broader. We feel there should be a component for measuring limitations of the effectiveness of the science one is practicing.

Now of course in statistics there are longstanding measures of statistical power and experiment acuity and of noise in general. Nevertheless, the cascading “(non-)reproducibility crisis” argues that more needs to be addressed. The development of software tools to counter “p-hacking” exemplifies a new layer of scientific modeling to do so—which could be called introspective modeling.

I will exemplify with two “horizons” that are apparent in my own statistical chess research. One involves estimating the Elo rating of “perfect play.” The other involves the level of skill at which my data may cease to be effective. The former has captured popular imagination—it was among the first questions posed to me by Judit Polgar in a broadcast during the 2016 world championship match—but the latter is my concern in practice. We will see that these may be the same horizon, approached either by looking down from the stars or up from the road.

I am not the first to do this kind of work or face the issue of its resolving power. Matej Guid, Artiz Perez, and Ivan Bratko made it the sole topic of a 2008 followup paper to their 2006 study of all games in world championship matches. But their indicators strike me as weak. Most simply, they do not try to estimate where their horizon is, just argue that their results are not wholly beyond it. We will try to do more—but speculatively. The first step is rock-solid—it is a big surprise I found last month.

More Data, More Resolution

I use strong chess programs to take two main kinds of data. My full model uses programs in an analysis mode that evaluates all available moves to the same degree of thoroughness and takes roughly 4–6 hours per game. My quicker “screening” tests use programs in their normal playing mode, which gives full shrift only to what’s considered the best move, but shaves the time down to 10–15 minutes. For my AAAI 2011 paper with Guy Haworth, I used over 400,000 positions from 5,700 games at rating levels from Elo 1600 to 2700 only, all run on my office and home PCs. Below Elo 2000 the available data was so scant that noise is evident in the paper’s table.

Since then, many more games by lower-rated players are being archived—much thanks to the greater availability of chessboards that automatically record moves in the standard PGN format, and to an upswell in tournaments, for youth in particular. Last year, thanks to the great free bandwidth granted by my university’s Center for Computational Research (CCR), I took data in the quicker mode from over 10,600,000 moves from just over 400,000 game-sides (counting White and Black separately) in every tournament compiled by ChessBase as well as some posted only by The Week in Chess (TWIC) or provided directly by the World Chess Federation (FIDE). My two main test quantities are:

• The percentage of the computer’s best move being the one the player chose (“MM%”).

• The average error judged by the computer per move, scaling down large differences (“ASD”).

My 2011 paper found strong linear relations of these quantities to the players’ rating, and great ASD fits on a 3-million-move data set are shown graphically in this post. With MM% and my new 2018 data, here is what I see when I limit to the 1600–2700 range, grouping in “buckets” of 25 Elo points. All screenshots are taken with Andrew Que’s Polynomial Regression applet. They all show data taken with Stockfish 9 run to search depth at least 20 and breadth at least 200 million nodes; the similar data for the chess program Komodo 11.3 gives similar results.

Even with vastly more data, there still does not appear any reason to reject the simple hypothesis that the relation to rating is linear. Not only is ${R^2 > 0.99}$, the quality of fit is terrific. The noise under Elo 2000 is minimal.

But now I have over 14,000 moves in individual buckets clear down to the FIDE minimum 1000 rating; only the 2750 bucket with 11,923 moves and the 2800-level bucket with 6,340 (from just a handful of the world’s elite players) lag behind. When those buckets are added, here is what we see:

The linear hypothesis is notably less tenable. Instead, a quadratic polynomial fits supremely well:

Thus it seems I must admit a nonlinearity into my chess model. This may not be just about slightly improving my model’s application to players at the ends of the rating spectrum. Philosophically, nonlinearity can be a game-changer: the way Newtonian physics is fine for flying jets all around the globe but finding your neighbor’s house via GPS absolutely requires Einstein.

The Horizon Issue

Let us flip the axes so that ${X}$ is MM% and ${Y}$ is rating. Then the intercept of ${X = 100\%}$ would give the rating of perfect agreement with the computer. Well, here is what we see:

Having the rating of perfect agreement be about 1950—which is a amateur A-level in the US—is ludicrous. The greater import is how the increase stops at Elo 3000 with matching just under 75%. The serious implication I draw is that this helps locate the horizon of effectiveness of the data and my methods based on it. Meanwhile, I’ve had the sense from applications that my full model based on smaller higher-quality data is coherent up to about 3100 but cannot tell differences above that.

Indeed, there is a corroborating indicator of this horizon: The top chess programs, or even different (major) versions of the same program, don’t even match each other over 75% with regularity. Moreover, the same program will fairly often change to a different move when left running for more time or to a greater search depth. If it didn’t change, it wouldn’t improve. Thus my tests, which have no foreknowledge of how long a program used to cheat was running and on how powerful hardware, cannot expect to register positives at a higher rate. The natural agreement rates for human players range from about 35% for novices to upwards of 60% for world champions.

The Average-Error Case

The fit to average scaled error-per-move (ASD) shows the other side of the horizon issue. The ASD measure is more tightly correlated to rating—as the graphs in the above-mentioned post suffice to indicate. Here is the corresponding graph on the new data, again with flipped axes:

Only under 1250 Elo does perfect linearity seem to be countermanded. The issue, however, is at the other end. Committing asymptotically zero error seems to be a more acute indicator of perfection than 100% agreement with a strong program. However, the ${Y}$-intercept there is given as a rating under 3300, whereas computer programs have been reliably rated above 3400, and very recently over 3500. Thus we’d appear to have computers rated higher than perfection.

One can move from the above indication of my setup losing mojo before 3000 to allege that it is insufficient for fair judgment of human players above 2500, say, so that the intercept is not valid. My counter-argument is that the same intercept is also a robust extrapolation from the range 1500 to 2500 where the linear fit is nearly perfect and the computer’s sufficiency for authoritative judgment of the players is beyond doubt.

Nevertheless, the above “game-changer” for the move-matching percentage suggests the same for ASD. A quadratic fit to ASD produces the following results:

Now the ${Y}$-intercept at ${X = 0}$ is within error bars of 3500, in agreement with the 3475 figure currently used in my full model and less starkly under the measured ratings.

One More Riff

Let us think of move-matching for a given rating ${R}$ as a flip of a biased coin with heads probability ${p = p_R}$. If we plot ${R}$ not against ${p}$ but against ${p\cdot p(1-p)}$, we recover a nearly perfect linear fit (the plot shows ${4p^2 (1-p)}$):

Well, ${p(1-p)}$ is the variance of one coin flip. Why should multiplying ${p}$ by this variance recover a linear fit in the mean? Only multiplying ${p}$ by the square root of ${p(1-p)}$ still leaves a significantly non-linear plot.

Recovering ${p}$ from ${p^2(1-p)}$ needs solving a cubic equation. The maximum value is at ${p = \frac{2}{3}}$ and is ${\frac{4}{27}.}$ Multiplying by ${4}$ as in the plot makes ${\frac{16}{27} \approx 0.592593}$ the maximum solvable value. This regression line associates this to a rating of only 2860. This suggests a tangibly lower horizon. It also seems contradicted by the fact of Magnus Carlsen maintaining a rating over 2860 from January 2013 through June 2015, yet his engine agreement did not approach 66.7%.

Open Problems

We’ve connected the horizon of perfect play to whether the fundamental relationship of rating to agreement with strong computer programs is linear, quadratic, or indirectly cubic. Which relationship is true? What further tests may best ascertain the range of effectiveness of inferences from these data?