Michael Aschbacher is a professor of mathematics at Caltech. He was a leading figure in the classification of finite simple groups (CFSG).
Today Ken and I ask a simple question about proofs that appeal to CFSG.
The CFSG was originally announced in 1983 amid articles totaling tens of thousands of journal pages. The case of “quasithin” groups was however found to be incompletely analyzed, and it took until 2004 to close it in papers by Aschbacher with Stephen Smith. In a survey for AMS Notices that same year, Aschbacher reviewed what that means for confidence in the whole classification.
Here we are not playing on any doubt about the proof. Instead we focus on its power: What simply-stated facts about groups can one obtain using CFSG that are unknown without it, or known only in weaker forms? We could go through a whole book by Smith on applications of CFSG, but we’ll just choose one kind of example, about how many elements are needed to generate a group.
Generation of a group is connected closely with elementary approaches to group isomorphism.
Definition 1 For a finite group G, let d(G) be the cardinality of the smallest generating set of G.
It is easy to prove that d(G) is at most log_2 n for any group G of order n. Adding any new generator at least doubles the size of the group generated, through the products of the new generator with existing elements. The bound is tight for the abelian group (Z/2Z)^k. For simple groups, in contrast, we have:
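The doubling argument is easy to see in miniature. Here is a small Python sketch (a brute-force illustration, not part of any proof) using the group (Z/2Z)^3 under coordinatewise XOR:

```python
from itertools import combinations, product

def closure(gens, op, identity):
    """BFS closure of a generating set under the group operation (no inverses
    needed: in a finite group, inverses are positive powers)."""
    elems, frontier = {identity}, [identity]
    while frontier:
        new = []
        for g in frontier:
            for s in gens:
                h = op(g, s)
                if h not in elems:
                    elems.add(h)
                    new.append(h)
        frontier = new
    return elems

xor = lambda a, b: tuple(x ^ y for x, y in zip(a, b))
idn = (0, 0, 0)
cube = list(product([0, 1], repeat=3))   # the group (Z/2Z)^3, of order 8

# any two elements generate at most 2^2 = 4 of the 8 elements...
assert all(len(closure(list(pair), xor, idn)) <= 4
           for pair in combinations(cube, 2))
# ...so d(G) = 3 = log2(8), attained by the standard basis
assert len(closure([(1, 0, 0), (0, 1, 0), (0, 0, 1)], xor, idn)) == 8
```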
Theorem 2 Assuming CFSG, if G is a simple group, then d(G) ≤ 2.
If the group is an abelian simple group, then it is cyclic of prime order, so d(G) = 1. The above theorem was a longstanding conjecture, and is only proved now by using the CFSG. The proof is quite direct: for each family of non-abelian simple groups in the classification, one must check that its members are all 2-generated.
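For the smallest non-abelian simple group this is concrete: A5, of order 60, is generated by a 5-cycle and a 3-cycle. A quick Python check (permutations as tuples, with p[i] the image of i; this generator choice is just one standard pair):

```python
def generated_by(gens):
    """Subgroup generated by permutations: BFS over products of generators
    (sufficient in a finite group, where inverses are positive powers)."""
    n = len(gens[0])
    idn = tuple(range(n))
    group, frontier = {idn}, [idn]
    while frontier:
        new = []
        for g in frontier:
            for s in gens:
                h = tuple(s[g[i]] for i in range(n))   # the composition s∘g
                if h not in group:
                    group.add(h)
                    new.append(h)
        frontier = new
    return group

five_cycle = (1, 2, 3, 4, 0)    # the 5-cycle (0 1 2 3 4)
three_cycle = (1, 2, 0, 3, 4)   # the 3-cycle (0 1 2)
order = len(generated_by([five_cycle, three_cycle]))   # 60 = |A5|, so d(A5) = 2
```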
I wondered the other day whether we could prove something non-trivial about the generation of simple groups without the CFSG. The best I could see quickly is:
Theorem 3 Let G be a non-abelian simple group of order n. Then G is generated by at most

log_5 n ≤ (1/2) log_2 n

elements.
Of course this does not use the CFSG. It beats the “trivial” bound by a factor of 2 in the case of simple groups.
Proof: Now G must have an element x of prime order p ≥ 5. This follows since a group whose order is divisible only by 2 and 3 is not simple unless it is abelian, and Cauchy’s theorem supplies an element of order p once a prime p ≥ 5 divides the order. Then we argue that G is generated by conjugates of x: the conjugates generate a normal subgroup, which must be all of G, so whenever they generate only a proper subgroup H, some conjugate lies outside H and the following lemma applies.
Lemma 4 Suppose that H is a subgroup of G. Also suppose that x is an element of G of prime order p and x is not in H. Then

⟨H, x⟩

has cardinality at least p·|H|.
Proof: Of course ⟨H, x⟩ is a subgroup of G that contains H. We claim that the cosets

H x^i

for i = 0, 1, …, p−1 are all distinct. Suppose that

H x^i = H x^j with 0 ≤ i < j ≤ p−1.

Then x^{j−i} is in H. But since p is a prime, x^{j−i} generates ⟨x⟩, so x itself would be in H—a contradiction. This implies the lemma.
The CFSG is also used to give bounds on generators for general groups. In a case of simultaneous discovery of a theorem 29 years ago, Andrea Lucchini and Robert Guralnick independently proved the following theorem:
Theorem 5 (CFSG) For any finite group G, if for every prime p each Sylow p-subgroup P of G has d(P) ≤ r, then d(G) ≤ r + 1.
It follows that if e is the maximum such that p^e divides the order of G for some prime p, then d(G) ≤ e + 1. This bound too, however, requires CFSG. What bounds can we obtain without it?
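The corollary’s bound is computable from the order alone. A Python sketch by trial division (fine for small orders):

```python
def max_prime_exponent(n):
    """Largest e such that p**e divides n, over all primes p."""
    e_max, p = 0, 2
    while p * p <= n:
        e = 0
        while n % p == 0:
            n //= p
            e += 1
        e_max = max(e_max, e)
        p += 1
    if n > 1:               # a leftover prime factor has exponent 1
        e_max = max(e_max, 1)
    return e_max

# |A5| = 60 = 2^2 * 3 * 5 gives e = 2, so the theorem yields d(G) <= 3
bound = max_prime_exponent(60) + 1
```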
Fast-forwarding to a year ago, Lucchini includes Guralnick as a co-author on work analyzing the graph whose nodes are the members of G and in which {x, y} is an edge if x and y generate G. Since the identity is an isolated vertex unless G is cyclic, it is often omitted to leave what we’ll call the generating graph of G. When G is finite and simple, not only is this graph nonempty, but it is connected and has diameter 2.
The graph also figures into a 2016 paper by Lucchini with Peter Cameron and Colva Roney-Dougal, which has the broad title, “Generating sets of finite groups.” This paper has a lot of interesting computational results and does not seem to lean heavily on CFSG. So it may be a conduit for asking and answering some of our questions about the power of CFSG.
I assume that even without CFSG there is a much better bound on the number of generators for a non-abelian simple group. I have yet to find it. So the open problem is to send us a better citation.
And perhaps how to remove them
Elizabeth Croft is Dean of Engineering at Monash University in Australia. She was previously at the University of British Columbia in Canada and was the BC and Yukon Chair for Women in Science and Engineering.
Today Ken and I want to discuss barriers to gender equity in computer science.
Our central thesis is: One of the problems that makes it hard to attract women researchers is an environment of countless micro barriers. This is also the position in an article that Croft wrote earlier this year for the Sydney Morning Herald. The article is titled, “Masculine culture and micro barriers still major issues for women.” It argues that fundamental improvements in equity for STEM areas will come only after a change in culture.
Croft’s article includes several specific recommendations for universities and related institutions—we abbreviate those after the first:
All these are laudable, and articles like this show some progress, but there’s a distinction. Items 3, 4, and to a large extent 2 can be handled by “macro” measures: employment policies; pay equity standards; provision of personal leave and child care; family-friendly scheduling; bias-awareness training. Item 5, however, is “micro”—and so, really, is item 1.
The micro-barrier thesis has two further implications. One is known but insidious; the other we feel is often missed. Both are effects of the culture referenced above but on the widest of scales.
To show the first effect, we searched “tech startups” without quotes on Google Images. Many hits are icons of companies or diagrams but some are photos of real people. Here is a collage of the first hits seen with people, not counting two advertisements with a man on a ladder and several showing hands that are clearly male:
To be fair, in the next five hits is a photo from an article on women-led tech startups in Africa. Now here are the top results for “computer science graduates”:
The middle item in the top row is a stock image. Again to be fair, hits 11–30 showed more balance—as did searches on “computer science students” and “computer science graduate students.”
The broadcast reality remains that of a heavily male field. The reasons given here for “why aren’t more women involved in CS?” trace the dropoff to the advent of micro PCs as ‘boy toys’ in the 1980s. We agree with that origin. What we have now, unfortunately, adds a layer of self-perpetuation from the images that young people see.
The same degree of male pervasion shows up in language too. Is there a fix that doesn’t feel like trying to contain the sea?
The second effect is how the culture plays into male expectations in ways invisible to many, even to those who are champions of diversity. We feel we can best express this with a few examples from women’s groups themselves. Then it is clear that no harm was intended. We don’t mean to criticize, just raise some more awareness. We offer questions without having ideas for even half the answers.
A Picture
Let’s look at a site for diversity in computer architecture. The site starts with the following picture:
Then it adds: “In part one of our series on gender diversity within the subdiscipline of computer architecture, we present some data that provides signal on where our community stands today with respect to gender diversity.” So the purpose following the picture is clear, but here are our questions:
A Quote
We all like to add quotes to enhance our web sites—we do that all the time here at GLL. The TCS Women site is for advancing diversity in theoretical computer science. The front page starts with their mission statement and immediately transitions to an infamous sexist quote in mathematics:
The quotation is known from the book A Mathematician’s Miscellany by John Littlewood, who thought nothing of appending the parenthetical explanation “(She was very plain).” Again we have some questions:
My reaction was confusion on both the first and second points. Ken’s is that he finds both the quotation and its juxtaposition funny—but in a way that may be “manfunny.” In either case we fear it gives pride of place to male preoccupations when great female achievements could be hailed in that spot instead.
A Word
The term “badass” has recently emerged as a goal-word for women in tech. The original pejorative male-rooted meaning has turned around to describe women as strong, assertive, formidable, no-nonsense. An example of its embrace is the 2016 book My Badass Book of Saints: Courageous Women Who Showed Me How to Live. For many female achievers this is great, and we applaud them, but we have questions:
The whole situation calls to mind Pogo Possum’s famous saying: “We have met the enemy, and he is us.”
There are of course other aspects of the culture that are more overt. There are micro-aggressions exemplified by back-handed compliments (“you code well for a girl”) and widening criticism of a female colleague to the whole gender. Three female computer science students at MIT in 2014 discovered a self-perpetuating aspect of the culture, in that the audience for their “ask me anything” forum behaved in a way that exceeded what Croft terms a “locker room” in the Monash interview linked under her photo at top. So we are brought back to the main question:
What can be done to change the culture?
The one suggestion we’ve come up with is to recognize this as really the problem of achieving a plural culture in a smaller community. The meaning of diversity should refer not just to the community’s composition but to the recognition of diverse ways to be effective and succeed.
This is of course hard. In academia a computer science department is smaller and more uniform than the university as a whole. On the scale of a university, diversity of ways translates most readily to diversity of majors—and starting from a computer science context, that means away from CS. So the challenge is to create a diversity of norms for success without cutting away from the common course structure and grading standards.
As for how to achieve this, at this point we have only analogies. One comes from Ken’s take on the movie Ocean’s Eight, which is that the director Gary Ross and his co-writer Olivia Milch diverged from a straight emulation of the all-male “Rat Pack” in the previous “Ocean’s” movies. They were interviewed in an article that uses “badass” in its title but the qualities that emerge are described at the end as “relatable.” On Ken’s viewing, the eight women succeed more through their own culture than by adopting male culture.
Ken and I could go on with further examples of such micro barriers. Many focus on appearance, some on other issues. All are small issues. They are not extreme statements like: women should be
But I do wonder whether the barriers accumulate. Do they accumulate and tip the scales toward having fewer women in computer science? What do you think?
By the way we at GLL are well aware that most of our readers are male. I wish we could change that but oh well.
Otto Toeplitz was a mathematician who made key contributions to functional analysis. He is famous for many things, including a kind of matrix named after him.
Today we discuss one of his conjectures that remains open.
Over a century ago, in 1911 to be exact, Toeplitz proposed the inscribed square problem (ISP):
Does every Jordan curve contain an inscribed square?
More precisely, if C is a Jordan curve, then we want four distinct points on C that form a square. The dotted curve in Wikipedia’s figure has four inscribed squares.
The notion of a curve, now called a Jordan curve, is named for Camille Jordan. Sometimes called a plane simple closed curve, a Jordan curve is a non-self-intersecting continuous loop in the plane. More precisely it is the image of an injective continuous map of a circle into the plane. The famous Jordan curve theorem says:
Every Jordan curve divides the plane into two connected regions—one bounded and one unbounded.
This statement is one of those obvious ones that is actually hard to prove. The reason it is actually hard to prove is critical to our discussion of the ISP. The reason is that Jordan curves can be quite nasty. Here are two key issues with Jordan curves:

1. A Jordan curve can have infinite length.
2. A Jordan curve can have positive area.
Thus (1) says that a Jordan curve can be infinitely long. This is already a problem with our notion that a curve is just something that you draw with a pen. Or if you draw the curve with a pen, then it takes a very long time to finish drawing the curve. Such curves are strange, but they are perfectly fine examples of Jordan curves. Next (2) says that a curve can have positive area or measure. Curves with positive area are now called Osgood curves after William Osgood, who found the first example of such a curve. Intuitively, such a curve hardly seems to be a “curve,” but it is one nevertheless.
Even “nice” Jordan curves with finite length can be nasty.
Not very intuitive to me. Even the curve above, which exhibits neither issue (1) nor (2), is nasty looking. By the way, Jordan found a cool proof of the following:
If C is a Jordan curve with finite length, then C has zero area.
See if you can figure this out—we supply a proof at the end of this post.
So far the ISP is still open for arbitrary Jordan curves. It has been proved for convex curves and sufficiently smooth curves, but the question remains open in general. Even for polygonal curves—curves that come from polygons that are not self intersecting—it is non trivial. See the references below for some of the main results.
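For smooth convex curves, where the ISP is settled, an inscribed square can even be found numerically. Here is a brute-force Python sketch for an ellipse (our own illustration, not taken from any of the papers below): a square is determined by one diagonal, the other diagonal being the 90-degree rotation of the first about their common midpoint, so we scan pairs of sampled points as candidate diagonals.

```python
import math

def best_square(points):
    """Scan pairs of sampled curve points as a candidate square diagonal; the
    other two corners are midpoint +/- the rotated half-diagonal.  Return the
    smallest max-distance of those corners to the sampled curve."""
    def dist(x, y):
        return min(math.hypot(x - px, y - py) for px, py in points)
    n, best = len(points), None
    for i in range(n):
        px, py = points[i]
        for j in range(i + n // 4, i + 3 * n // 4):   # only roughly opposite points
            qx, qy = points[j % n]
            mx, my = (px + qx) / 2, (py + qy) / 2
            dx, dy = (qx - px) / 2, (qy - py) / 2
            err = max(dist(mx - dy, my + dx), dist(mx + dy, my - dx))
            if best is None or err < best:
                best = err
    return best

ellipse = [(2 * math.cos(2 * math.pi * k / 120), math.sin(2 * math.pi * k / 120))
           for k in range(120)]
err = best_square(ellipse)   # near 0: this ellipse has an inscribed square
```

For the ellipse x²/4 + y² = 1, symmetry puts one inscribed square at the vertices (±2/√5, ±2/√5), and err shrinks as the sampling is refined.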
Curiously, if a Jordan curve is nasty, then the ISP is true. That is:
Theorem: Suppose that a Jordan curve C has positive measure. Then there is an inscribed square whose vertices lie on C. Thus in this case the ISP is true.
The punch line here is that “bad” curves are well behaved with respect to having inscribed squares. The proof is based on a standard probabilistic argument. We will need Lebesgue’s density theorem. Suppose that E is a measurable subset of the plane. Now consider some small enough disk around a random point in the plane. Then either almost all of the disk is in E or almost none of it is. This “0-1” type law is a bit surprising, but it is quite useful. In particular, if E has positive measure, then almost all points in E have the density-1 property. Here is a nice comment on this from Terry Tao’s blog:
In other words, almost all the points x of E are points of density of E, which roughly speaking means that as one passes to finer and finer scales, the immediate vicinity of x becomes increasingly saturated with E. (Points of density are like robust versions of interior points, thus the Lebesgue density theorem is an assertion that measurable sets are almost like open sets. This is Littlewood’s first principle.)
Now let’s show that ISP is true for Jordan curves with positive measure.
Proof: Suppose that C is a Jordan curve that has positive measure. Then by Lebesgue’s density theorem there is an open disk D so that C ∩ D has measure almost that of D. We then argue that if we randomly select a small square in the disk D, then with probability almost 1 all four of its vertices lie on C. This follows by a standard union bound: each vertex misses C with probability almost 0, so the probability that some vertex misses C is still almost 0. This then shows that there must be a square that is inscribed in C, and that ISP is true for this curve.
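The union-bound step can be simulated. Below is a Python sketch with a stand-in for C ∩ D: the unit square minus a few tiny “holes,” so the set has measure very close to full; random small squares then have all four corners inside the set with probability near 1. The hole count, radii, and square size are all made-up illustration values:

```python
import math, random

random.seed(2024)
# a measurable set E: the unit square minus 20 holes of radius 0.01
# (total hole area <= 20 * pi * 0.01^2, about 0.006)
holes = [(random.random(), random.random(), 0.01) for _ in range(20)]

def in_E(x, y):
    return all((x - hx) ** 2 + (y - hy) ** 2 > r * r for hx, hy, r in holes)

def square_corners():
    """Corners of a random small square placed well inside the unit square."""
    cx, cy = random.uniform(0.3, 0.7), random.uniform(0.3, 0.7)
    a, h = random.uniform(0, math.pi / 2), 0.1
    c, s = math.cos(a), math.sin(a)
    return [(cx + h * (c * u - s * v), cy + h * (s * u + c * v))
            for u, v in ((1, 1), (-1, 1), (-1, -1), (1, -1))]

trials = 5000
hits = sum(all(in_E(x, y) for x, y in square_corners()) for _ in range(trials))
frac = hits / trials   # close to 1, as the union bound predicts
```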
I am sure that this is well known to experts, but I include it here to highlight the result. I find it interesting that we can turn the “pathological” properties of a curve to our advantage. When given lemons, make lemonade. This is of course a recurrent theme in complexity theory: For example, we know that random boolean functions have high complexity.
Here are some references to work on ISP.
“Differentiability of Lipschitz functions, structure of null sets, and other problems”
by Giovanni Alberti, Marianna Csörnyei, and David Preiss
“Splitting Loops And Necklaces: Variants Of The Square Peg Problem”
by Jai Aslam, Shujian Chen, Florian Frick, and Sam Saloff-Coste
“Squares and other polygons inscribed in curves”
by Elizabeth Denne
“Transversality in Configuration Spaces and the `Square Peg’ Theorem”
by Jason Cantarella, Elizabeth Denne, and John McCleary
“Every smooth Jordan curve has an inscribed rectangle with aspect ratio equal to √3”
by Cole Hugelmeyer
“A Combinatorial Approach to the Inscribed Square Problem”
by Elizabeth Kelley
“Squares on Curves”
by Dongryul Kim
“A survey on the Square Peg Problem”
by Benjamin Matschke
“Figures Inscribed in Curves A short tour of an old problem”
by Mark Nielsen
“The Discrete Square Peg Problem”
by Igor Pak
“On The Unfolding Of Simple Closed Curves”
by John Pardon
“On The Number Of Inscribed Squares In A Simple Closed Curve In The Plane”
by Strashimir Popvassilev
“The Jordan curve theorem is non-trivial”
by Fiona Ross and William Ross
“Two discrete versions of the Inscribed Square Conjecture and some related problems”
by Feliu Sagols and Raul Marin
“A Trichotomy for Rectangles Inscribed in Jordan Loops”
by Richard Schwartz
“An Integration Approach To The Toeplitz Square Peg Problem”
by Terence Tao.
Of course the key problem is the full conjecture, which remains open. But a possibly approachable problem seems to be this: If a Jordan curve has infinite length, can we show that the conjecture is true?
Jordan’s proof of his claim: Suppose C is a curve that has finite length L. Partition the plane into a grid of squares of side length L/n, for an n to be selected in a moment. Also divide the curve into n parts, each of length L/n. This is possible since C has finite length. The key is: each part of the curve hits at most 4 squares of the grid. Thus the area of the curve is easily seen to be bounded by

4n · (L/n)^2 = 4L^2/n,

but this tends to 0 as n goes to infinity. Done.
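The shrinking bound is easy to watch numerically. A Python sketch using an ellipse of length roughly 9.7 (the constant 10 below plays the role of L) and counting the grid cells touched by a fine sampling of the curve:

```python
import math

def cells_hit(points, side):
    """Grid cells of the given side length touched by a fine sampling of a curve."""
    return {(math.floor(x / side), math.floor(y / side)) for x, y in points}

# a very fine sampling of the ellipse x = 2 cos t, y = sin t (length about 9.7)
pts = [(2 * math.cos(2 * math.pi * k / 100000), math.sin(2 * math.pi * k / 100000))
       for k in range(100000)]

# the area estimate (#cells * side^2) shrinks like 4*L^2/n as the grid refines
areas = [len(cells_hit(pts, 10.0 / n)) * (10.0 / n) ** 2 for n in (10, 100, 1000)]
```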
Marmaduke Wyvill was a British chess master and Member of Parliament in the 1800s. He was runner-up in what is considered the first major international chess tournament, London 1851, but never played in a comparable tournament again. He promoted chess and helped organize and sponsor the great London 1883 chess tournament. Here is a fount of information on the name and the man, including that he once proposed marriage to Florence Nightingale, who became a pioneer of statistics.
Today we use Wyvill’s London 1883 tournament to critique statistical models. Our critique extends to ask, how extensively are models cross-checked?
London is about to take center stage again in chess. The World Championship match between the current world champion, Magnus Carlsen of Norway, and his American challenger Fabiano Caruana will begin there on November 9. This is the first time since 1972 that an American will play for the title. The organizer is WorldChess (previously Agon Ltd.) in partnership with the World Chess Federation (FIDE).
The London 1883 tournament had two innovations. It was the first to use chess clocks. The second was that in the event of a game being drawn, the players had to play another game, twice if needed. Only after three draws would the point be considered halved, and this happened only seven times in 182 meetings. Chess clocks have been used in virtually every competition since, but the second experiment has never been repeated—the closest to it will come next year. Two scientific imports were that the time to decide on a move was regulated and games without critical action were set aside.
Chess has long been considered a (or “the”) “game of science.” It has been the focus of numerous scientific studies. Here we emphasize how it is a copious source of scientific data. Millions of games—every top-level game and nowadays many games at lower levels—have been preserved in databases. Except that the time taken by players to choose a move at each turn is recorded only sporadically, we have easy access to full information about each player’s choices of moves.
What we also have now is authoritative judgment on the true values of those choices via analysis by strong computer chess programs. Those programs, called “engines,” can beat even Carlsen and Caruana with regularity, so we humans have no standing to doubt their judgments. The programs’ values for moves are a robust quality metric, and correlate supremely with the Elo Rating, which provides a robust skill metric.
The move values are the only chess-specific input to my statistical choice model. I have covered it several times before, but not yet in the sense of going “back to square one” to say how it originated—where it fits among decision models.
This year I have overhauled the model’s 28,000+ lines of C++ code. More exactly, I have “underhauled” it by chopping out stale features, removing assumptions, and simplifying operations. I widened the equations to accommodate multiple alternative models and fitting methods, besides the ones I’ve deployed to judge allegations of cheating on behalf of FIDE and other chess bodies. The main alternative discussed here is one I programmed and rejected nine years ago, but having recently tried multiple other possibilities reinforces the points about models that I am making here. So let’s first see one general form of decision model and how the chess application fits the framework.
The general goal is to project the probabilities of certain decision choices or event outcomes in terms of data about the situations and the attributes of the decision makers. An index i can refer to multiple actors and/or multiple situations; we will suppress it when intent is clear. The index j refers to multiple alternatives (j = 1, …, m) in any situation and their probabilities p_{i,j}. The goal for any i is to infer p_{i,j} as a function of i, j, and internal model parameters θ. The function is “the” model.
The models we consider all incorporate θ into a function that takes i and j and outputs quantities u_{i,j}—which we will speak of as single numbers but which could be vectors over a separate index k. In many settings, u_{i,j} represents the utility of outcome j for the actor or situation i, which the actor wants to maximize or at least gain enough of to satisfy needs. Insofar as u_{i,j} depends on i it is distinct from a neutral notion of “objective value” v_j. Such a distinction was already observed in the early 1700s.
The multinomial logit model, and log-linear models in general, represent the logarithms of the probabilities as linear functions of the other elements. Using the utility function this means setting

log p_j = β·u_j + c,   (1)

where we have suppressed i and β·u_j could be multiple linear terms β_1 u_{j,1} + β_2 u_{j,2} + ⋯. This makes

p_j = e^{β·u_j + c}   (2)

for all j. Then c becomes a normalization constant to ensure that the probabilities sum to 1, dropping out to give the final equations

p_j = e^{β·u_j} / Σ_k e^{β·u_k}.   (3)

Fitting β thus yields all the probabilities. Note that putting a difference log p_j − log p_1 on the left-hand side of (1), which is the log of the ratio of the probabilities, leads to the same model and normalization (up to the sign of β). The function of normalizing exponentiated quantities is so common it has its own pet name, softmax.
These last three equations were known already in 1883 via the physicists Josiah Gibbs and Ludwig Boltzmann, with β coming out in units of inverse temperature, the denominator of (3) representing the partition function of a physical system, and the numerator the Boltzmann factor. It seems curious that apart from some contemporary references by Charles Peirce they were not used in wider contexts until the World War II era. Equation (3) essentially appears as equation (1) in the 2000 Economics Nobel lecture by Daniel McFadden, who calls it “the” multinomial logit model (see also this) and traces it to work by Duncan Luce in 1959. Such pan-scientific heft makes its failure in chess all the more surprising.
In chess tournaments we have multiple players, but only one is involved in deciding each move. So we focus on one player but can treat multiple players as a group. Instead, what we represent with the index i is the player(s) facing multiple positions. To a large extent we can treat those decisions as independent. Even if the player executes a plan over a few moves the covariance is still sparse, and often players realize they have to revise their plans on the next turn. Thus we replace i by an index t signifying “turn” or “time” for each position faced by the player. Clearly we want to fit by regression over multiple turns t, so β and any other fitted parameters will not depend on t, which again we sometimes suppress.
Each possible move at each turn is given a value v_j by the chess engine(s) used to analyze the games. We order the moves by those values, so the engine’s first-listed move has the optimal value v_1. In just over 90% of positions there is a unique optimal move. There are four salient ways to define the utility u_j from these values—prefatory to involving model parameters describing the player:
Option (d) automatically scales down differences in positions where one side is significantly ahead. The same small slip that would halve one’s chances in a balanced position might only reduce 90% to 85% in a strong position or be nearly irrelevant in a losing one. I remove the most extremely unbalanced positions from samples anyway. For (c) I use a “non-sliding” scale function whose efficacy I detailed here, but I can easily generate results without it. Note that if I were to cut the sample down only to balanced positions—v_1 near 0—of which there are a lot, then (a) and (b) become respectively equivalent to (c) and (d) anyway, up to signs which are handled by flipping the sign of β.
My primary model parameter, called s for “sensitivity,” is just a divisor of the values and so gets absorbed by β. I have a second main parameter c for “consistency,” but more on it later. Having s is enough to fulfill the dictates of the multinomial logit model in the simplest manners.
Criteria for fitting log-linear models are also a general issue. For linear regressions, least-squares fitting is distinguished by its being equivalent to maximum-likelihood estimation (MLE) under Gaussian error, but ℓ1 or other distances can be minimized instead. With log-linear regressions the flex is wider and MLE competes with criteria that minimize various discrepancies between quantities projected from the fitted probabilities and their actual values in the sample. Here are four of them—we will see more:

- MoveMatch: the number of played moves that match the engine’s first-listed move.
- EqValueMatch: the number of played moves whose value equals the optimal value.
- ExpectationLoss: the aggregate (scaled) value lost by the played moves.
- AvgDifference: the average (scaled) difference between the optimal value and the played move’s value.
The reason for the first three in particular is that they create my three main tests for possible cheating, so I want to fit them on my training data (which now encompasses every rating level from Elo 1025 to Elo 2800 in steps of 25) to be unbiased estimators. Besides those and MLE—which here means maximizing the projected likelihood of the moves that were observed to be played (or alternately the likelihood of the observed match/non-match sequence, and various others)—my code allows composing a loss function from myriad components and weighting them ad-lib. Components unused in the fitting become the cross-checks.
Ideally, we’d like all the fitting criteria to produce similar fits—that is, close sets of fitted values for and other parameters on the same data. Finally, the code implements other modeling equations besides multinomial logit—and we’d like their results to agree too. But let’s first see how multinomial logit performs.
I analyzed the 168 official played games of London 1883 (one competitor left just after the halfway point), and separately, the 76 rejected draws, using Stockfish 7 to high depth. The former give 10,289 analyzed game turns after applying the extreme-value cutoff and a few others. Using the simple unscaled version (c) of utility and fitting according to move-matching gives these results for the first three fitting criteria:
Test Name        ProjVal  Actual   Proj%   Actual%  z-score
MoveMatch        4871.02  4871.00  47.34%  47.34%   z = -0.00
EqValueMatch     5228.95  5201.00  50.82%  50.55%   z = -0.73
ExpectationLoss   259.20   297.34  0.0252  0.0289   z = -10.58
This is actual output from my code, except that to avoid crowding I have elided some columns, including the standard deviations on which the z-scores are based. The z-scores give a uniform way to judge goodness of fit. The first one is exactly zero because that was the criterion expressly fitted. The fitted model generates a projection for the second one that is higher than what actually happened at London 1883, but only slightly: the z-score is within one standard deviation. The third, however, is under-projected by more than 10 standard deviations. In absolute terms it doesn’t look so bad—259 is only 13% smaller than 297—but the large z-score reflects our having a lot of data. Well, there’s large and there’s huge:
Test Name        ProjVal  Actual   Proj%  Actual%  z-score
AvgDifference    1493.46  2780.07  0.145  0.270    z = -60.9638
The projection is only half what it should be. The z-score is inconceivable.
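For readers wondering where the z-scores come from: each turn contributes an (approximately independent) Bernoulli trial, so the projected total is the sum of the per-turn probabilities and its variance is the sum of p(1−p). A Python sketch with made-up probabilities:

```python
import math

def projection_test(probs, actual):
    """Projected total, its standard deviation, and the z-score of the actual
    count, treating the turns as independent Bernoulli trials (Poisson-binomial)."""
    proj = sum(probs)
    sigma = math.sqrt(sum(p * (1 - p) for p in probs))
    return proj, sigma, (actual - proj) / sigma

# e.g. 100 turns where the model projects a top-move match with prob 0.5 each time
proj, sigma, z = projection_test([0.5] * 100, actual=60)
```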
For more cross-checks, there are the projected versus actual frequencies of playing the i-th best move for each rank i. Here is the table for ranks 1–10:
Rk  ProjVal  Actual   Proj%   Actual%  z-score
 1  4871.02  4871.00  47.34%  47.34%   z = -0.00
 2  1416.72  1729.00  13.80%  16.85%   z = +9.44
 3   761.84   951.00   7.47%   9.32%   z = +7.33
 4   523.25   593.00   5.19%   5.88%   z = +3.20
 5   401.63   410.00   4.03%   4.11%   z = +0.43
 6   325.30   295.00   3.29%   2.99%   z = -1.73
 7   272.12   247.00   2.77%   2.51%   z = -1.57
 8   232.05   197.00   2.37%   2.01%   z = -2.36
 9   200.88   169.00   2.06%   1.73%   z = -2.30
10   175.95   104.00   1.81%   1.07%   z = -5.54
The first row was the one fitted. Then the projections are off by three percentage points for rank 2 and almost two for rank 3. For ranks 5–9 they are tantalizingly close, but by rank 10 they have clearly overshot—as they must for the probabilities to add to 1.
There are yet more cross-checks of even greater importance. They are the frequencies with which players make errors of a given range of magnitude: a small slip, a mistake, a serious misstep, a blunder. Those results are too gruesome to show here. Fitting by MLE helps in some places but throws the fit off entirely elsewhere.
The huge gaps in these, and especially in the “AvgDifference” test (AD for short), rule out any patch to the log-linear model with one parameter β. I have tried adding other linear terms representing features such as a move turning an advantage into a disadvantage (v_1 > 0 but v_j < 0). They give haywire results unless the nonlinearity described next is introduced.
This is to define the utility function using a new parameter c as

u_j = −(δ_j / s)^c,   where δ_j = v_1 − v_j.

Without scaling this is just −δ_j^c; one can also put the power on the values v_1 and v_j separately. In forming (1) it does not matter whether the fitted scale is represented as β or as the divisor s. Since s divides out the “pawn units” of v and δ, this means that without loss of generality we can write

log p_1 − log p_j = (δ_j / s)^c.

This makes clear that the quantity being powered is dimensionless. The motivation is that in any quantity of the form (δ/s)^c, the marginal influence of c becomes greater for large δ than that of s. Thus c can be said to govern the propensity for making large mistakes while s governs the perception of small differences in value. Higher c and lower s correspond to higher skill. The former connotes the ability to navigate tactical minefields, the latter the strategic skill of amassing small advantages.
Thus I regard c as natural in chess. In my results, c usually fits with values that go up as the Elo level increases, staying in the rough neighborhood of square-root (c ≈ 1/2) and definitely apart from c = 1. It also changes the calculus on a property called “independence from irrelevant alternatives,” which McFadden cites from Luce but which has issues discussed e.g. here.
Since c is part of the revised utility function, the model is still log-linear in the utility and the probabilities are still obtained via the procedure (1)–(3). The end-product is that having both s and c allows fitting two criteria exactly to yield them as unbiased estimators. Here are the results of fitting move-matching and AD in what is now a “log-radical” model:
Test Name        ProjVal  Actual   Proj%   Actual%  z-score
MoveMatch        4870.99  4871.00  47.34%  47.34%   z = +0.00
AvgDifference    2780.10  2780.07  0.2702  0.2702   z = +0.00
EqValueMatch     5261.35  5201.00  51.14%  50.55%   z = -1.44
ExpectationLoss   413.42   297.35  0.0402  0.0289   z = +15.89
The equal-optimal projection remains OK. The expectation loss, however, flips from an under-projection to a vast over-projection. The cross-checks from the move ranks give further bad news:
Rk  ProjVal  Actual   Proj%   Actual%  z-score
 1  4870.99  4871.00  47.34%  47.34%   z =  +0.00
 2  1123.22  1729.00  10.94%  16.85%   z = +19.88
 3   633.30   951.00   6.21%   9.32%   z = +13.27
 4   459.83   593.00   4.56%   5.88%   z =  +6.44
 5   370.58   410.00   3.72%   4.11%   z =  +2.11
 6   311.98   295.00   3.16%   2.99%   z =  -0.99
 7   270.56   247.00   2.75%   2.51%   z =  -1.46
 8   239.36   197.00   2.44%   2.01%   z =  -2.79
 9   214.30   169.00   2.19%   1.73%   z =  -3.15
10   193.93   104.00   1.99%   1.07%   z =  -6.57
The discrepancy in the second-best move has doubled to six percentage points while the third-best move is off by more than three.
Maximum-likelihood fitting makes the gaps even worse. No re-jiggering of fitting methods nor the formula for comes anywhere close to coherence. Inconsistency in the second-best move kills everything. The fault must be tied all the way to the log-linear model for the probabilities.
As we have noted, taking the difference of logs, and inverting so that signs stay positive like so:

$\log(1/p_i) - \log(1/p_1) = u_1 - u_i,$
does not change the model. The likelihoods are still normalized arithmetically as in the Gibbs equations. Taking a difference of double logarithms, however, yields something different:

$\log\log(1/p_i) - \log\log(1/p_1) = u_1 - u_i.$
With the utility still defined as before, this creates a triple stack of exponentials on the right-hand side. This all looks really unnatural, but see the results it gives, now also showing the interval and large-error tests that were "too gruesome" before:
Test Name         ProjVal   Actual   Proj%   Actual%  z-score
MoveMatch         4871.02   4871.00  47.34%  47.34%   z = -0.00
AvgScaledDiff     1142.61   1142.59  0.111   0.111    z = +0.00
EqValueMatch      5251.90   5201.00  51.04%  50.55%   z = -1.10
ExpectationLoss    333.20    334.46  0.0324  0.0325   z = -0.19

Rk  ProjVal  Sigma  Actual   Proj%   Actual%  z-score
 1  4871.02  47.02  4871.00  47.34%  47.34%   z = -0.00
 2  1786.89  37.32  1729.00  17.41%  16.85%   z = -1.55
 3   929.87  28.60   951.00   9.11%   9.32%   z = +0.74
 4   589.93  23.29   593.00   5.85%   5.88%   z = +0.13
 5   419.35  19.84   410.00   4.21%   4.11%   z = -0.47
 6   315.24  17.32   295.00   3.19%   2.99%   z = -1.17
 7   246.68  15.39   247.00   2.51%   2.51%   z = +0.02
 8   198.71  13.85   197.00   2.03%   2.01%   z = -0.12
 9   161.54  12.52   169.00   1.65%   1.73%   z = +0.60
10   134.18  11.43   104.00   1.38%   1.07%   z = -2.64
11   111.41  10.43    97.00   1.15%   1.00%   z = -1.38
12    93.90   9.59    99.00   0.97%   1.02%   z = +0.53
13    77.94   8.75    76.00   0.81%   0.79%   z = -0.22
14    65.40   8.02    78.00   0.68%   0.82%   z = +1.57
15    55.13   7.37    62.00   0.58%   0.65%   z = +0.93

Selec. Test  ProjVal  Actual  Proj%  Actual%  z-score
Delta01-10    656.08  645.00  6.38%  6.27%    z = -0.56
Delta11-30    800.75  824.00  7.78%  8.01%    z = +1.02
Delta31-70    596.51  607.00  5.80%  5.90%    z = +0.50
Delta71-150   295.70  290.00  2.87%  2.82%    z = -0.38
Error>=050    709.46  675.00  6.90%  6.56%    z = -1.55
Error>=100    331.88  300.00  3.23%  2.92%    z = -2.02
Error>=200    141.56  114.00  1.38%  1.11%    z = -2.62
Error>=400     68.76   35.00  0.67%  0.34%    z = -4.61
Only the first two lines have been fitted. The other lines follow like obedient ducks—and this persists through all tournaments that I have run.
There are some wobbles that also persist: The second-best move is somewhat over-projected and the third-best move slightly under—but the remaining indices are off by small amounts whose signs seem random. So are the interval tests at the end, except that large errors are over-projected. The match to moves of equal-optimal worth tends to be over-projected regardless of the patch described here. Nevertheless, the overall fidelity under so much cross-validation is an amazing change from the log-linear cases.
To frame the most particular issue I see: one can grant that the original log-linear formulation could be fine for a one-shot purpose, say if the move-matching cheating test were the only thing one cared about. The concern is that in the absence of validation beyond what that purpose needs, "mission creep" could extend the usage unknowingly into flawed territory. It is important to me that a model score well on a larger slate of pertinent phenomena. Do other models have as rich a field of data and cross-checks as chess does?
Is there extensive literature on modeling the double logarithms of probabilities—and on representing probabilities as powers rather than multiples of the "pivot" $p_1$? We have seen scant references. The term "log-log model" instead refers to having a logarithm on both sides, e.g., $\log y = a + b \log x$. Alternatives to log-linear models need to be more conscious of error terms in the utility functions, so perhaps uncertainty needs a more explicit representation in my formulas.
The form where does have the general issue that when should be very close to 1—as for a completely obvious move in chess—there is strain on getting the exponents large enough to make tiny. The over-projection of large errors ( too high) is a symptom of this. Some of my past posts give my thinking on this, but the implementations have been hard to control, so I would be grateful to hear reader thoughts.
[some name and word fixes]
Composite of src1, src2 |
Andrew Odlyzko and Herman te Riele, in a 1985 paper, refuted a once widely believed conjecture from 1885 that implies the Riemann Hypothesis. The belief began a U-turn in the 1940s, and by the late 1970s the community was convinced its refutation would come from algorithmic advances to bring the needed computation into a feasible range. Odlyzko and te Riele duly credit advances in algorithms—not mere computing power—for their refutation.
Today we consider reasons for and against a belief in the Riemann Hypothesis (RH), contrast them with the situation for , and point out that there are numerous conjectures weaker than RH but wide open which an RH claimant might try to crack first.
The conjecture in question is named for Franz Mertens but was first made by Thomas Stieltjes in a letter to Charles Hermite. We have talked about it in some detail here and here. Shorn of detail, the Mertens conjecture has the form:

$(\forall x \ge 2):\ |f(x)| < g(x),$

where $f$ and $g$ are computable—indeed $f(x) = M(x) = \sum_{n \le x} \mu(n)$ and $g(x) = \sqrt{x}$. If any conjecture of this kind is false, it can be falsified by a finite computation that gives an $x$ such that $|f(x)| \ge g(x)$.
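The finite-falsifiability point can be made concrete. The sketch below (names and bounds are ours, for illustration) computes the Möbius function by a sieve, accumulates $M(x)$, and checks $|M(x)| < \sqrt{x}$ over a range; any violation found this way would refute the conjecture outright.

```python
def mobius_sieve(n):
    """mu(k) for k = 0..n via a sieve: flip sign once per prime factor,
    then zero out multiples of squares of primes."""
    mu = [1] * (n + 1)
    is_prime = [True] * (n + 1)
    for p in range(2, n + 1):
        if is_prime[p]:
            for k in range(p, n + 1, p):
                if k > p:
                    is_prime[k] = False
                mu[k] = -mu[k]
            for k in range(p * p, n + 1, p * p):
                mu[k] = 0
    return mu

def mertens_holds_up_to(n):
    """Check |M(x)| < sqrt(x) for 2 <= x <= n, where M(x) is the running
    sum of mu. Returns (True, None), or (False, x) for the first bad x;
    a (False, x) answer would be an outright refutation."""
    mu = mobius_sieve(n)
    m = 0
    for x in range(1, n + 1):
        m += mu[x]
        if x >= 2 and m * m >= x:   # integer test for |M(x)| >= sqrt(x)
            return False, x
    return True, None

ok, first_bad = mertens_holds_up_to(5000)
```

Of course, as the post explains, no feasible range of such direct search has found (or is expected to find) a counterexample.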
Recall the RH says that all the zeroes of the analytically-continued complex function $\zeta(s)$, other than the "trivial" ones at $s = -2n$ for positive integers $n$, have the form $s = \frac{1}{2} + it$ for some real $t$. There are statements of the same kind that are equivalent to RH, including this one by Jeff Lagarias:

$\sum_{d \mid n} d \;\le\; H_n + e^{H_n}\ln H_n \quad\text{for all } n \ge 1,$

where $H_n = 1 + \frac12 + \cdots + \frac1n$ is the $n$-th harmonic number.
Enrico Bombieri, in his essay on RH for the Clay Mathematical Institute, details how RH (plus the condition that all the zeroes are simple) can be verified directly for $t$ up to any fixed height $T$ by a finite computation. Odlyzko and te Riele and others have computed many zeroes to high precision, not only up to certain heights but in regions around select much higher values. Odlyzko and Arnold Schönhage designed an algorithm that enabled higher computations including Xavier Gourdon's 2004 verification of RH for the first $10^{13}$ zeroes and some higher patches.
This work yields some further lessons that we feel are relevant to the current discussion over the RH and what bearing Sir Michael Atiyah's recent ideas may have on it.
Know the data. The key data in Odlyzko and te Riele's computation came from their own extensive tabulation of zeroes of $\zeta$. Odlyzko's tables are online.
Initial data may mislead. This is an old lesson but always bears repeating. The Mertens conjecture was initially believed based on its holding for incredibly many $x$. Now we know that all $x$ up to $10^{16}$ obey $|M(x)| < \sqrt{x}$, and there are strong reasons for believing this continues through much higher ranges.
Augment computation by proof. Odlyzko and te Riele did not simply search for a counterexample $x$. Instead their finite computation used a portion of their known zeroes, new lattice reduction techniques, and methods of complex analysis to prove that $M(x)/\sqrt{x}$ must somewhere rise above $1$. In fact, no concrete counterexample $x$ is yet known—only an upper bound of about $e^{1.59 \times 10^{40}}$ on the least one, in a 2006 paper by te Riele with Tadej Kotnik.
This situation also still holds with regard to John Littlewood's refutation of Bernhard Riemann's further quasi-conjecture that the number $\pi(x)$ of primes up to $x$ is strictly bounded above by the log integral $\mathrm{li}(x)$, which would imply the RH. He proved that $x$ giving $\pi(x) > \mathrm{li}(x)$ must exist, and his student Stanley Skewes at Cambridge extracted an astronomical but finite upper bound for such an $x$ from the analytical techniques. The challenge of improving the bounds and maybe finding an explicit $x$ drew Alan Turing into drafting computational devices and methods for RH. See Odlyzko's 2012 Turing Centennial slides for a blueprint of Turing's "zeta machine" and his 2012 paper with Dennis Hejhal for much more.
Know the neighbors. Odlyzko and te Riele address the weaker form of the Mertens conjecture that asserts $|M(x)| < C\sqrt{x}$ for some fixed constant $C$. Stieltjes thought he had proved this. They give reasons to expect that expanding their methods and their zeta dataset will falsify this too by raising the constant in known lower bounds to an arbitrarily high one. For a case in point, the paper with Kotnik raised it further.
The RH is, however, equivalent to the bound $M(x) = O(x^{1/2+\epsilon})$ for every $\epsilon > 0$. Similarly, it is equivalent to the convergence of

$\sum_{n=1}^{\infty} \frac{\mu(n)}{n^s} \;=\; \frac{1}{\zeta(s)}$

for every $s$ of real part greater than $\frac12$.
This leads to a further remark. If there is a zero with real part $\theta > \frac12$ then $M(x)$ sometimes exceeds $x^{\theta - \epsilon}$, and vice-versa. It has not even been proved that there are no zeroes with real part greater than, say, $0.99$. Such a partial result ought to be more tractable, for reasons including that the possible densities of zeroes are known to dwindle as the real part approaches $1$ (or $0$ by symmetry), while at least 40% of the zeroes are known to be exactly on the line. Hence bounding the real parts of zeroes strictly away from $1$ ought to be a first step for an RH claimant that may yield easier verification—and even if that is the only thing proved it would still be a monumental advance.
The great RH may be solved or may not. But it is interesting to see that not all believe that the RH is true. Most papers on the RH are cautious about whether it is true or not. The original paper of Riemann called it "very probable." Littlewood famously stated his belief that the RH is false; Paul Turán disbelieved it and for most of his life so did Turing. The last main chapter of John Derbyshire's 2003 book Prime Obsession is titled with a quote given him by Odlyzko:
Either it’s true, or it isn’t.
We in computer theory have our own Clay Problem, the $\mathsf{P}$ versus $\mathsf{NP}$ problem. While most seem to believe that $\mathsf{P}$ is not equal to $\mathsf{NP}$, some do believe that it could be the case that $\mathsf{P} = \mathsf{NP}$. One important difference from the RH is that neither side is known to be falsifiable or verifiable by a finite computation. We note the near-perfect balance between claims of $\mathsf{P} = \mathsf{NP}$ and claims of $\mathsf{P} \neq \mathsf{NP}$ on Gerhard Woeginger's claims page. On the other hand, we noted above that a refutation of the RH that can be feasibly executed and checked might need to do argumentation rather than just a finite computation.
This suggests that it may be useful to treat the problems similarly and look at the reasons behind beliefs both for and against the RH. So let’s look at the usual arguments—we follow Wikipedia and a 2003 survey by Aleksandar Ivić on reasons to doubt the RH—with a view to how they are tending.
Several analogues of the RH over finite fields have already been proved. This is emphasized in nice detail by Bombieri’s essay and we defer to it.
Numerical evidence goes beyond verifying that the first 10 trillion zeroes obey the RH. Turing's negative belief was not only that the RH is false but that past a certain height $T$, counterexamples would occur with non-negligible frequency. If so, then one could refute the RH by doing relatively less computation but starting at much higher values of $t$. Odlyzko and others have sampled such high regions and found many zeroes there. All of them obey the RH.
To be sure, there are reasons for believing that critical magnitudes for testing the RH have not yet been reached. And of course there are other conjectures in analytic number theory besides those mentioned above that have been supported by large amounts of numerical evidence yet have turned out to be false. So watch out. We note also this paper about numerical RH experiments with beautiful plots.
The probabilistic argument for the RH is based on the behavior of the Möbius function. If $\mu(n)$ behaves roughly like a random $\pm 1$ sequence on squarefree $n$, then the $O(x^{1/2+\epsilon})$ bound mentioned above should hold and make the RH true. See, however, these remarks by Eric Bach and Jeffrey Shallit.
Odlyzko in 1987 opened a new avenue by deep statistical testing of a conjecture by Hugh Montgomery about correlations between pairs of zeroes of $\zeta$. This enhanced an observation by Freeman Dyson that the correlations are the same as for random Hermitian matrices from the Gaussian unitary ensemble (GUE) often used in physics. David Hilbert and George Pólya had earlier conjectured that the zeroes correspond to eigenvalues of some self-adjoint linear operator in a way that implies the RH. This led to a spurt of optimism, represented by a 2000 paper by John Brian Conrey on the "GUE Conjecture," for finding such an operator. This 2015 survey by Marek Wolf gives more recent news. It is titled "Will a physicist prove the Riemann Hypothesis?" and its later sections may verge on the intuitions accompanying Sir Michael Atiyah's claims about the RH.
A flip side about these correlations, however, starts with Derrick Lehmer's 1956 observation of a phenomenon whereby two zeroes are sometimes very close together. This leads to a possible argument that the RH is false. In January of this year, Brad Rodgers and Terence Tao turned up the heat on this connection by proving that the GUE conjecture entails the existence of infinitely many Lehmer pairs. Further, they showed that an equivalent form of the RH—in terms of an asymptotically-defined numerical constant $\Lambda$ (the de Bruijn–Newman constant) being non-positive—can in fact hold only if $\Lambda$ is exactly zero. Intuitively this reduces the "margin of error" for the RH to hold—see also discussion in this earlier survey.
This last point brings us right back to our question about proving partial results toward the RH. A PolyMath project on bounding $\Lambda$ from above has achieved $\Lambda \le 0.22$, and a better bound is known to follow if and when the RH is verified for $t$ up to greater heights. There was previously a lower bound on the order of $-10^{-11}$, a hair-breadth away from the critical value $0$. A partial result on the way to $\mathsf{P} \neq \mathsf{NP}$ would be proving that $\mathsf{SAT}$ is not in time $O(n \log n)$. Proving such a just-above-linear lower bound for $\mathsf{SAT}$ wouldn't even be placing it outside of $\mathsf{P}$, yet it would be monumental, just as would proving that no zero has real part above $0.999$. Claimers of $\mathsf{P} \neq \mathsf{NP}$ never seem to address such simple milestones, and the same seems to be true about RH claims that we have perused.
If we plotted community belief in the RH over time on a scale from 0% likely to 100% likely, what would the graph look like? Is it still ascending, or are there signs of its rounding over as with the (big-$O$ form of the) Mertens conjecture? Odlyzko's quote prompts us to wonder, has it ever touched the line of being 50-50?
Another issue: why do claimers always claim “it all”—why do they never claim a big partial result? Is this some deep psychological human trait?
[added Bach-Shallit reference for random argument]
MacTutor biography source |
John Todd was a British geometer who worked at Cambridge for most of his life. Michael Atiyah took classes from him there. He was not Atiyah’s doctoral advisor—that was William Hodge—but he advised Roger Penrose, Atiyah’s longtime colleague at Oxford.
Today Ken and I want to add to the discussion of Atiyah’s proof of the Riemann Hypothesis (RH).
Primary sources are Atiyah’s short paper and longer precursor, the official video of his talk, and his slides. Discussion started here and has continued in several forums. MathOverflow removed their discussions; apparently so did StackExchange. A number of news sources reflect the universal skepticism.
We will not try to cover the same ground as these discussions, nor enumerate statements about holes in the papers. Instead we have gained some small insights into what Atiyah is doing. We are not disagreeing with the conclusion by many that “it’s not all there” but we think we can identify a few more things that are there—by intent—than we’ve seen noted. They don’t make a proof (either) but we think they are important to understand where all this is coming from, and that such an understanding is warranted. At the very least this is an exercise in how to read a challenging source.
We will first explain a previous proof that uses a related method—a famous proof that works and is correct. Then we will explain Atiyah’s idea as we see it.
Atiyah of course is well aware of the classic use of a special function to prove a deep theorem of complex analysis. Let’s call this the “Todd Trick.” The proof uses the existence of a complex function lambda () with certain special properties. Let’s recall the famous Liouville theorem, named after Joseph Liouville:
Theorem 1 Every bounded entire function must be a constant.
Then the famous stronger Picard theorem, named after Émile Picard, states:
Theorem 2 Every entire function that misses two points must be a constant.
Sketch of Proof: Let $f$ be an entire function that misses two points, which we may assume are $0$ and $1$—this follows by using a linear map to move the missed points, if needed. Then the magic is to look at a function $g$ obtained by lifting $f$ through $\lambda$, which maps the upper half-plane onto the plane minus $\{0,1\}$, so that $\lambda \circ g = f$.
Composing $g$ with a conformal map from the upper half-plane to the unit disk gives a bounded entire function. But then by Liouville's theorem it must be a constant, whence $g$ is a constant and so $f$ is a constant.
Atiyah claims to have a proof of the famous RH. It is based on a special function that he calls the Todd function and denotes by $T$. The function has a slew of special properties that Atiyah lists. Then he uses the properties to prove the RH directly.
This proof is very much in the spirit of the above proof of the Little Picard Theorem. So there is hope. But we are puzzled over some of the properties that $T$ is supposed to have. We must be confused, but it seems that $T$ cannot have all the properties that are needed.
Here is the way that we think Atiyah's proof is going. Consider the space of complex functions $f$ with a power series centered around some fixed point $a$, writing $f(s) = c_0 + c_1(s-a) + c_2(s-a)^2 + \cdots$. Define

$T\{f\}(s) = c_0 + c_1(s - a).$

This will be our "Todd function." Note this function is well defined on the given space and satisfies the key property

$T\{T\{f\}\} = T\{f\}.$

This explains how it is possible to get a linear function of $s$ with no higher terms. We use $T\{\cdot\}$ for our version of his function to mark that it is a variation of what he seems to say.
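Our operator reading of the Todd function can be made concrete in a short sketch—this is our interpretation of the truncation idea, not code for anything in Atiyah's papers. A power series is a coefficient list, $T$ keeps only the constant and linear coefficients, and idempotence holds by construction:

```python
def T(coeffs):
    """Curly-T as we read it: truncate a power series, given as the
    coefficient list [a0, a1, a2, ...], to its linear part a0 + a1*s.
    This encodes our interpretation, not a formula from the papers."""
    padded = list(coeffs) + [0, 0]   # guard against short lists
    return padded[:2]

def evaluate(coeffs, s):
    """Evaluate a coefficient list as a polynomial at the point s."""
    return sum(c * s ** k for k, c in enumerate(coeffs))

series = [1, 0.5, -0.125, 0.0625]    # sqrt(1+s) = 1 + s/2 - s^2/8 + ...
lin = T(series)                       # [1, 0.5]: the linear truncation
```

The point of the two-stage design is exactly the typing issue discussed below: `T` consumes a series (a function), and only then is the result applied to a complex number.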
$T$ in this form is really an operator, as signaled by the use of curly braces in the last equation. Note that its right-hand side also equals $T\{f\}(s)$. This is the sense we get from two pivotal equations on page 3 of Atiyah's short paper—where, however, we use curly $T\{\cdot\}$, not plain $T(\cdot)$. The former, to which we've added the label '(A)', is said to apply when "$f$ and $g$ are power series with no constant term."
The reason we use curly $T$ is that the only way to make sense of the former equation is to read it as $T$ applied to the functions $f$ and $g$ in the form of power series, and then the resulting function or series is applied to the complex number $s$. It does not make sense to say that for any $s$ we evaluate $f(s)$ and $g(s)$ and then say that the resulting numbers obey (A).
Our point is that the latter equation (3.1) hence needs to be read the same way—not as a simple function value $T(w)$ where $w = \zeta(s)$. That is, it must be what we are writing as curly-$T$ applied to $\zeta$ as a function—indeed, as a power series. Then the result is applied to $s$. This is neither hairsplitting nor special pleading but a need we feel as computer theorists who have used strongly-typed programming languages.
This means we want to understand $\zeta$ as a power series. No such series appears in the papers, but given Atiyah's surrounding references to power series it must be intended. So we will try our best to supply what is indicated.
A power series for $\zeta$ differs from its original summation formula $\sum_n n^{-s}$ by having $s$ in the bases rather than the exponents. There are several ways to represent $\zeta$ as a power series. Actually, from above we want a series for the function $\zeta$ (to be applied to $s$), which may or may not have the same effect as the function value $\zeta(s)$
with $s$ fixed. Since we are not trying to be perfect, we will mention the Laurent series around $s = 1$ as given here:

$\zeta(s) = \frac{1}{s-1} + \sum_{n=0}^{\infty} \frac{(-1)^n}{n!}\,\gamma_n\,(s-1)^n.$

Here the $\gamma_n$ are constants named for the Dutch mathematician Thomas Stieltjes, except that $\gamma_0$ is the Euler–Mascheroni constant $\gamma \approx 0.57722$. The next one is $\gamma_1 \approx -0.07282$. Expanding the series around $s = 1$ rather than $s = 0$ separates out the pole at $s = 1$ via the first term.
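The leading behavior of this Laurent expansion is easy to check numerically. The sketch below (a crude Euler–Maclaurin-style approximation of our own, not from the papers) estimates $\zeta(s) - \frac{1}{s-1}$ just to the right of $s = 1$ and recovers the Euler–Mascheroni constant to a few decimal places:

```python
import math

def zeta_approx(s, N=100000):
    """Crude approximation to zeta(s) for real s > 1: a partial sum of
    n**-s plus the integral tail N**(1-s)/(s-1). Rough, but accurate
    enough here to see the Laurent behavior near s = 1."""
    partial = sum(n ** (-s) for n in range(1, N + 1))
    return partial + N ** (1 - s) / (s - 1)

# zeta(s) - 1/(s-1) should approach gamma ~ 0.57722 as s -> 1.
gamma_est = zeta_approx(1.01) - 1 / 0.01
```

The small remaining discrepancy is of order $\gamma_1 \cdot (s-1)$, consistent with the next term of the expansion.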
If some particular property of a power series for $\zeta$ affects the application of $T$, then this insulates against the charge that no special property of $\zeta$ is being used. To be sure, no special property is evident in the papers, and the burden to state one is on the claimer, but an intent along these lines is more likely than a blank slate.
Now the same logic must apply to two numbered equations that appear between the two we juxtaposed in the last section. They are extra-confusing because now "$T$" is written as a simple function without curly braces. Here they are as they appear, including a cryptic "or":
Which is it, 2.6 or 2.7? A remark just before these latter two equations hints at the answer "both":
Remark. Weakly analytic functions have a formal expansion as a power series near the origin. Formula 2.6 is just the linear approximation of this expansion (more precisely this is on the branched double cover of the complex -plane given by ).
So what is going on involves approximation of a power series. Thus "$T$" must be carrying out a linear approximation of a series. Hence "$T(s)$" needs to be read this way. It is hard to read the right-hand side of 2.6 and the left-hand side of 2.7 with $s$ inside the square root, but we can use them to substitute:
which per above really means
with a power series for $\sqrt{1+s}$ as the argument for $T$. The Maclaurin series expansion (as given here) is

$\sqrt{1+s} = 1 + \frac{s}{2} - \frac{s^2}{8} + \frac{s^3}{16} - \cdots$

Taking away the super-linear terms leaves $1 + \frac{s}{2}$, which by the same intent equals the stated value. That the whole series converges provided $|s| < 1$ confers some legitimacy.
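The quality of that linear truncation is easy to quantify numerically (the helper name below is ours):

```python
import math

def linear_truncation_error(s):
    """Error of the linear truncation 1 + s/2 of sqrt(1+s); the dropped
    terms begin at -s^2/8, so the error shrinks quadratically in s."""
    return abs(math.sqrt(1 + s) - (1 + s / 2))

errors = [linear_truncation_error(s) for s in (0.1, 0.01, 0.001)]
```

Each tenfold shrink of $s$ cuts the error by roughly a factor of one hundred, as the $s^2/8$ leading error term predicts.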
Here is a screenshot of the climax on page 3:
The key line is the one saying, "Now take … in 2.6"—where 2.6 refers not to the equation with that number but to the one we've labeled '(A)', which is in paragraph 2.6 of his paper. The important point in this substitution is not that the substituted expression is a numerical function on $\mathbb{C}$ but rather that it is to be treated as "a power series with no constant term." This means that an application of $T$ is given as argument for another application of $T$.
We can’t claim to have connected all the dots. We haven’t even connected the factor in from the square-root expansion (subtracting off the constant term ) to its claimed use to get . Taken at face value, the latter holding on any open region entails that must be linear. But connecting more dots helps to see fault lines more clearly, both for Atiyah’s papers and attempts on RH in general.
The emphasis on linearity in our exposition sharpens the kind of objection raised by Luboš Motl in his review: Take any two zeroes $\rho_1$ and $\rho_2$ close by each other on the critical line, take a small $\epsilon > 0$, and define a perturbed function as Motl does.
This exchanges two genuine zeroes of $\zeta$ for two mirror-image new zeroes that are off the line by $\epsilon$, and likewise for their complex conjugates. The zeroes and $\epsilon$ are chosen to minimize the effect on series expansions of the perturbed function compared to $\zeta$. Would the discrepancy affect the coefficient of its new linear term compared to that of the original linear term for $\zeta$? Surely, not enough is said about what is in the relevant series to tell, nor about any other way to distinguish the perturbed function from $\zeta$. But Motl's example and our attention to series have at least channeled the question.
There are numerous other issues with the papers. Regarding the assertions about the fine structure constant, perhaps the argument is best left to physicists, but we note a 2010 paper by Giuseppe Dattoli. It is titled, "The Fine Structure Constant and Numerical Alchemy" and gives both a historical survey and a would-be simple formula for it. Both Dattoli and Atiyah have references to Kurt Gödel at the end of their papers. Just before the latter is a sentence that is wrong at face value: "To be explicit, the proof of RH in this paper is by contradiction and this is not accepted as valid in ZF, it does require choice." On the contrary, RH has purely arithmetic formulations—indeed with only one unbounded quantifier per reference to Jeff Lagarias here—and all arithmetic statements (and more) provable in ZFC are provable in ZF. Nor is "by contradiction" an issue for ZF. Atiyah's next sentence, however, talks about "most general versions" of RH and his concern about choice might transfer to them.
Finally, we remind that some key ingredients in the essay on RH by Alain Connes, which we mentioned in the previous post, involve analyzing operators that, like $T$, are idempotent. These have great sophistication. More down-to-earth, a calculation by Ken at the end of this recent post gives a motive for cutting off terms above linear order when they are small. Those terms don't vanish in the real world but calculating in spaces where they do vanish may help clarify real behavior of limits involving them.
Is the proof’s idea okay or not? Does have the properties that are claimed? The general idea of a “Todd” approach to the RH seems at least to be an interesting idea. Can we make a list of properties that a function must have to shed light on the RH? Are we right that the Todd function is not defined on complex numbers, but is defined on functions represented by series? The most accessible reference we have found is chapter 5 of this 2004 thesis linked from this StackExchange discussion.
Why the Riemann hypothesis is hard and some other observations.
ICM 2018 “Matchmaking” source |
Michael Atiyah, as we previously posted, claims to have a proof that the Riemann Hypothesis (RH) is true. In the second half of a 9:00–10:30am session (3:00–4:30am Eastern time), he will unveil his claim.
Today we discuss what the RH is and why it is hard.
Of course we hope that it was hard and now is easy. If Atiyah is correct, and we at GLL hope that he is, then the RH becomes an “easy” problem.
Update 9/24, 8am: The website Aperiodical, whose post on RH is mentioned below, has a Twitter thread on the talk and another with all the slides. Atiyah has released a short paper with the main technical work contained in a second paper, "The Fine Structure Constant." This Reddit thread has Python code for a short computation relevant to the latter. A typo in the main slide is corrected in the paper. The analysis rests on a mapping named for John Todd and used by Friedrich Hirzebruch. It is variously represented as a mapping and as an operator on power series, and in what is evidently the latter form yields the following simple complex function in composition with zeta:
The claim is that, for a hypothetical zero $s$ in the critical strip off the critical line, the analyticity of the composed function at $s$ yields the contradiction of its vanishing everywhere. We will defer further comment at this time, but there are opinions in the links above.
In its original form, the RH concerns the simple zeta function of one complex variable $s$:

$\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s}.$
Leonhard Euler analyzed this function for positive integer values of $s$ and proved that $\zeta(2) = \frac{\pi^2}{6}$. Further values quickly followed for positive even numbers, but the nature of $\zeta(3)$ remained open until Roger Apéry proved it irrational in 1978.
As a function the sum converges provided the real part of $s$ is greater than $1$, but for other complex $s$ it is definable by analytic continuation. This yields the surprising—perhaps—values $\zeta(0) = -\frac12$, $\zeta(-1) = -\frac{1}{12}$, and $\zeta(-2n) = 0$ for all positive integers $n$. The latter are the trivial zeroes of zeta. In complex analysis analytic continuation is a method that extends a function—when possible—to more values. This extension is one reason we believe that the RH is hard. The RH is the statement:
All other zeroes $s$ of $\zeta$ have real part $\frac12$, that is, $s = \frac12 + it$ for some real $t$.
That is to say, there are other zeroes besides the trivial ones. The behavior of $\zeta$ near those zeroes is not only complex but universally so. The impact of analyzing that behavior was presaged by Leonhard Euler's discovery

$\zeta(s) = \prod_{p\ \mathrm{prime}} \frac{1}{1 - p^{-s}}.$
For intuition, consider how the geometric series $\sum_{k \ge 0} p^{-ks} = \frac{1}{1 - p^{-s}}$ converges for $s > 1$, and picture the product of all these series over each prime $p$. Every finite term in the product gives the prime factorization of a unique $n$, and the exponent $-s$ merely carries through.
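One can watch the product converge numerically. For $s = 2$ the partial Euler product over primes up to 10,000 approaches $\zeta(2) = \pi^2/6$ from below; the code and the cutoff are our illustration:

```python
import math

def primes_up_to(n):
    """Sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = [False] * len(range(p * p, n + 1, p))
    return [p for p, flag in enumerate(sieve) if flag]

def euler_product(s, cutoff):
    """Partial Euler product prod_p 1/(1 - p**-s) over primes <= cutoff."""
    prod = 1.0
    for p in primes_up_to(cutoff):
        prod *= 1.0 / (1.0 - p ** (-s))
    return prod

approx = euler_product(2, 10000)   # approaches zeta(2) = pi^2/6 from below
```

Every omitted prime factor is a number slightly below $1$, which is why the partial product undershoots the limit.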
The way in which $\zeta$ encodes the "gene sequence of the primes" endures when $s$ is complex. There are many equivalent forms of RH, many having to do with tightness of upper or lower bounds for things—a staple of analysis but also what we like to talk about in computational complexity theory. One of our favorites is described in this post:

$\left|\sum_{n \le x} \mu(n)\right| = O(x^{1/2+\epsilon}) \quad\text{for every } \epsilon > 0.$
Here $\mu$ is the Möbius function, named for the same August Möbius as the strip, and defined by $\mu(n) = 0$ if $n$ is divisible by the square of a prime, and otherwise $\mu(n) = (-1)^k$ where $k$ is the number of distinct prime factors of $n$.
In this and other ways, the zeroes of $\zeta$ govern regularities of the distribution of the primes. One rogue zero off the line causes enough ripples to disturb the bound. But the ripples are not tsunamis, and the tight RH bound is demanding a lot: a mere $o(x)$ bound is equivalent both to the Prime Number Theorem (PNT) and to no zero wandering over as far as the line of real part $1$.
Even wider ramifications emerge from this essay by Alain Connes. Connes like Atiyah is a Fields Medalist (1982) and among his later work is an extension of the famous index theorem proved by Atiyah with Isadore Singer. Connes's essay even goes into the wild topic of structures that behave like "fields of characteristic one"—but maybe not so wild, since Boolean algebra, where $1 + 1 = 1$, is a simple and familiar example.
We thought we would try to give some intuition why the RH is hard. Of course, like all open problems, it's hard because we have not yet proved it. Every open problem is hard. But let's try to give some intuition why it is hard. Enough about "hard."
Consider any quantity $S$ that is defined by a formula of the summation type:

$S = \sum_{n} a_n.$
Now showing that $S$ is not zero is quite difficult in general. Even if the summation is finite, showing it does not sum to $0$ is in general a tough problem.
Imagine that $S$ is also equal to a product type:

$S = \prod_{n} b_n.$
Now showing that $S$ is not zero is not so impossible: If the product were a finite one, then you need only show that each term is not zero. If the product is an infinite product, then it's harder. But there is at least hope.
As above, the Riemann function $\zeta(s)$ is indeed given by a summation as its definition. But it also is given by the product type formula over the primes. This formula is the key to proving even weaker-than-RH bounds on where the Riemann zeroes are located, such as for PNT.
Back to $\zeta$ as a product. Take logarithms naively and see that we can replace studying $\prod_p (1 - p^{-s})^{-1}$ by

$-\sum_p \log\left(1 - p^{-s}\right).$
This can be made to work. But. One must be quite careful when handling the logarithm over the complex numbers. Put simply, the logarithm function is multi-valued. It is only defined up to a multiple of $2\pi i$ and so must be handled very carefully.
This is the same reason that the identity $\log(z_1 z_2) = \log z_1 + \log z_2$ can fail for complex numbers. Apparently this phenomenon is one reason that many attempts at the RH have failed. At some point the proof computes two quantities, say $A$ and $B$, and concludes incorrectly that they are equal. But in reality

$A = B + 2\pi i,$

for example. Of course there is no way that this is a mistake that Atiyah would make, but many others have run into this issue.
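The wrap-around of the principal logarithm is easy to exhibit with Python's `cmath`: two arguments whose sum passes $\pi$ produce exactly a $2\pi i$ discrepancy.

```python
import cmath

# Principal logs: log(z1) + log(z2) need not equal log(z1 * z2);
# the two sides can differ by a multiple of 2*pi*i.
z1 = cmath.exp(2j)                   # a unit complex number with argument 2
z2 = cmath.exp(2j)                   # the product z1*z2 has argument 4 > pi
lhs = cmath.log(z1) + cmath.log(z2)  # = 4i
rhs = cmath.log(z1 * z2)             # principal value: (4 - 2*pi)i
diff = lhs - rhs                     # = 2*pi*i, up to rounding
```

A proof that silently equates `lhs` and `rhs` is making precisely the error described above.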
It may help also to discuss a product type formula that is classic:

$\sin z = z \prod_{n=1}^{\infty} \left(1 - \frac{z^2}{n^2 \pi^2}\right).$
Note that this converges for all values of $z$. This helps one see that the zeroes of the sine function are exactly as you expected: multiples of $\pi$. If there were a product formula like this for the zeta function, the RH would be easy. Of course the known product formulas for zeta are only convergent for values $s$ that have real part greater than $1$. Too bad.
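The classic sine product can be verified numerically; truncating at $N$ factors leaves a relative error of roughly $z^2/(N\pi^2)$:

```python
import math

def sine_product(z, terms=100000):
    """Partial Euler product sin(z) = z * prod_{n>=1} (1 - z^2/(n^2 pi^2)),
    truncated after `terms` factors."""
    prod = z
    zz = z * z
    for n in range(1, terms + 1):
        prod *= 1.0 - zz / (n * n * math.pi * math.pi)
    return prod

approx = sine_product(1.0)   # compare with sin(1) = 0.84147...
```

Each factor vanishes exactly at $z = \pm n\pi$, which is how the product makes the zeroes of sine visible term by term—the transparency that zeta lacks.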
On the eve of the lecture, we have not found much more information since the news broke early Thursday. There is a more-visual description of RH by Katie Steckles and Paul Taylor, who are attending the Heidelberg Laureate Forum and will be blogging from there. But we have a couple of meta-observations.
One is to compare with how Andrew Wiles’s June 1993 announcement of a proof of Fermat’s Last Theorem (FLT) was handled. Wiles was given time for three long lectures on three days of a meeting at Cambridge University under the generic title, “Elliptic Curves and Galois Representations.” He kept his cards in his hand during the first two lectures as he built the tools for his proof; the rumors of where it was going ramped up mainly before the third.
His proof had already been gone over in depth by colleagues at Princeton, initially Nicholas Katz, with John Conway helping to keep the process leak-free. Nevertheless, the proof contained a subtle error in an estimate that was not found until Katz posed followup questions three months later. Repairing the gap took a year with assistance by Richard Taylor and required a change in the strategy of the proof. We draw the following observations and contrasts:
Our final observation—have we said it already?—is that RH is hard. How hard was impressed on Ken by a story he recollects as having been told by Bernard Dwork during an undergraduate seminar at Princeton in 1980 on Norman Levinson’s proof that over one-third of the nontrivial zeroes lie on the “critical line” of real part $\frac{1}{2}$. Different versions may be found on the Net but the one Ken recalls went approximately this way:
After giving up on a lifetime of prayers to prove Riemann, an already-famous mathematician turned to the other side for help early on a Monday. The usual price was no object, but the Devil said that because of the unusual subject he could not offer the usual same-day service. The contract was drawn up for delivery by Saturday midnight. Projecting his gratitude, the mathematician arranged a private feast for that day and exchanged his usual rumpled clothes for a tailored suit to match the Devil’s dapper figure. The sun set as he poured his wine and kept a roast pig and fine fare on the burner, but there was no sign of his companion. The clock struck 9, 10, 11, and the minute hand swept round the dial. Suddenly at the first chime a sulfurous blast through a window revealed that the Devil had missed his landing point by a dozen yards. The mathematician opened his door and through it staggered an unshaven frazzled figure, horns askew and parchments akimbo, pleading: “I just need one more lemma…”
The moral of the story is the same as in other versions: there are no brilliant mathematicians Down There.
Well we will see soon if the RH is still hard or if it is now easy—or on the road to easy. It is rare to have such excellence in our mathematical endeavors. We hope that the talk is sufficiently clear that we will be able to applaud the brilliance of Atiyah. We wish him well.
[Added update at top and made separate section “About RH”; 9:50am changed general “a” in “as” to “2” since 2 is special in the paper; fixed Euler product formula for sines; expanded observations about the Todd function in the intro including using braces in the equation defining ]
Cropped from London Times 2017 source |
Michael Atiyah is giving a lecture next Monday morning at the Heidelberg Laureate Forum (HLF). It is titled, simply, “The Riemann Hypothesis.” An unsourced image of his abstract says what his title does not: that he is claiming not only a proof of Riemann but a “simple proof using a radically new approach.”
Today we discuss cases where theorems had radically simpler proofs than were first contemplated.
Sir Michael’s talk is at 9:45am Central European time, which is 3:45am Eastern. It will be live-streamed from the HLF site and will appear later on HLF’s YouTube channel. We have not found any other hard information. The titles of all talks are clickable on the program, but abstracts are not yet posted there.
Preceding him from 9:00 to 9:45am, the opening talk of the week’s proceedings, is John Hopcroft whom we know so well. If you’ve heard of the expression “a hard act to follow,” John has the opposite challenge. He is speaking on “An Introduction to AI and Deep Learning.”
Although all speakers are past winners of major prizes—Abel, Fields, Nevanlinna, Turing, and the ACM Prize in Computing—the forum is oriented forward to inspire young researchers. Our very own graduate Michael Wehar received support to attend last year, and one of this year’s young attendees is John Urschel, whom we profiled here.
Naturally, in a blog named for Kurt Gödel we should lead with him as an example. There is an often-overlooked component to the title of his famous paper on incompleteness theorems in logic:
Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I.
The component does not need to be translated. It is the “I” at the end. It does not mean “System 1”—it means that this is Part One of a longer intended paper. Gödel wrote the following at the very end on page 26—OK, now we will translate:
In this work we’ve essentially limited attention to System $P$ and have only hinted at how it applies to other systems. The results in full generality will be enunciated and proved in the next installment. That work will also give a fully detailed presentation of Theorem 11, whose proof was only sketched here.
Gödel’s “Part II” was slated to stretch over 100 pages. It never appeared because the essence of Gödel’s argument was perceived and accepted quickly, details notwithstanding. Once Alan Turing’s theory of computation emerged six years later, it became possible to convey the essence in just one page:
The condition that the system is sound is technically stronger than needed, but Gödel didn’t optimize his conditions either—Barkley Rosser found a simple trick by which it suffices that the system be consistent. The Turing-based rollout can be optimized similarly. For one, we can define in such a way that the falseness of “” has a finite demonstration, and can specify “effective” so that exceptional cases are computed. For a fully sharp treatment see pages 34 and 35 of these notes by Arun Debray—scribing Ryan Williams—which I am using this term.
The relevant point of comparison is that Gödel’s essence lay unperceived for decades while the leading lights believed in the full mechanizability of mathematics. Had it slept five years more, the lectures by Max Newman that Turing attended in the spring of 1935 might have ended not with Gödel but with more on the effort to build computing machines—which Newman later joined. See how David Hilbert, more than Gödel, animates Andrew Hodges’s description of Turing’s machines on pages 96–107 of his famous biography, and see what Hodges writes in his second paragraph here. The lack of Gödel might have held Alonzo Church back more.
Then Turing would have burst out with a “simple” refutation of Hilbert’s programme via “a radically new approach.” Equating “simple” with “one page” as above is unfair—Turing computability needs development that includes Gödel’s pioneering idea of encoding programs and logical formulas. But the analogy with Atiyah still works insofar as his use of “simple” is evidently based on the following works:
We have on several occasions noted the appreciation that methods originating in quantum computing have gained wide application in areas of complexity theory that seem wholly apart from quantum. The quantum flow may run even deeper than anyone has thought.
Abel-Ruffini Theorem. Whether fifth-degree polynomial equations have a closed-form solution had been open far longer than Riemann has been. Paolo Ruffini in 1799 circulated a proof manuscript that ran over 500 pages and yet was incomplete. Niels Abel, whom we just mentioned, in 1824 gave a full sketch in only six pages, which were later expanded to twenty. This was still six years before the nature of the impossibility was enduringly expounded by Évariste Galois.
Lasker-Noether Theorem. Emanuel Lasker was not only the world chess champion but also a PhD graduate in 1900 of Max Noether, who was Emmy Noether’s father. In 1905, Lasker published a 97-page paper proving the primary decomposition theorem for polynomial ideals. Sixteen years later, Emmy Noether proved a more powerful theorem in a paper of 43 pages. But as Wikipedia’s list of long proofs states,
Lasker’s [proof] has since been simplified: modern proofs are less than a page long.
We invite readers to suggest further favorite examples. This MathOverflow thread lists many cases of finding shorter proofs, but we want to distinguish cases where the later simpler proof ushered in a powerful new theory.
The flip side is when the simpler proof comes first. These are harder to judge especially in regard to whether longer “elementary” methods could have succeeded. The Prime Number Theorem could be regarded as an example in that the original analytic proofs are elegant and came with the great 19th-century tide of using analysis on problems in number theory.
Of course the main open problem will be whether Sir Michael’s claims are correct. Even if he walks them back to saying just that he has a new approach, its viability will merit further investigation.
Coincidentally, we note today’s—now yesterday’s—Quanta Magazine article on mounting doubt about the correctness of Shinichi Mochizuki’s claimed 500-page proof of the ABC Conjecture. The challenge comes from Peter Scholze of Bonn and Jacob Stix of Frankfurt. Luboš Motl today has coverage of both stories. Scholze, a newly minted Fields Medalist, will also attend in Heidelberg next week.
Updates 9/21: Stories by IFLScience! (which I neglected to link above) and by the New Scientist. The latter has new quotes by Atiyah. John Cook has more details about both Riemann and ABC. A MathOverflow thread has been started.
Famous Mathematicians source |
Niels Abel is of course a famous mathematician from the 19th century. Many mathematical objects have been named after him, including a type of group. My favorites, besides groups, are: Abel’s binomial theorem, Abel’s functions, and Abel’s summation formula. Not to mention the prize named after him, for which we congratulate Robert Langlands.
Today we will talk about commutative groups and a simple result concerning them.
A commutative group is one where the order of multiplication does not affect the value. More formally, for each $a$ and $b$ in the group,

$$ab = ba.$$
A group with this property is called an abelian group. Following our friends at Wikipedia we note that “abelian” is correctly spelled:
Among mathematical adjectives derived from the proper name of a mathematician, the word “abelian” is rare in that it is often spelled with a lowercase “a”, rather than an uppercase “A”, indicating how ubiquitous the concept is in modern mathematics.
The following is a classic but easy fact from group theory.
Lemma 1 Suppose that all the elements of a group $G$ have order at most $2$. Then $G$ is abelian.
For instance, it is one of the first exercises in the famous John Rose book, A Course on Group Theory.
What struck me recently is: could this lemma be optimal? Why do we require all elements to have order at most $2$? Why not change “all” to “most”?
My instant idea was to search for a reference via Google. But at first I could not find anything relevant so I decided to do it myself.
So let’s see what we can prove. Since we need some facts about commutative laws in groups, let’s look at a proof of the above lemma.
Proof: Suppose that $a$ and $b$ are in a group and that $a^{2} = e$ and $b^{2} = e$ and $(ab)^{2} = e$. Then

$$ab = a(ab)(ab)b = a^{2}(ba)b^{2} = ba.$$

This shows that the group is abelian, since $a$ and $b$ are arbitrary elements.
This proof is “local” in the sense of involving only $a$ and $b$, though they range over the whole group. It really is the following rule:
Commute Rule: Let $a$ and $b$ be such that $a^{2} = e$, $b^{2} = e$, and $(ab)^{2} = e$. Then $ab = ba$.
Our next insight is that we need a way to bound how often $ab = ba$ can hold if a group is not abelian. Luckily this is well studied:
Lemma 2 Suppose that $G$ is a non-abelian finite group. Define $p(G)$ as the probability that two randomly chosen elements $a$ and $b$ from $G$ satisfy $ab = ba$. Then $p(G) \le \frac{5}{8}$.
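The $\frac{5}{8}$ bound is tight; a brute-force check on the dihedral group of order eight, built here as permutations of a square’s corners:

```python
from itertools import product

# Dihedral group of order 8 as permutations of the square's corners
# 0..3, written as tuples: g[i] is the image of corner i.
e = (0, 1, 2, 3)
r = (1, 2, 3, 0)                 # rotation by 90 degrees
s = (1, 0, 3, 2)                 # a reflection

def mul(g, h):                   # composition: apply h first, then g
    return tuple(g[h[i]] for i in range(4))

G = {e, r, s}
while True:                      # close the generating set under products
    new = {mul(g, h) for g, h in product(G, G)} - G
    if not new:
        break
    G |= new

commuting = sum(1 for g, h in product(G, G) if mul(g, h) == mul(h, g))
print(len(G), commuting / len(G) ** 2)   # 8 0.625  -- exactly 5/8
```

So the dihedral group of order eight attains the bound exactly: 40 of its 64 ordered pairs commute.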
Our plan is simple: let’s use the Commute Rule in conjunction with this lemma. Here is our argument.
Proof: Let $x^{2} = e$ for all $x$ in a subset $S$ of the finite group $G$. Now pick $a$ and $b$ randomly. If $a$ and $b$ and $ab$ are all in $S$ it follows by the Commute Rule that $ab = ba$.

Let $q$ be the probability that $a$ and $b$ and $ab$ are all in $S$. Thus if $G$ is not abelian it follows that

$$q \le p(G) \le \frac{5}{8}.$$

This implies that $S$ is not too big.
Let’s bound $1 - q$. This is clearly by the union bound at most

$$\Pr[a \notin S] + \Pr[b \notin S] + \Pr[ab \notin S].$$

Note, $\bar{S}$ is the complement of the set $S$, and $ab$ is uniformly distributed when $a$ and $b$ are. Thus we get that $1 - q$ is at most

$$\frac{3|\bar{S}|}{|G|}.$$

Hence, $q$ is at least

$$1 - \frac{3|\bar{S}|}{|G|}.$$

This implies that

$$1 - \frac{3|\bar{S}|}{|G|} \le \frac{5}{8},$$

and so that

$$\frac{3|\bar{S}|}{|G|} \ge \frac{3}{8}.$$

Next,

$$\frac{|\bar{S}|}{|G|} \ge \frac{1}{8}.$$

This implies finally that

$$\frac{|S|}{|G|} \le \frac{7}{8},$$

which means that if the group is not abelian

$$|S| \le \frac{7}{8}|G|.$$
Thus we have proved:
Theorem 3 Suppose that $G$ is a finite non-abelian group. Then at most $\frac{7}{8}|G|$ elements of $G$ can have order at most $2$.
Of course it should be clear that this simple argument must be known. Well, I eventually found via Google search that similar results were indeed long known. However, the proofs of these results were not so simple as the above—at least in my opinion. Of course I have always felt that “clarity” is just another word for “it’s the argument I wrote.”
Here are some of the earlier results. For starters it is known that the “correct” answer is

$$\frac{3}{4}|G|.$$
Here are two references. The abstract of the former says:
One of the first exercises in group theory is that a group in which all non-identity elements have order two (so-called involutions) is abelian. An almost equally easy exercise states that a finite group is abelian if at least $\frac{3}{4}$ of its elements have order two. This cannot be improved, as the dihedral group of order eight, as well as its direct product with any elementary abelian group, provides examples of groups in which the number of involutions is exactly one less than $\frac{3}{4}$ of the group order.
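The dihedral example in the abstract can be verified directly; a sketch that also checks the direct product with $\mathbb{Z}_2$:

```python
from itertools import product

# Dihedral group of order 8 as permutations of the square's corners 0..3.
e = (0, 1, 2, 3)
r = (1, 2, 3, 0)                 # rotation by 90 degrees
s = (1, 0, 3, 2)                 # a reflection

def mul(g, h):                   # apply h first, then g
    return tuple(g[h[i]] for i in range(4))

G = {e, r, s}
while True:                      # close under multiplication
    new = {mul(g, h) for g, h in product(G, G)} - G
    if not new:
        break
    G |= new

involutions = [g for g in G if g != e and mul(g, g) == e]
print(len(G), len(involutions))     # 8 5: one less than 3/4 of the order

# Direct product with Z_2: the count scales the same way.
H = {(g, z) for g in G for z in (0, 1)}
def mul2(a, b):
    return (mul(a[0], b[0]), (a[1] + b[1]) % 2)
inv2 = [h for h in H if h != (e, 0) and mul2(h, h) == (e, 0)]
print(len(H), len(inv2))            # 16 11: again one less than 3/4
```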
There is also the paper by Bin Fu, “Testing Group Commutativity in Constant Time,” which sharpened some old work with Zeke Zalcstein as I described in this post.
These references all use the full power of group theory. Note that our DIY argument merely substitutes the hypothesis $p(G) \le \frac{5}{8}$ for most of it. Yet it really only misses the optimal answer by a small amount—recall we got $\frac{7}{8}$. If all we want is to bound the fraction away from $1$, then we have succeeded.
More precisely we have shown that we can replace group theory knowledge by randomness arguments. This is a recurrent theme that we have seen before. OK—not all the group theory knowledge: the argument for the “$\frac{5}{8}$” lemma uses quotients and properties of cyclic groups. But it is simpler than the references for the optimal result. And Ken noticed something else.
Ken noticed that since the equations use no inverses, the DIY argument works equally well in a monoid.
What a monoid lacks compared to a group is inverses for every element. Commutative monoids are studied but they are usually called just that, not “abelian.” Most intriguingly, a notion of “almost commutative monoid” crops up in computing—in the theory of concurrent processes. It even has a simpler name with a Wikipedia page: “trace monoid.”
The DIY argument does imply the following: Let $c$ be a number such that for any monoid (of a certain kind) in which more than a $c$ fraction of pairs commute, the monoid is commutative. Then in any non-commutative monoid (of that kind), the fraction of involutions is at most

$$\frac{2 + c}{3}.$$
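This bound follows by the same union-bound arithmetic as before, assuming the monoid is of a kind where the product of two uniformly random elements is again uniform (as in a group). Writing $\alpha$ for the fraction of elements of order at most two and $q$ for the probability that $a$, $b$, and $ab$ all land among them:

```latex
1 - q \;\le\; 3(1 - \alpha)
\quad\Longrightarrow\quad
q \;\ge\; 1 - 3(1 - \alpha).
```

Non-commutativity forces $q \le c$, so $1 - 3(1-\alpha) \le c$, i.e. $\alpha \le \frac{2+c}{3}$; plugging in $c = \frac{5}{8}$ recovers the $\frac{7}{8}$ of Theorem 3.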
At first it was hard to Google for information about whether any $c$ with $c < 1$ is known. This was mainly because of extensive literature on monoids plus an involutive operation that acts on them. An example is the monoid of strings over an alphabet and the operation of reversing a string. You can make such monoids finite by taking strings modulo a Myhill-Nerode type equivalence relation based on a minimal deterministic finite automaton $M$ (for instance, identify strings $x$ and $y$ when they induce the same mapping on states of $M$). These hits shadowed ones about monoids having involutions as elements.
So we put our non-Google brains to work—and those had the feeling that over all monoids there is no such $c < 1$. Ken thought he had a simple proof of this that involved making products of DFAs, but it drove up the proportion of commuting pairs only to a fixed constant short of $1$.
Finally Google found us two references, the former in 1999 giving non-commutative monoids in which more than $\frac{5}{8}$ of the pairs commute, and the latter in 2012 pushing the commuting fraction arbitrarily close to $1$. The proof in the latter is not so simple.
We get $\frac{7}{8}$ and just miss the optimal answer, which is $\frac{3}{4}$. Our proof relies mostly on a well-known fact about groups and the rest is a probabilistic argument. Can we get the optimal result with a finer probabilistic argument? And what happens with probabilistic arguments over monoids?
Issues AlphaZero doesn’t need to deal with
ETS source |
Frederic Lord wrote a consequential doctoral dissertation at Princeton in 1951. He was already the director of statistical analysis for the Educational Testing Service, which was formed in Princeton in 1947. All the scoring of our SATs, GREs, and numerous other standardized tests has been influenced both by his application of classical test theory and his development in the dissertation of Item Response Theory (IRT).
Today we discuss IRT and issues of scaling that arise in my chess model. The main point is that the problems are ingrained, and beautiful observed regularities burnish them rather than fix them.
This post is long, but has other takeaways including how ability in chess identifies with scaling up the perception of value, yet how value may be a detour for training chess programs, and how the presence of logistic curves everywhere doesn’t mean your main quantities of interest will follow them.
The basic component of IRT is a curve $P(\theta)$, in which $\theta$ is a measure of aptitude or tendency and $P(\theta)$ is the expected test score of somebody described by $\theta$. Each item—for instance, a single question on a test or a reading of sentiment—has its own curve that looks like one of the following:
source |
Each curve has two main parameters: the placement $b$ of its symmetry point on the $\theta$-axis and its slope $a$ at that point. The diagram shows all three curves centered at the origin, so $b = 0$, but this need not be so. Shifting a curve right lowers the expectation for every $\theta$ and represents a question being more difficult; shifting it left represents an easier item. The steeper the slope, the greater the discrimination between levels of ability or tendency. A third parameter given equal standing by ETS is guessability $c$. One could expect to score at least 20% on the old SAT (without the present wrong-answer penalty of 0.25) just by random guessing, so the curves might be given a lower asymptote of $c = 0.2$. Axiomatically this need not shift the expectation at $\theta = b$ up from 50% to 60%, but that is the effect of the popular logistic formula for the curves:

$$P(\theta) = c + \frac{1 - c}{1 + e^{-a(\theta - b)}}.$$
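The three-parameter logistic item curve is easy to play with in code; a minimal sketch with the parameter roles as just described:

```python
import math

def irt_3pl(theta, a=1.0, b=0.0, c=0.2):
    """Three-parameter logistic item curve:
    P(theta) = c + (1 - c) / (1 + exp(-a*(theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# At theta = b the curve sits at c + (1-c)/2; with c = 0.2 that is 0.6:
print(round(irt_3pl(0.0), 6))                        # 0.6
# Shifting b right (a harder item) lowers the expectation at fixed theta:
print(irt_3pl(0.0, b=1.0) < irt_3pl(0.0, b=0.0))     # True
# A steeper slope discriminates more sharply away from the center:
print(irt_3pl(2.0, a=2.0) > irt_3pl(2.0, a=0.5))     # True
```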
Our discussion starts with the scale of the $\theta$-axis. It is not in units of grade-point average or any medical reading. It presumes that the population has $\theta$ normally distributed around some mean $\mu$ with some standard deviation $\sigma$. The values $-2$ and $+2$ shown on the $\theta$-axis in the figure thus represent the “95%” interval around the mean. When aggregating large samples of test results one can infer this interval from the middle 95% of the scores.
This plus the translation invariance of the curves facilitate putting offerings of different tests (or editions of a test) on a common scoring scale. That’s why you’re not scored on the actual % of SAT or GRE questions you got right. We will, however, find other places in the mechanics of models where absolute values are desired.
We just posted about exactly this kind of S-shaped curve, where, however, $\theta$ represents the difference in ability of a chess player and one’s opponent. The value $P(\theta)$ still represents the scoring expectation of the player. The curve has slope intended to confer a special meaning to the difference $\theta$ on the standard Elo rating scale of chess ability.
source |
Incidentally, this figure from our previous post shows that it does not make too much difference whether the S-curve is logistic as above (red) or derived from the normal distribution (green); there is a well-known conversion of about $1.7$ between their slope units. Using the logistic version does not countermand the assumption that the population’s ability levels are normally distributed.
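That conversion can be checked numerically; a sketch comparing the logistic curve with slope factor $1.7$ against the normal CDF:

```python
import math

def logistic(x):
    """Logistic CDF with the classic 1.7 slope-matching factor."""
    return 1.0 / (1.0 + math.exp(-1.7 * x))

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Maximum gap over a fine grid; classically this stays below 0.01.
gap = max(abs(logistic(x) - normal_cdf(x))
          for x in (i / 100.0 for i in range(-400, 401)))
print(gap)
```

The maximum discrepancy is a bit under $0.01$, which is why the choice between the two curve families matters so little in practice.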
The online playing site Chess.com maintains ratings for over 4.7 million players, ten times as many as the World Chess Federation, and it shows a mostly-normal distribution of ratings:
There are issues of skew: the right-hand tail is longer, higher-rated players play more games, and they are less likely to exit the population. The whole 4.7 million are skewed relative to humanity on the whole but one can also say this of SAT- and GRE-taking students. On the whole, the population assumptions of IRT apply.
The presence of an opponent differs from test-taking. There are “solitaire” versions of chess, and more broadly, compilations of chess-puzzle tests such as these by Chess.com. To be sure, the administration of those tests is not standardized. However, the whole “Intrinsic Ratings” idea of my chess model is that we can factor out the opponent by direct analysis of the quality of move choices made by “player $P$” in games. The administration of games in chess competitions is completely regular and draws consistent full attention from the players.
A second appearance of the S-shaped curve makes chess appear even more to conform to IRT. Amir Ban has argued that it is vital to chess programs. But we will see how the conformity is an illusion and how AlphaZero has exposed it as a digression. The curve has the same -axis but a different -axis representing position value rather than player ability. Here is an example from my previous post about these curves:
The $x$-axis represents the advantage or disadvantage $v$ for “player $P$” in so-called centipawn units (here divided by 100 to mesh with the colloquial idea of being “a Pawn ahead” etc.) and the $y$-axis shows the scoring frequency from positions of a given value $v$. The curve has been symmetrized by plotting both for the player to move and for the player not to move, so $f(-v) = 1 - f(v)$. The $a$ and $c$ parameters (conforming to Wikipedia’s usage—note also $b = 0$) are the same as in IRT. Here $c$ represents the frequency with which a player should have been checkmated but the opponent missed it; by symmetry the upper asymptote is not $1$ but $1 - c$ and represents the frequency of blowing a completely winning game. Note that this is real data from over 100,000 moves in all recorded games where both players were within 10 Elo points of the 2000 level—thus the incredibly good logistic fit has the force of natural law.
Cross-referencing the two curve diagrams and the observation that a superiority of 150 Elo points gives about 70% expectation leads to a meaningful conclusion:
For players in the region of 2000, having 150 points more ability is just like having an extra Pawn in your pocket.
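The 70% figure matches the standard logistic Elo expectation formula; a quick check (a sketch, assuming the usual 400-point logistic scale):

```python
def elo_expectation(diff):
    """Logistic Elo scoring expectation for a player rated `diff`
    points above the opponent, on the standard 400-point scale."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

print(round(elo_expectation(150), 3))   # 0.703 -- about 70%
print(elo_expectation(0))               # 0.5   -- equal players split
```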
This looks like a perfect correspondence between ability and advantage. But wait—there’s a catch: That’s only valid for players at the Elo 2000 level. The slope $a$, which governs the conversion, changes with the Elo rating $R$. So does $c$: weaker players blow more games. So the above “$a$” is really “$a(2000)$”. That’s where the sliding-scale problem enters:
The change in slope when drawn games are removed from the sample—indicative of games like Go and Shogi in which draws are rare—is even more pronounced:
Whereas the 70% prediction from a 150-point rating difference is valid everywhere on the scale, the value of a Pawn slides. Give an extra pawn to a tyro and it matters little. Give it to Magnus Carlsen, and even if you’re his challenger Fabiano Caruana, you may as well start thinking about the next game. The shifting slope is both the main correlate of skill and the conversion factor from the centipawn values given by chess programs. Skill can thus be boiled down to the rate of the conversion—the vividness of perception of value.
Why, then, say the value axis is a digression? Chess programmers put colossal effort into designing their evaluation functions and tuning them in thousands of trial games. Yet the real goal is not to find moves of highest value $v$ but rather moves giving the best expectation to win the game.
Monte Carlo tree search (MCTS) as employed by AlphaZero bypasses $v$ and trains its network by sampling results of self-play to optimize the expectation directly. The “which $R$?” problem disappears because it uses its evolving self as the standard. Not only the public Leela Zero project but the latest “MCTS” release of the commercial Komodo chess program have gone this route. As explained neatly by Bram Cohen, evidently earlier Komodo versions got boxed in to non-optimal minima of the design space. Cutting out the “middleman” avoids creating such holes.
My chess model is purposed not to design a champion computer chess program but to measure flesh-and-blood humans (as hopefully staying apart from champion computer chess programs). So I must grapple with dependence across all ratings $R$. Moreover, the values $v$ output by chess programs are the only chess-specific data my model uses.
A key observation I made early on is that the average magnitude of differences between the value of the best move and the value of the played move depends not only on the player’s rating $R$ but also on the value $v$ of the position. The higher $v$ is in absolute value, the higher are all the differences—markedly so. One might expect a higher average from playing conservatively when well ahead (like “prevent defense” in football) and taking risks when well behind, but the data shows a clean affine-linear dependence clear down to $v = 0$. See the quartet of graphs midway through this post. Per evidence here, I treat this as a matter of perception needing correction to make the differences less dependent on $v$—and the post shows both that it flattens fairly well and makes tangible improvements.
The correction is, however, artificial, computationally cumbersome, and hard to explain. A more natural scaling seems evident from the last section’s curves: take the difference in expectations rather than raw values. Namely, use

$$\delta = E(v_{\text{best}}) - E(v_{\text{played}}),$$

where $E$ is the logistic expectation curve.
A glance at the logistic curves shows the desired effect of damping differences when $v$ is away from $0$ and damping more the larger $|v|$ gets. The problem, however, is that $E$ has to be $E_{R}$ for some rating level $R$. Which $R$ should it be?
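The expectation scaling itself is simple to express; a sketch with hypothetical fit parameters for a single rating level (the constants are illustrative, not my model’s actual fit):

```python
import math

def expectation(v, a=1.0, c=0.04):
    """Symmetrized logistic scoring expectation at value v (in Pawns);
    `a` and `c` are hypothetical fit parameters for one rating level."""
    return c + (1.0 - 2.0 * c) / (1.0 + math.exp(-a * v))

def scaled_delta(v_best, v_played):
    """Difference in expectations rather than raw centipawn values."""
    return expectation(v_best) - expectation(v_played)

# A pawn dropped near equality costs far more expectation than the
# same pawn dropped when already three Pawns ahead:
print(scaled_delta(1.0, 0.0) > scaled_delta(4.0, 3.0))   # True
```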
I originally had a fourth reason for rejecting this approach: at the time, all these options gave inferior results to my existing scaling device. This enhanced my feeling against using a “reference 2000 player” in particular. Now my model has more levers to pull and the logistic-curve ideas are competitive, but still not compelling.
A simpler instance is that I want to measure the amount of challenge a player creates for the opponent. My “intrinsic rating” as it stands is primarily a measure of accuracy. It penalizes enterprising strategies, ones that the computer doing my data-gathering sees how to defang but a human opponent usually won’t. Having a “Challenge Created” measure that applies to any position (with the opponent to move) might even incentivize elite players to create more fight on the board.
Since my model already generates probabilities $p_{i}$ for every possible move $m_{i}$ by the opponent, the metric is well-defined by

$$\mathrm{CC} = \sum_{i} p_{i}\,\delta_{i},$$

namely the expected (scaled) loss of value in the position that the opponent was confronted with. Here $\delta_{i}$ (as opposed to in the last section) is from my rating-independent metric. But the $p_{i}$ depend on the rating $R$ used for the projection, so $\mathrm{CC}$ is really an ensemble $\mathrm{CC}_{R}$. Again we face the same choice of $R$ as before.
In work with Tamal Biswas covered here and here, we attempted to define a non-sliding measure in terms of “swings” in values of moves as the search deepens. This is still my desire but has been shackled by model-stability issues I’ve covered there and subsequently.
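Whichever rating level supplies the probabilities, the expected-loss metric itself is straightforward; a minimal sketch with hypothetical names and numbers:

```python
def challenge_created(probs, deltas):
    """Expected scaled loss  sum_i p_i * delta_i  over the opponent's
    legal moves -- a sketch; `probs` are model move probabilities and
    `deltas` the scaled value losses (delta = 0 for the best move)."""
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(p * d for p, d in zip(probs, deltas))

# Hypothetical position: the best move is found 60% of the time, and
# two inferior moves cost 0.3 and 0.9 in scaled units.
print(round(challenge_created([0.6, 0.3, 0.1], [0.0, 0.3, 0.9]), 4))  # 0.18
```

A position where the opponent’s plausible replies carry large expected loss is exactly the kind of “fight on the board” the measure is meant to reward.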
A related issue comes from my desire to test my model’s projected probabilities in positions that have been reached by many players across the spectrum of ratings. The SAT has no trouble here: the same question is faced by thousands of takers at the same time. But there is no such control in chess, and popular positions become “book” that many players—even amateurs—know. The weaker players have free knowledge of what the masters did—or nowadays of what computers say to do in them.
What I can do instead is cluster positions according to similar vectors of values $(v_{1}, \dots, v_{\ell})$. It is also legit to test my model by clustering the vectors of probabilities it generates. The high dimension $\ell$, the typical number of legal moves, can be reduced to a smaller $d$ by a vector similarity metric that down-weights poor moves. This doesn’t need clustering the whole space of positions, and size matters more than tightness of the cluster. Yet despite having millions of data points it has been hard to find good clusters.
I’ve only done tests with $\ell = 2$, that is, on positions with two reasonable moves $m_{1}$ and $m_{2}$, similarly spaced in value, and all other moves bad. My model’s projections have fared OK in these tests—as could be expected in such simple and numerous cases from how it is fitted to begin with. But a surprise comes from how this is also the simplest test of IRT for chess, considering $m_{1}$ to be the right answer and $m_{2}$ and everything else wrong. Thus we can observe a composite item curve for $m_{1}$ from these positions. And the consistent result is not a sigmoid curve. Rather, it looks like the left half of the logistic curve, as if the inflection point of maximum slope would come at an Elo rating beyond any human player. Thus the only ability level discriminated by the “chess test” is perfection.
So the logistic law of IRT is out for chess. The logistic law of ratings works OK despite caveats here. The logistic law of value, despite being observed with incredible fidelity for each rating level in the above plots, has two more feet of clay. Ideally it should give me value conversion factors for each chess program so that my model could use one set of equations for all—and importantly, so it could pool all of the programs’ move-values together to make more-reliable projections.
But chess programs are not constrained by the law. They can do any post-processing they want of reported move values: so long as the rank order is preserved, nothing changes in the program’s playing behavior. The “calibrations” advertised by the Houdini chess program not only trip on the sliding scale but diverge from my own data for non-blitz chess at any point on it. Similar morphing of values by Komodo evidently causes the anomaly at the end of my earlier post on the “law.”
And second—where the scale slides away completely—the conversions don’t capture the different positioning of programs (and versions of the “same” program) on the landscape they share with human players. An unfortunate new cheating case last week has shown this most definitively. Thus I am resigned to having to re-jigger my equations and re-fit my model on re-run training data (a quarter million CPU core-hours per set, many thanks due to UB CCR) for each major new program release. And I wonder less at the need for continual re-centering of SAT scales.
Can you suggest a general solution to my sliding-scale problems?
I have skirted the issues of SAT and GRE re-scaling per se. The report on re-centering linked just above acknowledges large shifts in the population. One attraction of using chess is that the rating system gives a fixed benchmark and—per my joint-work evidence—has remained remarkably stable for the population at world level. Can the non-sliding standards in chess be leveraged to transfer deductions about distributions to general testing?
A further problem is that we treat both grade points and chess ratings as linear. Raising a C+ to a B- has the same effect on one’s GPA as raising an A- to an A. A 10-player chess tournament needing to raise its average rating by 3 points to reach the next category can get it equally from the bottom player raising 2210 to 2240 as the top player raising 2610 to 2640. Yet the latter lifts seem harder to achieve. Perhaps more aspects of the scale need plumbing before discussing how it slides.
[changed first word of last main section to “Doubting”]