Famous Mathematicians
Niels Abel is of course a famous mathematician from the 19th century. Many mathematical objects have been named after him, including a type of group. My favorites, besides groups, are: Abel’s binomial theorem, Abel’s functions, and Abel’s summation formula. Not to mention the prize named after him, for which we congratulate Robert Langlands.
Today we will talk about commutative groups and a simple result concerning them.
A commutative group is one where the order of multiplication does not affect the value. More formally, for each $a$ and $b$ in the group, $ab = ba$.
A group with this property is called an abelian group. Following our friends at Wikipedia we note that “abelian” is correctly spelled:
Among mathematical adjectives derived from the proper name of a mathematician, the word “abelian” is rare in that it is often spelled with a lowercase “a”, rather than an uppercase “A”, indicating how ubiquitous the concept is in modern mathematics.
The following is a classic but easy fact from group theory.
Lemma 1 Suppose that all the elements of a group $G$ have order at most $2$. Then $G$ is abelian.
For instance, it is one of the first exercises in the famous John Rose book, A Course on Group Theory.
What struck me recently is: could this lemma be optimal? Why do we require all elements to have order at most $2$? Why not change “all” to “most”?
My instant idea was to search for a reference via Google. But at first I could not find anything relevant so I decided to do it myself.
So let’s see what we can prove. Since we need some facts about commutative laws in groups, let’s look at a proof of the above lemma.
Proof: Suppose that $x$ and $y$ are in a group $G$ and that $x^2 = 1$ and $y^2 = 1$ and $(xy)^2 = 1$. Then
$$yx = x^2(yx)y^2 = x(xyxy)y = x(xy)^2y = xy.$$
This shows that $G$ is abelian, since $x$ and $y$ are arbitrary elements. $\Box$
This proof is “local” in the sense of involving only $x$ and $y$, though they range over the whole group. It really is the following rule:
Commute Rule: Let $x$ and $y$ be such that $x^2 = 1$, $y^2 = 1$, and $(xy)^2 = 1$. Then $xy = yx$.
Our next insight is that we need a way to bound how often $xy = yx$ can hold if a group is not abelian. Luckily this is well studied:
Lemma 2 Suppose that $G$ is a non-abelian finite group. Define $p(G)$ as the probability that two randomly chosen elements $x$ and $y$ from $G$ satisfy $xy = yx$. Then $p(G) \le \frac{5}{8}$.
Our plan is simple: let’s use the Commute Rule in conjunction with this lemma. Here is our argument.
Proof: Let $S$ be a subset of the finite group $G$ such that $x^2 = 1$ for all $x$ in $S$. Now pick $x$ and $y$ randomly. If $x$ and $y$ and $xy$ are all in $S$ it follows by the Commute Rule that $xy = yx$.
Let $q$ be the probability that $x$ and $y$ and $xy$ are all in $S$. Thus if $G$ is not abelian it follows by Lemma 2 that
$$q \le \frac{5}{8}.$$
This implies that $S$ is not too big.
Let’s bound $1 - q$. This is clearly by the union bound at most
$$\Pr[x \in S^c] + \Pr[y \in S^c] + \Pr[xy \in S^c].$$
Note, $S^c$ is the complement of the set $S$. Since $x$, $y$, and $xy$ are each uniformly distributed over $G$, each term equals $1 - \alpha$, where $\alpha = |S|/|G|$. Thus we get that $1 - q$ is at most
$$3(1 - \alpha).$$
Hence, $q$ is at least
$$1 - 3(1 - \alpha) = 3\alpha - 2.$$
This implies that
$$3\alpha - 2 \le \frac{5}{8},$$
and so that
$$3\alpha \le \frac{21}{8}.$$
Next,
$$\alpha \le \frac{21}{24} = \frac{7}{8}.$$
This implies finally that at most a $\frac{7}{8}$ fraction of the elements of $G$ can lie in $S$, which means that if the group is not abelian, at most $\frac{7}{8}$ of its elements can have order at most $2$.
Thus we have proved:
Theorem 3 Suppose that $G$ is a finite non-abelian group. Then at most $\frac{7}{8}|G|$ elements of $G$ can have order $2$.
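As a concrete sanity check (an aside of ours, not part of the argument), the dihedral group of order $8$ attains the $\frac{5}{8}$ commuting bound of Lemma 2 exactly, and $6$ of its $8$ elements have order at most $2$, matching the known optimal $\frac{3}{4}$ rather than our $\frac{7}{8}$:

```python
# Dihedral group of order 8: the pair (i, a) stands for r^i s^a, with
# r^4 = s^2 = 1 and the relation s r = r^{-1} s.
def mul(x, y):
    (i, a), (j, b) = x, y
    return ((i + (j if a == 0 else -j)) % 4, (a + b) % 2)

G = [(i, a) for i in range(4) for a in range(2)]
n = len(G)

# Fraction of ordered pairs that commute: exactly 5/8.
commuting = sum(mul(x, y) == mul(y, x) for x in G for y in G)
print(commuting / n**2)  # 0.625

# Elements of order at most 2: exactly 3/4 of the group.
small_order = [g for g in G if mul(g, g) == (0, 0)]
print(len(small_order) / n)  # 0.75
```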
Of course it should be clear that this simple argument must be known. Well, I eventually found via Google search that similar results were indeed long known. However, the proofs of these results were not so simple as the above—at least in my opinion. Of course I have always felt that “clarity” is just another word for “it’s the argument I wrote.”
Here are some of the earlier results. For starters it is known that the “correct” answer is $\frac{3}{4}$.
Here are two references. The abstract of the former says:
One of the first exercises in group theory is that a group in which all non-identity elements have order two (so-called involutions) is abelian. An almost equally easy exercise states that a finite group is abelian if at least $\frac{3}{4}$ of its elements have order two. This cannot be improved, as the dihedral group of order eight, as well as its direct product with any elementary abelian group, provides examples of groups in which the number of involutions is exactly one less than $\frac{3}{4}$ of the group order.
There is also the paper by Bin Fu, “Testing Group Commutativity in Constant Time,” which sharpened some old work with Zeke Zalcstein as I described in this post.
These references all use the full power of group theory. Note that our DIY argument merely substitutes using the hypotheses $x^2 = y^2 = (xy)^2 = 1$. Yet it really only misses the optimal answer by a small amount: recall we got $\frac{7}{8}$ against the true $\frac{3}{4}$. If all we want is to bound the fraction away from $1$, then we have succeeded.
More precisely we have shown that we can replace group theory knowledge by randomness arguments. This is a recurrent theme that we have seen before. OK—not all the group theory knowledge: the argument for the “$\frac{5}{8}$” lemma uses quotients and properties of cyclic groups. But it is simpler than the references for the optimal result. And Ken noticed something else.
Ken noticed that since the equations use no inverses, the DIY argument works equally well in a monoid.
What a monoid lacks compared to a group is inverses for every element. Commutative monoids are studied but they are usually called just that, not “abelian.” Most intriguingly, a notion of “almost commutative monoid” crops up in computing—in the theory of concurrent processes. It even has a simpler name with a Wikipedia page: “trace monoid.”
The DIY argument does imply the following: Let $\beta < 1$ be a number such that any monoid (of a certain kind) in which more than a $\beta$ fraction of pairs commute is commutative. Then in any non-commutative monoid (of that kind), the fraction of involutions is at most
$$\frac{2 + \beta}{3}.$$
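For concreteness (our own aside): plugging the group value $\beta = \frac{5}{8}$ into the bound $\frac{2+\beta}{3}$ from the union-bound argument recovers our $\frac{7}{8}$, as a quick exact-arithmetic check confirms:

```python
from fractions import Fraction

# The DIY bound: if a commuting-pair fraction above beta forces
# commutativity, then any non-commutative monoid has involution
# fraction at most (2 + beta) / 3.
def involution_bound(beta):
    return (2 + beta) / 3

print(involution_bound(Fraction(5, 8)))  # 7/8
```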
At first it was hard to Google for information about whether any such $\beta < 1$ is known. This was mainly because of extensive literature on monoids plus an involutive operation that acts on them. An example is the monoid of strings over an alphabet and the operation of reversing a string. You can make such monoids finite by taking strings modulo a Myhill-Nerode type equivalence relation based on a minimal deterministic finite automaton $M$ (for instance, identify two strings when they induce the same mapping on states of $M$). These hits shadowed ones about monoids having involutions as elements.
So we put our non-Google brains to work—and those had the feeling that over all monoids there is no such $\beta$. Ken thought he had a simple proof of this that involved making products of DFAs, but it drove up the proportion of commuting pairs only so far.
Finally Google found us two references, the former in 1999 giving non-commutative monoids with a high proportion of commuting pairs, and the latter in 2012 pushing that proportion arbitrarily close to $1$. The proof in the latter is not so simple.
We get $\frac{7}{8}$ and just miss the optimal answer, which is $\frac{3}{4}$. Our proof relies mostly on a well-known fact about groups and the rest is a probabilistic argument. Can we get the optimal result with a finer probabilistic argument? And what happens with probabilistic arguments over monoids?
Issues AlphaZero doesn’t need to deal with
Frederic Lord wrote a consequential doctoral dissertation at Princeton in 1951. He was already the director of statistical analysis for the Educational Testing Service, which was formed in Princeton in 1947. All the scoring of our SATs, GREs, and numerous other standardized tests has been influenced both by his application of classical test theory and his development in the dissertation of Item Response Theory (IRT).
Today we discuss IRT and issues of scaling that arise in my chess model. The main point is that the problems are ingrained, and beautiful observed regularities burnish them rather than fix them.
This post is long, but has other takeaways including how ability in chess identifies with scaling up the perception of value, yet how value may be a detour for training chess programs, and how the presence of logistic curves everywhere doesn’t mean your main quantities of interest will follow them.
The basic component of IRT is a curve $f(\theta)$ in which $\theta$ is a measure of aptitude or tendency and $f(\theta)$ is the expected test score of somebody described by $\theta$. Each item—for instance, a single question on a test or a reading of sentiment—has its own curve that looks like one of the following:
Each curve has two main parameters: the placement $b$ of its symmetry point on the $\theta$-axis and its slope $a$ at that point. The diagram shows all three curves centered at the origin, so $b = 0$, but this need not be so. Shifting a curve right lowers the expectation for every $\theta$ and represents a question being more difficult; shifting it left represents an easier item. The steeper the slope, the greater discrimination between levels of ability or tendency. A third parameter given equal standing by ETS is guessability. One could expect to score at least 20% on the old SAT (without the present wrong-answer penalty of 0.25) just by random guessing, so the curves might be given a lower asymptote of $c = 0.2$. Axiomatically this need not shift the expectation at $\theta = b$ up from 50% to 60%, but that is the effect of the popular logistic formula for the curves:
$$f(\theta) = c + \frac{1 - c}{1 + e^{-a(\theta - b)}}.$$
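The formula just described is the standard three-parameter logistic; here is a small sketch (with illustrative parameter values, not fitted ones) showing the guessing floor lifting the midpoint expectation from 50% to 60%:

```python
import math

# Three-parameter logistic item curve: a = discrimination (slope),
# b = difficulty (horizontal placement), c = guessing floor.
def item_curve(theta, a=1.0, b=0.0, c=0.2):
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# At theta = b the curve sits halfway between c and 1; with c = 0.2
# that is 0.6, the 50%-to-60% shift noted above.
print(item_curve(0.0))  # 0.6
```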
Our discussion starts with the scale of the $\theta$-axis. It is not in units of grade-point average or any medical reading. It presumes that the population has $\theta$ normally distributed around some mean $\mu$ with some standard deviation $\sigma$. The values $-2$ and $+2$ shown on the $\theta$-axis in the figure thus represent the “95%” interval around the mean. When aggregating large samples of test results one can infer this interval from the middle 95% of the scores.
This plus the translation invariance of the curves facilitate putting offerings of different tests (or editions of a test) on a common scoring scale. That’s why you’re not scored on the actual % of SAT or GRE questions you got right. We will, however, find other places in the mechanics of models where absolute values are desired.
We just posted about exactly this kind of S-shaped curve, where, however, $\theta$ represents the difference in rating between a chess player and one’s opponent. The value $f(\theta)$ still represents the scoring expectation of the player. The curve has a slope intended to confer a special meaning to differences on the standard Elo rating scale of chess ability: a difference of $400$ points corresponds to $10$-to-$1$ odds.
Incidentally, this figure from our previous post shows that it does not make too much difference whether the S-curve is logistic as above (red) or derived from the normal distribution (green); there is a well-known conversion factor of about $1.7$ between their slope units. Using the logistic version does not countermand the assumption that the population’s ability levels are normally distributed.
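The closeness of the two shapes is easy to verify numerically; the sketch below (our own check) compares the standard normal CDF with a logistic curve using the conventional scale factor of about 1.7:

```python
import math

def logistic(x):
    # Logistic ogive with the conventional 1.7 scaling factor.
    return 1 / (1 + math.exp(-1.7 * x))

def normal_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Over a wide range the two curves differ by at most about one percent.
gap = max(abs(logistic(t / 100) - normal_cdf(t / 100)) for t in range(-500, 501))
print(gap < 0.02)  # True
```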
The online playing site Chess.com maintains ratings for over 4.7 million players, ten times as many as the World Chess Federation, and it shows a mostly-normal distribution of ratings:
There are issues of skew: the right-hand tail is longer, higher-rated players play more games, and they are less likely to exit the population. The whole 4.7 million are skewed relative to humanity on the whole but one can also say this of SAT- and GRE-taking students. On the whole, the population assumptions of IRT apply.
The presence of an opponent differs from test-taking. There are “solitaire” versions of chess, and more broadly, compilations of chess-puzzle tests such as these by Chess.com. To be sure, the administration of those tests is not standardized. However, the whole “Intrinsic Ratings” idea of my chess model is that we can factor out the opponent by direct analysis of the quality of move choices made by “player $P$” in games. The administration of games in chess competitions is completely regular and draws consistent full attention from the players.
A second appearance of the S-shaped curve makes chess appear even more to conform to IRT. Amir Ban has argued that it is vital to chess programs. But we will see how the conformity is an illusion and how AlphaZero has exposed it as a digression. The curve has the same $y$-axis but a different $x$-axis, representing position value rather than player ability. Here is an example from my previous post about these curves:
The $x$-axis represents the advantage or disadvantage $v$ for “player $P$” in so-called centipawn units (here divided by 100 to mesh with the colloquial idea of being “a Pawn ahead” etc.) and the $y$-axis shows the scoring frequency from positions of a given value $v$. The curve has been symmetrized by plotting both for the player to move and for the player not to move, so $f(-v) = 1 - f(v)$. The $a$ and $b$ parameters (conforming to Wikipedia’s usage) are the same as in IRT. Here $c$ represents the frequency with which a player should have been checkmated but the opponent missed it; by symmetry the upper asymptote is not $1$ but $1 - c$ and represents the frequency of blowing a completely winning game. Note that this is real data from over 100,000 moves in all recorded games where both players were within 10 Elo points of the 2000 level—thus the incredibly good logistic fit has the force of natural law.
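In symbols the symmetrization says $f(-v) = 1 - f(v)$. Here is a small check of ours (with illustrative $a$ and $c$, not the fitted values) that a logistic curve with floor $c$ and ceiling $1 - c$ has this property:

```python
import math

# Symmetrized scoring-frequency curve: lower asymptote c (getting mated
# despite a "won" position for the opponent), upper asymptote 1 - c
# (blowing a completely winning game).
def score_freq(v, a=1.0, c=0.05):
    return c + (1 - 2 * c) / (1 + math.exp(-a * v))

# Symmetrization means f(-v) + f(v) = 1 for every v.
print(all(abs(score_freq(-v) + score_freq(v) - 1) < 1e-12
          for v in [0.0, 0.5, 1.5, 3.0]))  # True
```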
Cross-referencing the two curve diagrams and the observation that a superiority of 150 Elo points gives about 70% expectation leads to a meaningful conclusion:
For players in the region of 2000, having 150 points more ability is just like having an extra Pawn in your pocket.
This looks like a perfect correspondence between ability and advantage. But wait—there’s a catch: That’s only valid for players at the Elo 2000 level. The slope $a$, which governs the conversion, changes with the Elo rating $R$. So does $c$: weaker players blow more games. So the above “$a$” is really “$a(2000)$”. That’s where the sliding-scale problem enters:
The change in slope when drawn games are removed from the sample—indicative of games like Go and Shogi in which draws are rare—is even more pronounced:
Whereas the 70% prediction from a 150-point rating difference is valid everywhere on the scale, the value of a Pawn slides. Give an extra pawn to a tyro and it matters little. Give it to Magnus Carlsen, and even if you’re his challenger Fabiano Caruana, you may as well start thinking about the next game. The shifting slope is both the main correlate of skill and the conversion factor from the centipawn values given by chess programs. Skill can thus be boiled down to the rate of the conversion—the vividness of perception of value.
Why, then, say the value axis is a digression? Chess programmers put colossal effort into designing their evaluation functions and tuning them in thousands of trial games. Yet the real goal is not to find moves of highest value $v$ but rather moves giving the best expectation $y$ of winning the game.
Monte Carlo tree search (MCTS) as employed by AlphaZero bypasses $v$ and trains its network by sampling results of self-play to optimize $y$ directly. The “which $R$?” problem disappears because it uses its evolving self as the standard. Not only the public Leela Zero project but the latest “MCTS” release of the commercial Komodo chess program have gone this route. As explained neatly by Bram Cohen, evidently earlier Komodo versions got boxed in to non-optimal minima of the design space. Cutting out the “middleman” avoids creating such holes.
My chess model is purposed not to design a champion computer chess program but to measure flesh-and-blood humans (as hopefully staying apart from champion computer chess programs). So I must grapple with the dependence on the rating $R$ across all ratings. Moreover, the values $v$ output by chess programs are the only chess-specific data my model uses.
A key observation I made early on is that the average magnitude of differences between the value of the best move and the value of the played move depends not only on the player’s rating $R$ but also on the position value $v$. The higher $v$ is in absolute value, the higher are all these average differences—markedly so. One might expect higher differences from playing conservatively when well ahead (like “prevent defense” in football) and taking risks when well behind, but the data shows a clean affine-linear dependence clear down to $v = 0$. See the quartet of graphs midway through this post. Per evidence here, I treat this as a matter of perception needing correction to make the differences less dependent on $v$—and the post shows both that it flattens fairly well and makes tangible improvements.
The correction is, however, artificial, computationally cumbersome, and hard to explain. A more natural scaling seems evident from the last section’s curves: take the difference in expectations rather than raw values. Namely, use
$$\delta = f(v_1) - f(v_i),$$
where $v_1$ is the value of the best move and $v_i$ the value of the move played.
A glance at the logistic curves shows the desired effect of damping differences when $v$ is away from $0$, and damping them more the larger $|v|$ is. The problem, however, is that $f$ has to be $f_R$ for some rating level $R$. Which $R$ should it be?
I originally had a fourth reason for rejecting this approach: at the time, all these options gave inferior results to the device I was already using. This enhanced my feeling against using a “reference 2000 player” in particular. Now my model has more levers to pull and the logistic-curve ideas are competitive, but still not compelling.
A simpler instance is that I want to measure the amount of challenge a player creates for the opponent. My “intrinsic ” as it stands is primarily a measure of accuracy. It penalizes enterprising strategies, ones that the computer doing my data-gathering sees how to defang but a human opponent usually won’t. Having a “Challenge Created” measure that applies to any position (with the opponent to move) might even incentivize elite players to create more fight on the board.
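Such a measure would be an expected (scaled) loss of value over the opponent's likely replies. A toy sketch of ours (the probabilities and scaled losses below are made-up numbers, not model output) shows the mechanics of the expectation:

```python
# Hypothetical model probabilities p_i for the opponent's candidate moves
# and hypothetical scaled value losses delta_i relative to the best move.
p     = [0.60, 0.25, 0.10, 0.05]
delta = [0.00, 0.12, 0.35, 0.80]

# "Challenge Created" as an expected scaled loss.
cc = sum(pi * di for pi, di in zip(p, delta))
print(round(cc, 3))  # 0.105
```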
Since my model already generates probabilities $p_i$ for every possible move $m_i$ by $P$, the metric is well-defined by
$$CC = \sum_i p_i \delta_i,$$
namely the expected (scaled) loss of value in the position that the opponent was confronted with. Here $\delta_i$ (as opposed to in the last section) is from my rating-independent metric. But the $p_i$ depend on the rating $R$ used for the model, so the metric is really an ensemble $CC(R)$. Again the same choice of $R$ confronts us.
In work with Tamal Biswas covered here and here, we attempted to define a non-sliding measure in terms of “swings” in values of moves as the search deepens. This is still my desire but has been shackled by model-stability issues I’ve covered there and subsequently.
A related issue comes from my desire to test my model’s projected probabilities in positions that have been reached by many players across the spectrum of ratings. The SAT has no trouble here: the same question is faced by thousands of takers at the same time. But there is no such control in chess, and popular positions become “book” that many players—even amateurs—know. The weaker players have free knowledge of what the masters did—or nowadays of what computers say to do in them.
What I can do instead is cluster positions according to similar vectors of move values. It is also legit to test my model by clustering the vectors of probabilities it generates. The high dimension—the typical number of legal moves—can be reduced to a smaller one by a vector similarity metric that down-weights poor moves. This doesn’t need clustering the whole space of positions, and size matters more than tightness of the cluster. Yet despite having millions of data points it has been hard to find good clusters.
I’ve only done tests with the simplest kind of cluster, that is, on positions with two reasonable moves $m_1$ and $m_2$, similarly spaced in value, and all other moves bad. My model’s projections have fared OK in these tests—as could be expected in such simple and numerous cases from how it is fitted to begin with. But a surprise comes from how this is also the simplest test of IRT for chess, considering $m_1$ to be the right answer and $m_2$ and everything else wrong. Thus we can observe a composite item curve from these positions. And the consistent result is not a sigmoid curve. Rather, it looks like the left half of the logistic curve, as if the inflection point of maximum slope would come only at a rating beyond the top of the human scale. Thus the only ability level discriminated by the “chess test” is perfection.
So the logistic law of IRT is out for chess. The logistic law of ratings works OK despite caveats here. The logistic law of value, despite being observed with incredible fidelity for each rating level in the above plots, has two more feet of clay. Ideally it should give me value conversion factors for each chess program so that my model could use one set of equations for all—and importantly, so it could pool all of the programs’ move-values together to make more-reliable projections.
But chess programs are not constrained by the law. They can do any post-processing they want of reported move values: so long as the rank order is preserved, nothing changes in the program’s playing behavior. The “calibrations” advertised by the Houdini chess program not only trip on the sliding scale but diverge from my own data for non-blitz chess at any point on it. Similar morphing of values by Komodo evidently causes the anomaly at the end of my earlier post on the “law.”
And second—where the scale slides away completely—the conversions don’t capture the different positioning of programs (and versions of the “same” program) on the landscape they share with human players. An unfortunate new cheating case last week has shown this most definitively. Thus I am resigned to having to re-jigger my equations and re-fit my model on re-run training data (a quarter million CPU core-hours per set, many thanks due to UB CCR) for each major new program release. And I wonder less at the need for continual re-centering of SAT scales.
Can you suggest a general solution to my sliding-scale problems?
I have skirted the issues of SAT and GRE re-scaling per se. The report on re-centering linked just above acknowledges large shifts in the population. One attraction of using chess is that the rating system gives a fixed benchmark and—per my joint-work evidence—has remained remarkably stable for the population at world level. Can the non-sliding standards in chess be leveraged to transfer deductions about distributions to general testing?
A further problem is that we treat both grade points and chess ratings as linear. Raising a C+ to a B- has the same effect on one’s GPA as raising an A- to an A. A 10-player chess tournament needing to raise its average rating by 3 points to reach the next category can get it equally from the bottom player raising 2210 to 2240 as the top player raising 2610 to 2640. Yet the latter lifts seem harder to achieve. Perhaps more aspects of the scale need plumbing before discussing how it slides.
Can 2.3728639 be best?
Josh Alman is a graduate student at a technical school in the Boston area. He is working on matrix multiplication, among other problems in complexity theory, and is advised by Ryan Williams and Virginia Vassilevska Williams.
Today we comment on his recent paper with Virginia on the limits of approaches to matrix multiplication.
This paper is titled, “Limits on All Known Approaches to Matrix Multiplication,” and will appear in the 2018 FOCS. An earlier paper by them was presented at the 2018 Innovations in Theoretical Computer Science conference.
Since Volker Strassen’s famous result that matrix multiplication (MM) can be computed in less than cubic time, there has been great interest in discovering the best possible exponent. More precisely, what is the smallest $\omega$ so that MM can be done in time $O(n^{\omega + \epsilon})$ for every $\epsilon > 0$? A widely discussed conjecture is that $\omega$ can be taken as close to $2$ as we like. Of course the best currently known is higher—much higher?—and is equal to
$$2.3728639.$$
This is due to François Le Gall—see his paper entitled “Powers of Tensors and Fast Matrix Multiplication.”
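For reference, Strassen's original trick at the $2 \times 2$ level uses seven multiplications instead of eight, which applied recursively gives the exponent $\log_2 7 \approx 2.807$; here is a sketch of the base step:

```python
# One level of Strassen's algorithm: multiply two 2x2 matrices with
# 7 multiplications instead of 8.
def strassen_2x2(A, B):
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return ((m1 + m4 - m5 + m7, m3 + m5),
            (m2 + m4, m1 - m2 + m3 + m6))

print(strassen_2x2(((1, 2), (3, 4)), ((5, 6), (7, 8))))  # ((19, 22), (43, 50))
```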
The number as a continued fraction is
$$[2;\, 2,\, 1,\, 2,\, 6,\, 1,\, \dots].$$
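One can compute the expansion exactly by treating the decimal as the rational $23728639/10^7$; a short sketch of ours:

```python
from fractions import Fraction

# Continued-fraction expansion of a positive rational, computed exactly.
def continued_fraction(x, terms):
    out = []
    for _ in range(terms):
        a = int(x)        # integer part is the next term
        out.append(a)
        x = x - a
        if x == 0:
            break
        x = 1 / x         # exact reciprocal of the fractional part
    return out

print(continued_fraction(Fraction(23728639, 10**7), 6))  # [2, 2, 1, 2, 6, 1]
```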
Does this suggest any pattern to you? Can any number below 2.37 and above 2 be “the” number?
Since Strassen there have been several advances in reducing the value of $\omega$. We discussed the history here following our two–part presentation of the advances by Andrew Stothers and Virginia. There was a time when it was thought by many that $2.5$ might be a barrier. It is not. Note the cluster of points near 1980 in this graphic on Wikipedia’s matrix multiplication algorithms page:
It is poetic that after Don Coppersmith and Shmuel Winograd (CW) broke $2.5$ ever so slightly, Strassen himself made a larger break in 1986. Then CW in 1991 produced a running time of $O(n^{2.376})$ whose first three digits have stood all the past 27 years despite massive efforts. It seems $\omega$ is stuck at around $2.37$. This clearly suggests that maybe there is a reason that we cannot make further progress. So an idea is:
Are there limits to what we can prove about $\omega$?
This is precisely the focus of Josh and Virginia’s joint paper. They divide the known methods of proving better upper bounds on MM into categories according to techniques employed:
They show that the above methods cannot prove that $\omega = 2$. Namely, they show:
Theorem 1 There is an absolute constant $\epsilon > 0$ so that any proof that uses the above methods can only prove $\omega \le c$ for values $c$ above $2 + \epsilon$.
Their result is not the first result on the limitations of MM approaches, however. The first “limitation” result was by Noga Alon, Amir Shpilka, and Christopher Umans in a paper titled, “On Sunflowers and Matrix Multiplication.” That paper had the sobering message that several “hopeful” conjectures working toward $\omega = 2$ are incompatible with versions of the Sunflower Conjecture of Paul Erdős and Richard Rado that are widely believed. Here is a graphic from that paper on the implications it proves:
Rather than show how Josh and Virginia prove their limit results, perhaps it is better to explain what they have proved. As usual we will refer to their paper for the full details.
The point, as we see it, is this. Deciding how many operations are needed to compute the matrix product boils down to determining the rank of a certain tensor. A tensor has a rank like matrices. However, the rank is hard to compute and thus the direct calculation of the rank seems to be impossible. So the above methods starting with the Laser method are methods that try to control the difficulty of determining the rank of a tensor.
An analogy is this: Imagine you are searching some large space for an optimal value of a function. This can be complex both computationally and also from a proof point of view. That is, it may be hard to even prove that particular instances have a claimed value.
A natural approach that avoids these issues is to restrict the search space in some way. Of course whenever you restrict a search space you may lose some optimal values. The paper in question shows that the tradeoff for using restricted search spaces is indeed that you may lose some optimal values. But you gain because the search space is easier to understand. It may, however, be too easy: you may not be able to show that $\omega = 2$.
The strangest aspect to our eyes is that they prove the existence of $\epsilon$ but do not give an estimate for it. Here the earlier paper is highly informative. It observes that all previous advances could be made by analyzing the tri-linear forms (tensors)
$$\sum_{i,j} x_i y_j z_{i+j},$$
where the last subscript is taken modulo $N$ for some fixed $N$. In the simplest case where $N$ is prime (or a prime power), it shows that any $\omega$ achieved via similar analysis of these forms is bounded below by an explicit function of $N$, defined via the unique real solution in $(0,1)$ of an associated equation. These lower bounds are explicit, but they tend to $2$ as $N$ grows, so this didn’t rule out getting $\omega = 2$ by these means.
The new paper shows that getting upper bounds that are similarly asymptotic to $2$ by these (and other) methods is impossible. The proof feels like working by contradiction from this supposition. Josh and Virginia tell us that in the case of only the Galactic method applied to extended CW tensors, extracting estimates without trying to optimize them already gives a lower limit strictly above $2$.
Even an optimized estimate, however, would not rule out the “true” $\omega$ being much higher. The full paper when it is out will shed further light. For now, purely as a joke, we did a quadratic regression on the Wikipedia figure’s data points for $\omega$ since 1968; the regression says the minimum would have come in the year 2004. Oh well.
What is the best value for $\omega$? Do these limit results really help us determine a value and understand why a particular value—greater than $2$—might be the barrier?
A riff on writing style and rating systems
Mark Glickman is a statistician at Harvard University. With Jason Brown of Dalhousie University and Ryan Song also of Harvard—we’ll call them GBS—he has used musical stylometry to resolve questions about which Beatle wrote which parts of which songs. He is also a nonpareil designer of rating systems for chess and other games and sports.
Today we discuss wider issues and challenges arising from this kind of work.
In fact, we’ll pose a challenge right away. Let’s call it The GLL Challenge. Many posts on this blog have both our names. In most of them the writing is split quite evenly. Others like this are by just one of us. Can you find regularities in the style of the single-author ones and match them up to parts of the joint ones?
Most Beatles songs have single authors, but some were joint. Almost all the joint ones were between John Lennon and Paul McCartney, and in a number of those there are different accounts of who wrote what and how much. Here are examples of how GBS weighed in:
To convey how it works, let’s go back to the GLL Challenge. I tend to use longer words and sentences, often chaining further thoughts within a sentence when I could have stopped it at the comma. The simplest approach is just to treat my sole posts as “bags of words” and average their length. Do the same for Dick’s, and then compare blocks of the joint posts. The wider the gap you find in our sole writings, the more confidently you can ascribe blocks of our joint posts that approach one of our word-length means or the other.
For greater sophistication, you might count cases of two consecutive multisyllabic words, especially when a simple word like “long” could have replaced the second one. Then you are bagging the pairs of words while discarding information about sentence structure and sequencing. An opposite approach would be to model the probability of a word of length $\ell$ following a whole sequence of words of lengths $\ell_1, \ell_2, \dots, \ell_k$. This retains sequencing information even if $k$ is small because one sequence is chained to the previous one.
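The simplest bag-of-words baseline can be sketched in a few lines (the sample texts below are placeholders of ours, purely to show the mechanics):

```python
# Average word length per author, then score a disputed block by which
# author's mean it sits closer to.
def mean_word_length(text):
    words = text.split()
    return sum(len(w) for w in words) / len(words)

author_a = "short plain words keep the post brisk and clear"
author_b = "labyrinthine clauses accumulate multisyllabic terminology relentlessly"
disputed = "compact words move fast"

ma, mb, md = map(mean_word_length, (author_a, author_b, disputed))
print("A" if abs(md - ma) < abs(md - mb) else "B")  # A
```

The wider the gap between the two means, the more confident this nearest-mean call is, which is exactly the point made above.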
GBS counted pairs—that is, transitions from one note or chord to another—but did not analyze whole musical phrases. The foremost factor, highlighted in lots of popular coverage this past month, is that McCartney’s transitions jump around whereas Lennon’s stay closer to medieval chant. Although GBS covered songs from 1962–1966 only, the contrast survives in post-1970 songs such as Lennon’s “Imagine” and “Woman” versus McCartney’s “Live and Let Die” and the refrain of “Band on the Run.”
To my ears, the verses of the last creep like Lennon, whereas Lennon’s “Watching the Wheels” has swoops like McCartney. Back when they collaborated they may have taken leaves from each other, as I sometimes channel Dick. The NPR segment ended with a query by Scott Simon about collaborative imitation to Keith Devlin, who replied:
For sure. And that’s why it’s hard for the human ear to tell the thing apart. It’s also hard for them to realize who did it and this is why actually the only reliable answer is the mathematics because no matter how much people collaborate, they’re still the same people, and they have their preferences without realizing it. [Lennon’s and McCartney’s] things come together—that works—but they were still separate little bits. The mathematics isolates those little bits that are unique to the two people.
GBS isolated 149 bits that built a confident distinguisher of Lennon versus McCartney. This raises the specter of AI revealing more about us than we ourselves can plumb, let alone already know. It leads to the wider matter of models for personnel evaluation—rating the quality of performance—and keeping them explainable.
Glickman created the rating system Glicko and partnered in the design of URS, the Universal Rating System. Rather than present them in detail we will talk about the problems they intend to solve.
The purpose is to predict how a player will do against an opponent from the difference $d$ between their ratings. Here $f(d)$ gives the probability for the player to win, or more generally the percentage score expectation over a series of games. The function $f$ should obey the following axioms:
The last says that the marginal value of extra skill tails off the more one is already superior to one’s opponent. Together these say $f$ is some kind of sigmoidal curve, like the red or green curve in this graphic from the “Elo Win Probability Calculator” page:
To use the calculator, pop in the difference as $d$, choose the red curve (for US ratings) or green curve (for international ratings), and out pops the expectation $E$. What could be simpler? Such simplicity and elegance go together. But the paradox—a kind of “Murphy’s Law”—is:
Unless the players are equally rated, the projection is certainly wrong. It overestimates the chances of the stronger player. Moreover, every projection system that obeys the above axioms has the same defect.
Here’s why: We do not know each rating exactly. Hence their difference $d$ likewise comes with an uncertainty component. Thus our projection really needs to average $f$ over a range of values of $d$. However, because $f$ is concave for $d > 0$, all such averages will be below $f(d)$.
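The Jensen-style argument above is easy to check numerically. As a sketch, we take the standard Elo logistic formula as a stand-in for the red and green curves (the particular $\sigma$ value is illustrative, not from the post), and compare the point prediction with its average over a noisy rating difference:

```python
# A sketch of the overconfidence effect described above. We take the logistic
# curve f(d) = 1/(1 + 10^(-d/400)), the standard Elo expectation formula,
# as our sigmoid, and compare the point prediction f(d) with the average of
# f over a spread of rating differences centered on d. Because f is concave
# for d > 0, Jensen's inequality says the average must come out lower.
import random

def f(d):
    """Expected score for the stronger player at rating difference d."""
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

random.seed(42)
d = 200.0          # nominal rating difference
sigma = 100.0      # uncertainty in the difference (an illustrative value)
samples = [f(d + random.gauss(0.0, sigma)) for _ in range(100_000)]
averaged = sum(samples) / len(samples)

print(f"point prediction f({d:.0f}) = {f(d):.4f}")
print(f"average over uncertainty  = {averaged:.4f}")  # strictly smaller
```

Running this shows the averaged projection sitting visibly below the pinpoint projection, which is the systematic overestimate the post describes.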
We might think we can evade this issue by using the curves
This shifts the original curve left by $u$ and right by $u$ and averages the two shifts. Provided $u$ is not too big, the result $f_u$ is another sigmoid curve. Now define $g$ by aggregating the functions $f_u$, say over $u$ normally distributed around $0$. Have we solved the problem? No: $g$ still needs to obey the axioms. It still has sigmoid shape, concave for $d > 0$. Thus $g(d)$ will still be too high for $d > 0$ and too low for $d < 0$. The following "Law"—whom to name it for?—tries not to be hyperbolic:
All simple and elegant prediction models are overconfident.
Indeed, Glickman’s own explanation on page 11 of his survey paper, “A Comprehensive Guide to Chess Ratings,” is philosophically general:
At first, this consistent overestimation of the expected score formula may seem surprising [but] it is actually a statistical property of the expected score formula.
To paraphrase what he says next: In a world with total ignorance of playing skill, we would have to put $E = \tfrac{1}{2}$ for every game. Any curve $f$ comes from a model purporting pinpoint knowledge of playing skill. Our real world is somewhere between such knowledge and ignorance. Hence we always get some interpolation of $f$ and the flat line $E = \tfrac{1}{2}$. In chess this is really an issue: although both the red and green curves project a 200-point difference to give almost 76% expectation to the stronger player, observed results are about 72% (see Figure 6 in the survey).
The Glicko system addresses this problem by giving every player a rating and an uncertainty parameter (the “ratings deviation”). Instead of creating curves like $f_u$ and $g$, it keeps a separate uncertainty parameter for each player. This solves the problem by making the prediction a function of the uncertainty as well as of $d$, with optional further dependence on how the “glob” may skew as $d$ grows into the tail of high outliers and on other dynamics of the population of rated players.
However, Newton’s laws behave as though bodies have pinpoint mass values at their centers of gravity, no matter how the mass may “glob” around it. Trying to capture an inverse-square law for chess ratings leads to a curious calculation. Put
for . Taking gives and allows gluing . Simplifying gives a fraction with denominator and numerator given by
Then taking cancels out the two bigger terms in the constant part, leaving the numerator as
David Mumford and John Tate, in their 2015 obituary for Alexander Grothendieck, motivated Grothendieck’s use of nilpotent elements via situations where one can consider $\epsilon^2$ to be truly negligible—that is, to put $\epsilon^2 = 0$.
Here we have an ostensibly better situation: In our original expression for , the coefficient of has to stay pretty small. The linear term for has coefficient and the term has . Thus if we could work in an algebra where
then the pinpoint value and all averages over uncertainty would exactly agree. No separate uncertainty parameter would be needed.
Alas, insofar as the real world runs on real algebra rather than Grothendieck algebra, we have to keep the numerator and the denominator. One can choose constants to approximate the above green or red chess rating curves in various ways, and then compare the discrepancy for various combinations of ratings and uncertainties. The discrepancies for my “Newtonian” curves tend to be about twice as great as for the standard curves. That is too bad. But I still wonder whether the above calculation of the prediction discrepancy—and its curious feature—has further uses.
What will AI be able to tell from our “track records” that we cannot?
Several theories of test-taking postulate a sigmoid relationship between a student’s ability and his/her likelihood of getting a given exam question right. Changing the difficulty of the question shifts the curve left or right. For a multiple-choice question with $c$ choices the floor might be $\tfrac{1}{c}$ rather than $0$ to allow for “guessing,” but otherwise similar axioms hold. Inverting the various curves gives a grading rubric for the exam. Do outcomes tend to be bunched toward the middle more than predicted? Are exam “ratings” (that is, grades) robust enough—as chess ratings are—to tell?
Aggregating the curves for various questions on an exam involves computing weighted averages of logistic curves. Is there literature on mathematical properties of the space of such averaged curves? Is there a theory of handling discrepancy terms like mine above?
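As a small illustration of such an aggregate, here is a sketch using the common "3-parameter logistic" form from item response theory (our choice of formula, not necessarily the theories the post has in mind): each item's curve floors at the guessing rate $1/c$, and the exam curve is the average of the item curves.

```python
# A toy version of the test-taking model sketched above. An item with
# difficulty b and c answer choices is answered correctly with probability
#     P(theta) = 1/c + (1 - 1/c) * sigmoid(theta - b),
# so the curve floors at the guessing rate 1/c instead of 0.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def item_curve(theta, difficulty, choices):
    guess = 1.0 / choices
    return guess + (1.0 - guess) * sigmoid(theta - difficulty)

def exam_curve(theta, items):
    """Expected fraction correct: the average of the per-item curves."""
    return sum(item_curve(theta, b, c) for b, c in items) / len(items)

items = [(-1.0, 4), (0.0, 4), (1.0, 5)]   # (difficulty, number of choices)
weak, strong = exam_curve(-3.0, items), exam_curve(3.0, items)
print(weak, strong)   # floors near the guessing rates, approaches 1.0
```

The averaged curve is again sigmoid-like but need not be a logistic, which is exactly the kind of closure question raised above.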
[some word tweaks and typo fixes]
Alexei Miasnikov, Alexander Ushakov, and Dong Wook Won are the authors of a brilliant paper, “Power Circuits, Exponential Algebra, and Time Complexity.” It and several followup papers give polynomial-time algorithms for problems in group theory where only highly exponential algorithms were previously known. The algorithms use a new scheme for manipulating highly succinct representations of certain huge integers. Regarding this, the first followup paper says:
In this sense our paper is more about compression and data structures than about algorithmic group theory.
Today Ken and I want to discuss the compression method called power circuits.
Their paper is on a topic near and dear to those of us working in computational complexity: the use of Boolean circuits to represent large objects in a succinct way. That is, to represent an object of $n$ bits in a manner that requires many fewer than $n$ bits. Their circuits are algebraic, but they don’t use large constants, so bit-measures apply.
Of course in general, by a standard bit-counting argument, this is impossible. But it can be possible for objects with special properties. The shock—to me at least—is that this new type of representation is very reasonable and yet it is one that complexity theorists missed. The representation was, we understand, created by group theorists in order to solve a problem they had. I guess it just shows again that
Necessity is the mother of invention.
See this for a discussion of the origin of this saying. William Horman included the Latin form ‘Mater artium necessitas’ in Vulgaria, his textbook of common phrases which he published in 1519.
The short explanation is that Miasnikov, Ushakov, and Won (MUW) needed a way to represent extremely large numbers. For example, they need to handle a number like
Note that is very large and that it is not sparse. But it is easy to see that is not equal to
Take the two numbers modulo , for example, the first is and the second is .
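The specific numbers are elided above, but the distinguishing trick is generic: two numbers far too large to write down in full can still be compared modulo a small number, since modular exponentiation by repeated squaring never stores more than a few digits. The pair below is a hypothetical stand-in, not the pair from the paper:

```python
# Hypothetical stand-in pair: N1 = 2^(2^30) + 1 and N2 = 2^(2^30) + 2.
# Neither fits in memory in binary, yet we can compare them mod 5.

M = 5  # any small modulus will do

# pow(base, exp, M) uses repeated squaring, so the 2^30-bit number N1 - 1
# is never materialized.
r = pow(2, 2 ** 30, M)
n1_mod, n2_mod = (r + 1) % M, (r + 2) % M

print(n1_mod, n2_mod)
assert n1_mod != n2_mod   # so N1 != N2, settled without writing either down
```

The same residue idea underlies checking equality of succinctly represented numbers in general, though the MUW equality test is far subtler.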
Complexity theorists love big numbers, like the above. The best way to describe such numbers is probably the arrow notation of Don Knuth. However, the group theorists MUW needed the ability to indirectly manipulate very large numbers. They needed the following:
Definition 1 A power circuit is a usual arithmetic circuit but with the allowed operations of $+$, $-$, and $x \cdot 2^y$, and the constants $0$ and $1$.
This means that a power circuit can compute the sum or difference of previous results. But it cannot do multiplication. It can, however, take a previous number $x$ and shift it by a previous value $y$. Thus it can compute in binary the number $x \cdot 2^y$, where $x$ and $y$ are previous results. In particular,
a tower of exponentials such as $2^{2^{2^{2}}}$ is easy to define in this representation. They introduce their new type of circuits and then prove the following cool theorem:
Theorem 2 There is a polynomial time algorithm that determines whether one power circuit represents the same number as another power circuit.
The surprise, in my opinion, is that they can represent and manipulate the binary encodings of many truly immense numbers. Think towers of $2$’s. That one can determine whether two such expressions are equal in polynomial time is quite neat.
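To see how few nodes such a representation needs, here is a naive evaluator for the power-circuit idea (our own toy encoding, not MUW's actual data structure; the whole point of theirs is to avoid evaluating, whereas we evaluate with Python big integers just to count nodes):

```python
# Each node is a constant, a sum or difference of earlier nodes, or
# x * 2^y for earlier nodes x and y.

def evaluate(circuit):
    """circuit: list of nodes; each node is ('const', v), ('add', i, j),
    ('sub', i, j), or ('shift', i, j) meaning value[i] * 2**value[j]."""
    vals = []
    for node in circuit:
        op = node[0]
        if op == 'const':
            vals.append(node[1])
        elif op == 'add':
            vals.append(vals[node[1]] + vals[node[2]])
        elif op == 'sub':
            vals.append(vals[node[1]] - vals[node[2]])
        elif op == 'shift':
            vals.append(vals[node[1]] * 2 ** vals[node[2]])
    return vals[-1]

# A tower 2^2^2^2 = 65536 from just five nodes: start with 1 and shift
# repeatedly, since 2^y == 1 * 2^y.
tower = [
    ('const', 1),        # node 0: 1
    ('shift', 0, 0),     # node 1: 1 * 2^1 = 2
    ('shift', 0, 1),     # node 2: 1 * 2^2 = 4
    ('shift', 0, 2),     # node 3: 2^4 = 16
    ('shift', 0, 3),     # node 4: 2^16 = 65536
]
print(evaluate(tower))   # -> 65536
```

One more shift node would give $2^{65536}$, and each further node doubles the height of the tower; the circuit size grows linearly while the value grows non-elementarily.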
They then provide several interesting applications. They show that the quantifier-free theories of some exponential algebras are decidable in polynomial time, as well as the word problems in some “hard to crack” one-relator groups.
Let’s take a look at the one-relator problem, since my understanding is that this is the main motivation for the creation of this cool type of representation.
Emil Post and Andrey Markov Jr. independently proved the following theorem:
Theorem 3 There exists a finitely presented monoid for which there is no algorithm to solve the word problem.
A trouble that plagued this area for many years was that going from monoids to groups was tricky. A group is a much more fragile object than a monoid. This makes the encoding of computation into a group quite challenging. Finally Pyotr Novikov and William Boone showed there is a finitely presented group with an undecidable word problem. An indicator that this was a major result is that the original proof of Novikov is 143 pages long.
Of course the word problem is:
Given a finite presentation of some fixed group $G$, decide whether an input word represents the identity element.
Since Novikov-Boone showed that the general word problem is undecidable, one quest is to find classes of groups that have decidable word problems. An important result concerns the case where the presentation has one and only one relation:
These groups are not surprisingly called one-relator groups. See this post for a great discussion about a famous theorem by Wilhelm Magnus:
Theorem 4 The word problem for a one-relator group is decidable.
Note, this was proved in 1930 before the modern theory of unsolvability. It is possible to state positive computability theorems without any theory, but to state uncomputability theorems requires the whole modern theory of what is computable.
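To make "word problem" concrete, here is the easiest possible case, with zero relators rather than one: in a free group, a word represents the identity exactly when free cancellation reduces it to the empty word. This is a toy warm-up of our own, not Magnus's algorithm; a single relator already makes matters vastly harder.

```python
# Word problem for a free group: reduce by cancelling adjacent inverse
# pairs (x x^-1 or x^-1 x). We encode a generator's inverse as its
# uppercase letter, so 'A' means a^-1.

def free_reduce(word):
    """word: list of letters like 'a' or 'A' (capital = inverse)."""
    stack = []
    for g in word:
        if stack and stack[-1] == g.swapcase():
            stack.pop()          # cancel an adjacent inverse pair
        else:
            stack.append(g)
    return stack

def is_identity(word):
    return free_reduce(word) == []

print(is_identity(list("abBA")))   # a b b^-1 a^-1 -> True
print(is_identity(list("abAB")))   # the commutator [a,b] -> False
```

The stack makes this linear time; the interest of the one-relator case is precisely that no such simple normal-form procedure is apparent.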
The following one-relator group—the Baumslag-Gersten group—has a solvable word problem, by Magnus’s theorem, but until recently it was not known to have a polynomial-time word problem.
Actually it is
as a one-relator group presentation. The time complexity of the algorithm implicit in Magnus’s theorem is terrible—more than a tower of exponentials. It was a breakthrough when this was reduced to exponential time. But MUW used the power circuit idea to reduce the cost of the word problem for this group to polynomial time. Wonderful.
The running times are also “improvable.” Originally they had for this word problem, but a followup paper reduced the time to . The same paper gives an time algorithm for the word problem in a 4-relator group devised by Graham Higman:
These results are not completely ad hoc—they come with general theorems about decidability of certain quantifier-free theories. But the theory needs tuning for the application. The authors of this paper needed a different kind of compression circuit to tackle group word problems involving Ackermann growth rather than merely tower growth. They note, however, that their circuits have fewer closure properties than power circuits. The closure issue seems at first to be a paradox. Why not just add multiplication to power circuits? The answer is that adding it is incompatible with the ability to tell feasibly whether two power circuits are equal. If we allow multiplication then it is impossible to keep this critical property.
Power circuits have an even stranger limitation. For any fixed $k$ there are numbers that are multiples of $k$ and yet have no power circuit of size less than a tower of 2’s. By diagonalizing one can get super-elementary growth. Thus it is not feasible to solve simple linear equations using this model—see their paper for details.
The moral is that MUW tailored the definition of power circuits to include exactly what they needed to solve their group word problem. They left out multiplication because they did not need it; and they left it out because with it they could not solve their group word problem. Brilliant.
Why did we—computer theorists—completely miss this idea? Why indeed.
A mark of maturing literature for power circuits is their inclusion in a collection that serves as a textbook. To track their development, please use the search
power circuits and one relator group
to avoid hitting electrical engineering papers.
A great choice
Cropped from 2016 KTH grant news source |
Johan Håstad is the winner of the 2018 Donald E. Knuth Prize. We were going to keep you in suspense, but one of our “blog invariants” is to lead with the name(s) of those who are featured. He is very well deserving. And he has proved many wonderful theorems.
Today the entire GLL editorial staff would like to thank the committee for their great selection.
They were Allan Borodin, Alan Frieze, Avrim Blum, Shafi Goldwasser, Noam Nisan, and Shanghua Teng (chair). It must be hard to select a Knuth Prize winner, because there are so many strong candidates. So many. This year’s choice is an excellent one.
The Knuth Prize is “for outstanding contributions to the foundations of computer science.” The description used to mention that educational contributions such as textbooks and students could be considered as part of the impact. Usually one thinks of “education” as being for aspiring students or for the public outside of us researchers. But it strikes us that Johan has been pre-eminent for educating many currently within the field on what to pursue.
We draw our feeling on Johan’s role from the first two paragraphs about his “transformative techniques” in the long-form Knuth Prize citation. It first describes his famous 1986 “Switching Lemma” for lower bounds on the parity function. The first super-polynomial lower bounds on constant-depth Boolean circuits (of unbounded fan-in) had been shown in 1981 by Merrick Furst, James Saxe, and Mike Sipser. Andy Yao in 1985 obtained exponential size lower bounds that were strong enough to give oracles separating the polynomial hierarchy from polynomial space. But Johan sharpened the exponent using what has remained the best technique—even this 2018 extension tells us so.
His lemma concerns how a random assignment to many of the variables of a DNF causes it to become much simpler. See here for a modern explanation of the details. His original proof was a great achievement through its handling of the probabilistic method. The details were quite delicate because of the need to control certain technical issues. In any probabilistic proof one must be careful not to destroy independence, since keeping certain events independent is vital to make such proofs work. Johan’s original proof worked hard at this. More recent proofs have been able to skirt some of the vicissitudes, but they all prove the following pretty result:
Theorem 1 Any Boolean circuit of depth $d$ for parity requires size at least
$2^{c\,n^{1/(d-1)}}$ for some positive constant $c$.
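One intuition for why the restriction method needs parity in particular: a random restriction collapses small DNFs (that is the Switching Lemma), but it never collapses parity, since fixing any subset of the inputs leaves parity or its negation of the live variables. A small brute-force check of that fact (our own illustration, not part of the lemma's proof):

```python
# Fixing any 5 of 8 inputs of parity leaves parity of the other 3,
# up to a constant offset contributed by the fixed bits.
import itertools
import random

def parity(bits):
    return sum(bits) % 2

random.seed(1)
n = 8
fixed = {i: random.randint(0, 1) for i in random.sample(range(n), 5)}
live = [i for i in range(n) if i not in fixed]

offset = sum(fixed.values()) % 2
for assignment in itertools.product([0, 1], repeat=len(live)):
    full = [fixed[i] if i in fixed else assignment[live.index(i)]
            for i in range(n)]
    assert parity(full) == (parity(assignment) + offset) % 2

print("restricted parity is parity of the live variables, up to negation")
```

So a restriction that simplifies the circuit to a small decision tree cannot also be computing parity unless essentially all variables were fixed, which is how the lower bound is extracted.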
The citation’s next paragraph describes Johan’s relation to the PCP Theorem. Now he is absent from that Wikipedia page and from a popular illustrated history. But he refined it to plutonium. This did not come suddenly with the 1996 paper, “Clique is hard to approximate within $n^{1-\epsilon}$” and its Gödel Prize-winning 1997 successor, “Some Optimal Inapproximability Results” mentioned in the citation. I (Ken adding to what Dick wrote) recollect associating ‘Håstad’ to the importance of “free bits” earlier in the 1990s and the influence of his FOCS 1993 paper, “The Shrinkage Exponent is 2.” This memory may include a conflation of the influence on Håstad of Ran Raz’s STOC 1995 paper on the parallel repetition theorem. In any event, the power of finding best-possible results shines through in the citation:
As complexity-theoretical breakthroughs, Håstad constructed almost optimal PCPs, where optimality holds with respect to parameters such as amortized free-bit complexity and total number of queries. He then established optimal “approximation resistance” of various constraint satisfaction problems—namely that one cannot do better in terms of worst-case performance ratio than the basic input-oblivious algorithm that simply picks a random assignment to the variables. These PCPs led to optimal inapproximability results for MaxClique, MaxLin2, and Max3SAT as well as to the best known hardness results for approximating other optimization problems such as MaxCut. His proofs introduced a treasure trove of ideas—in particular the Fourier analytic techniques—that has influenced nearly all subsequent work in hardness of approximation.
The citation’s third paragraph notes his role in proving the polynomial-time equivalence of one-way functions and pseudorandom generators. The history is that Russell Impagliazzo, Leonid Levin, and Mike Luby proved this first for nonuniform circuits—I remember a seminar talk by Russ at Cornell in late 1986 or 1987 when this was still in process. Then Johan for STOC 1990 saw how to push it to work in the uniform case.
What more is known about the constant in the lower bounds for parity, as grows? We find (all ) here, and a simple upper bound asymptotic to is easy to see as done here.
Johan recently pushed the lower bounds on parity in a different direction, improving the maximum such that circuits of depth and size can achieve agreement with the parity function. What further usefulness in complexity theory might pushing this have? In any event we all thank him for his beautiful work and congratulate him on winning this year’s Knuth Prize.
[some word changes]
By permission of Rich Longmore, artist, blog source |
Alan Turing presaged Stephen Cook’s proof of the $\mathsf{NP}$-completeness of satisfiability. Turing reduced the halting problem for his machines to the decision problem of the first-order predicate calculus, thus showing (alongside Alonzo Church) that the latter is undecidable. Cook imitated Turing’s reduction at the level of propositional calculus and with a nondeterministic machine.
Today we present a version of Turing’s proof and show how it answers a side issue raised in our previous post.
Completely aside from the ongoing work by Harvey Friedman previewed in that post, we—that is Dick and Ken—were interested in the boundary between decidable and undecidable theories in logic. We noted Michael Rabin’s theorem that the theory of the rationals with order is decidable with first-order quantifiers and second-order quantifiers allowed only over unary predicates—that is, over subsets of $\mathbb{Q}$. We sketched how quantifiers over ternary predicates—that is, over subsets of $\mathbb{Q}^3$—enable defining addition and multiplication, from which undecidability follows by the mentioned theorem of Julia Robinson.
This left the binary case. Most particularly, we were interested in the decision problem for second-order sentences of the form
where the only predicate symbols in the matrix (besides $<$ and $=$) are the existentially quantified binary predicates, and the matrix has only first-order quantifiers. We were given references to papers by László Kalmár and William Quine, but judging from this by Church and Quine (see page 183 and also this and this plus these slides) we wonder what extra is involved, for instance pairing. Our search for a crisp and brief proof led back to Turing’s first-order theorem.
Our starting point is to build a predicate $C(x,y)$ so that $C$ will define a set with properties needed of the integers. Our intent is to make $C$ define a kind of ceiling function: $C(x,y)$ should hold when $y$ is the least “integer” above $x$. The first rules we write say that $C$ is a functional relationship:
and
Thus $C$ defines the desired function. The further rules we need are:
This means that the function is constant on an interval.
Now all we need is to define the successor relation via more rules and postulate an element to act as zero. The rules for successor are:
Among several ways to handle zero, we find it most thematic to leave it unquantified and assert the rule displayed. This will be on the left-hand side of implications of the form displayed. Any interpretation—that is, any assignment of a function for $C$ and a value for zero—that makes all the parts true will make zero the bottom of a stairway that has infinitely many steps above it. Let our base formula be the conjunction of all the above rules.
The last rule for the successor might seem to imply the stairway climbs to infinity. As we’ll discuss, this is not enough to make a single “stairway to heaven.” But it does provide that any angel who climbs some number $k$ of steps above zero will need exactly $k$ steps to get back down to zero.
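The displayed rules themselves were lost in transcription. As a hedged sketch, our own reconstruction of what rules of this kind typically look like (matching the surrounding prose about functionality and constancy on an interval, not necessarily the post's exact formulas) is:

```latex
% Reconstruction (ours) of plausible rules for the ceiling predicate C.
\begin{align*}
&\forall x\,\exists y\; C(x,y)
  && \text{(totality)}\\
&\forall x\,\forall y\,\forall z\;
  \bigl(C(x,y)\wedge C(x,z)\rightarrow y=z\bigr)
  && \text{(single-valuedness)}\\
&\forall x\,\forall y\;\bigl(C(x,y)\rightarrow \neg(y<x)\wedge C(y,y)\bigr)
  && \text{(the ceiling is a fixed point above $x$)}\\
&\forall x\,\forall y\,\forall u\;
  \bigl(C(x,y)\wedge \neg(u<x)\wedge \neg(y<u)\rightarrow C(u,y)\bigr)
  && \text{(constant on an interval)}
\end{align*}
```

Whatever the exact formulation, the effect is the same: the set of fixed points of $C$ behaves like a discrete staircase inside the dense order.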
Proofs of Turing’s theorem such as this implicitly use relations for appending bits to binary strings on Turing machine tapes, which numerically work out to simple arithmetic on the coded values. We will use a universal two-counter machine instead. We don’t need the arithmetic details of why it is universal, such as Marvin Minsky’s coding trick. The machine input is in unary, but we can just define it as a natural number $n$. To represent the current state and the counters at any time $t$, we use three more binary predicates:
We can write rules to make these functions of $t$. The program has a finite number of states and so we can (if we wish) use dedicated symbols for them. We can take the first as the start state. Following Minsky, we need only two kinds of instructions:
Each state has exactly one of these instructions except the halting state. The rules corresponding to these instructions all have the form
where is
In case the instruction for state is we have as
whereas in case the instruction is a conditional jump we get as
We conjoin all the machine rules into one finite formula. The conditions to start in the start state on a given input value are:
Here we mean the input to be a fixed natural number $n$, but we can’t say so literally, so what the formula includes are successor terms until we get to the fixed value. Finally, the condition to halt eventually in the halting state is:
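The two-counter machine model being encoded can itself be sketched in a few lines. The instruction encoding below is our own invention for illustration, not the post's notation, but it captures Minsky's two instruction kinds (increment, and decrement-or-jump-on-zero):

```python
# A minimal two-counter (Minsky) machine simulator.
#   ("inc", c, j)     : increment counter c, go to state j
#   ("decjz", c, j, k): if counter c == 0 go to state k,
#                       else decrement and go to state j
# State 0 is the start state; the pseudo-state "halt" stops the machine.

def run(program, c0=0, c1=0, max_steps=10_000):
    counters = [c0, c1]
    state = 0
    for _ in range(max_steps):
        if state == "halt":
            return counters
        instr = program[state]
        if instr[0] == "inc":
            _, c, j = instr
            counters[c] += 1
            state = j
        else:  # "decjz"
            _, c, j, k = instr
            if counters[c] == 0:
                state = k
            else:
                counters[c] -= 1
                state = j
    raise RuntimeError("did not halt within max_steps")

# Example: drain counter 0 into counter 1 (c1 := c1 + c0; c0 := 0).
ADD = {
    0: ("decjz", 0, 1, "halt"),  # if c0 == 0 halt, else decrement and go on
    1: ("inc", 1, 0),            # bump c1, loop back
}

print(run(ADD, c0=3, c1=4))  # -> [0, 7]
```

Each step of such a machine is a simple local update of (state, counter values), which is exactly what the universally quantified implications in the formula force from one time to the next.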
Now given any number , take
We want to decide whether this formula is valid. Note that it is first-order: we haven’t put in the existential quantifiers over the predicates yet. To be valid in our setting means that whatever relations on $\mathbb{Q}$ are used for the four predicates, and whatever value is used for zero, the resulting statement about rational numbers is true.
Theorem 1 The formula is valid if and only if the machine on input $n$ halts in the halting state. Hence first-order predicate calculus is undecidable.
Proof: First suppose the machine halts. We need only consider interpretations that make the hypothesis true. Such interpretations set up the initial configuration, with zero standing for zero, and make every instance of the universally quantified implications true. The chain of those instances eventually yields the halting condition at time $t$, where $t$ is the number of steps the machine takes to halt. Thus the conclusion, and hence the overall formula, are made true.
Conversely, suppose the formula is valid. Then we can pick the interpretation that makes $C$ the actual integer ceiling function and zero actually zero. This makes the hypothesis true, so the interpretation must make the conclusion true with time ranging over the positive integers, which means that the machine must halt.
We seem to have sacrificed some generality by basing things on $\mathbb{Q}$, but in fact we used nothing about the rationals except $<$ being a total order (including trichotomy). We could axiomatize those properties too. Alternatively, we could forgo equality in favor of having function symbols as in the Turing machine-based proof mentioned above.
Now we put the second-order quantifiers on the predicates in order to prove our desired theorem:
Theorem 2 Consider sentences of the form
where the matrix is first-order. The problem of deciding whether such a sentence is true in $\mathbb{Q}$ is undecidable.
A natural impulse is to take the same formula for a given input as before. This leads to a notable issue, one that Ronald Fagin reminded us of. If the machine halts, we get that the sentence (for that input) is true. Indeed—and this is a first hint that wires are crossed from the above proof—we can use the genuine integer ceiling function for that part and the rest follows.
Conversely, suppose the machine does not halt. The rub is that we can still make the sentence true. Take an interpretation such that the derived set, our set of “integers,” includes extra elements. Then each has a successor, but one is not the successor of anything. Although the values of the machine predicates are fixed for all times by actions of the machine, nothing prevents us from blithely defining the halting condition to be true at such an extra element. This makes the sentence true too.
We forgot about the rule that the staircase grows unboundedly—but it doesn’t help: we can just add the rest of the natural numbers to our set of “integers.” Attempts to fix this by running backwards from the halting state are forestalled by the possibility of extending the set downward as well—our axioms are too weak to prevent having infinitely many predecessors.
The upshot is that although we can will into existence multiple infinite staircases, we cannot guarantee a single staircase that goes to heaven. In broad terms, we cannot rule out taking a nonstandard integer number of steps to halt (for connections, start here). But in this situation we can fix the problem by a simple twist (used also here).
Proof: Define the sentence by asserting instead that the machine doesn’t halt. That is, take
If the machine doesn’t halt, then the standard assignment giving the integers and the true ceiling function works for that direction as well. The converse is now to suppose that the machine does halt in the halting state. We need to show that there is no way to instantiate the four binary predicates to make the resulting sentence true. By the start conditions it sets up the initial configuration. The inductive part leans only on features that the instantiation must make true. Thus successive configurations are forced by the implications to follow correctly, so that the halting condition is made true at some time that does belong to the staircase above zero. This immediately contradicts the non-halting clause. So the sentence is false.
Thus the sentence is true if and only if the machine does not halt.
This finally answers the last post’s desire for showing concretely that second-order existential is undecidable with binary predicates, needing no reference other than Turing (OK: and Minsky). We would still like to trace this historically through the above-mentioned references to Kalmár and Quine and perhaps others. In contrast to great past attention to minimizing the alternation and number of quantifiers, we are curious also about minimizing the number of binary predicates. We note how this 1984 paper by Warren Goldfarb gets the number of binary predicates down to one but at a greater cost in unary predicates, of which above we’ve used none. We are also not sure how our setting with domain and a fixed meaning for plays in the tradeoffs.
What is the best reference? Is there a nice and clean proof for the case of just one dyadic predicate? What other tradeoffs, such as between function symbols and predicates using equality (or exists-unique quantifiers), are in play here?
[added reference to Goldfarb paper and changed next sentence; changed picture at top with permission notice]
Harvey Friedman is a long-standing friend who is a world expert on proofs and the power of various logics. This part of mathematics is tightly connected to complexity theory. Our first mention of his work on this blog was a problem about the rationals that says nothing of “logic” or “proofs.”
Today Ken and I would like to introduce some of Harvey’s work in progress. Update 8/16/18: Harvey has posted a full draft paper, “Tangible Mathematical Incompleteness of ZFC.”
Harvey’s work is a kind of quest. Recall a quest is a long or arduous search for something. Ever since his 1967 PhD thesis, Harvey has been trying to find simple statements that are immensely hard to prove. This might seem to be a strange quest: Why look for hard statements; why not look for easy statements? Indeed. Since Kurt Gödel’s seminal work, logicians have been interested in finding natural statements that hide great complexity underneath. The measure of complexity is that the statements are true but cannot be proved in certain prominent logical theories.
Wikipedia has this to say about the famous theorem of Jeff Paris and Leo Harrington (with input from Laurence Kirby) that Peano Arithmetic (PA) cannot prove a certain strengthening of the finite Ramsey theorem:
This was the first “natural” example of a true statement about the integers that could be stated in the language of arithmetic, but not proved in Peano arithmetic; it was already known that such statements existed by Gödel’s first incompleteness theorem.
See also this for history and related examples including earlier work by Reuben Goodstein that was developed further by Kirby and Paris.
What if we can find statements that (i) use arguably simpler notions than arithmetic, yet (ii) are unprovable in systems such as ZFC set theory that are vastly stronger than PA? They might require systems beyond ZFC, such as adding plausible “large cardinal” axioms, to prove. The foundational goal of Harvey’s quest is to demonstrate that realms of higher set theory long regarded as wildly infinitistic and intangible connect algorithmically to concrete and intimately familiar combinatorial ideas. He does also mention motives in complexity, such as in this recent popular talk, and we will develop this aspect here.
Introductory theory courses show how to define sets that are computably enumerable but not decidable. The complement of such a set is then definable by a formula whose only unbounded quantifiers over numbers are universal. Such formulas, and the resulting sentences for any fixed value, are said to be $\Pi^0_1$. For instance, the set of numbers coding Turing machines that accept no inputs is defined by saying that for all $x$ and $y$, $y$ is not a valid accepting computation of the machine on input $x$. For every effective theory $T$, be it PA or ZFC or stronger, the set of cases that $T$ can prove is computably enumerable. Thus it cannot equal the complement set. Presuming $T$ does not prove any false cases, this makes the provable cases a proper subset of the true ones. Every true-but-unprovable case (and there are infinitely many) yields a $\Pi^0_1$ sentence that is true but not provable in $T$.
Indeed, one can construct particular values via Gödel’s diagonalization mechanism. The resulting sentence is simple in terms of quantifiers, but the value itself is definitely not simple. It bakes all the complexity of diagonalizing over proofs into the statement. Gödel further showed one can use a sentence expressing the consistency of $T$, but this too references $T$ and proofs explicitly.
The Paris-Harrington statements are natural, but they are not $\Pi^0_1$—that is, they have the form “(for all $x$)(there exists $y$)” where the body is first-order with bounded quantifiers only. They embody the idea that functions giving $y$ for these quantifiers must grow infinitely often faster than any function PA can prove to be computable. The statement is likewise $\Pi^0_2$. Some time back we wrote a post on independence and complexity that also covered work by Harvey with Florian Pelupessy related to Paris-Harrington. In 1991, Shai Ben-David and Shai Halevi proved a two-way connection to circuit complexity that applies when an effective first-order theory $T$ is augmented to a theory $T^*$ by adding all true $\Pi^0_1$ sentences over its language as axioms. They cited as folklore a theorem that $T$ and $T^*$ generally have the same sets of provably computable functions and said in consequence:
[I]f $\psi$ is a natural mathematical statement for which “$\psi$ is independent of PA” is provable in any of the known methods for establishing such results, then $\psi$ is independent of PA* as well. [Likewise for ZFC and any similar $T$ and its corresponding $T^*$.]
Well this is not in Harvey’s quest either. Here are several objections: Statements of Paris-Harrington type cannot be brought much below $\Pi^0_2$. The theory $T^*$ is not effective. Adding true numerical sentences may be benign for provable growth rates but not for other statements we want to prove. Although it is often kosher to assume particular statements like (forms of) the Riemann Hypothesis, allowing arbitrary ones begs the idea of proof. The situation is that either $T$ is stronger than PA and the functions grow so insanely as to strain naturalness, or $T$ is weaker but then the independence is weak too.
What we all want is to show independence from strong theories for simple and natural statements that, while not literally in $\Pi^0_1$ form, demonstrably have no more power. Then we may hope to apply his results to understand the hardness of statements at the ground level. According to Ben-David and Halevi, this will require ideas and constructions beyond “the known methods.” Harvey’s ideas begin by blending questions about the integers into questions about the rationals that leverage the latter’s dense order properties. Let’s turn to look at them now.
Consider the rationals $\mathbb{Q}$. How hard are statements about the rationals? It depends.
If we allow first order sentences that involve addition and multiplication over the rationals, then Julia Robinson proved, long ago, that the integers can be defined in this first order theory. This immediately implies the general unprovability results discussed above. We have covered this topic before here.
What is surprising, perhaps, is that there are complex statements that use only the order properties of the rationals. However, if we restrict ourselves to the first order theory of the rationals under the usual ordering <, then it has long been known that:
Theorem 1 The first order theory of (Q, <) is decidable.
Harvey's ideas use mildly second-order statements about the rationals—with little or no arithmetic besides the < relation—to get his hard statements. This may be surprising or may not—it depends on your viewpoint.
Michael Rabin further showed that the second-order monadic theory of (Q, <) is decidable (see also this). So we need more than quantifiers over sets of rationals. If one allows an existential quantifier over sets of triples, then it does not take much else to write down the rules for two relations to behave like addition and multiplication and use the quantifier to assert their existence, whereupon undecidability follows by Robinson's result. It seems to us that this paper gives a ready way to show the undecidability also for pairs.
Harvey uses predicates defining sets of tuples of rationals and a single quantifier over them. He eliminates that quantifier by expressing the formula body using satisfiability in first-order predicate calculus with equality, obtaining the schematic form
where the body uses only first-order quantifiers to talk about finite sets of tuples and is computable from the original formula. The satisfaction relation is equivalent to the negation of a derivable contradiction. Hence the statement is semantically equivalent to an arithmetical sentence. The point is that whereas the latter would encode proofs or computations, the former is simple and free of navel-gazing.
Harvey’s work is based on the standard notion of order equivalence:
Definition 2 Say x = (x_1, …, x_k) is order equivalent to y = (y_1, …, y_k) if x_i < x_j ⟺ y_i < y_j for all indices i and j. We use x ≈ y to denote this.
Theorem 3 Order equivalence on k-tuples is an equivalence relation and there are only finitely many equivalence classes, at most a polynomial in k!.
Of course, if the tuples are restricted to distinct rationals then there are exactly k! equivalence classes. The exact numbers of classes for each k when members can be equal are addressed in this paper.
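Order equivalence is easy to mechanize: two tuples are order equivalent exactly when their patterns of <, =, > agree at every index pair, which a dense-rank signature captures. A small Python sketch (the helper names are ours):

```python
from itertools import permutations, product

def signature(x):
    """Dense-rank pattern of a tuple: the rank of each entry among distinct values."""
    ranks = sorted(set(x))
    return tuple(ranks.index(v) for v in x)

def order_equivalent(x, y):
    """True iff x and y have the same <, =, > pattern at every index pair."""
    return signature(x) == signature(y)

# (1, 5, 2) and (0, 9, 3) interleave the same way:
assert order_equivalent((1, 5, 2), (0, 9, 3))

# Classes of 3-tuples: 3! = 6 when entries are distinct,
# 13 (an ordered Bell number) when equalities are allowed.
distinct = {signature(p) for p in permutations(range(3))}
general = {signature(t) for t in product(range(3), repeat=3)}
print(len(distinct), len(general))  # 6 13
```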
Now we can define the key binary relation on sets of k-tuples of rationals. It uses order equivalence on 2k-tuples obtained by flattening pairs of elements of S and E:
Definition 4 Let k be fixed. Then S emulates E provided every 2k-tuple formed by concatenating two elements of S is order equivalent to some 2k-tuple formed by concatenating two elements of E.
Harvey calls this emulation and the corresponding equivalence he calls duplication. The notions naturally extend to concatenations of r elements from S and E for fixed r, when they are called r-emulation and r-duplication. In applications, E will be finite while S can be huge, but in writing the relation we still regard S as being constrained by E.
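Under this reading of the definition (our formalization; details may differ from Harvey's papers), emulation for k = 2 can be tested by brute force:

```python
from itertools import product

def signature(x):
    # Dense-rank pattern capturing the order type of a tuple.
    ranks = sorted(set(x))
    return tuple(ranks.index(v) for v in x)

def emulates(S, E):
    """S emulates E: every concatenation x + y of pairs drawn from S is
    order equivalent to some concatenation u + v of pairs drawn from E."""
    allowed = {signature(u + v) for u, v in product(E, repeat=2)}
    return all(signature(x + y) in allowed for x, y in product(S, repeat=2))

E = {(1, 2), (2, 1)}
print(emulates({(0, 3)}, E))          # True: (0,3,0,3) matches (1,2,1,2)
print(emulates({(0, 3), (5, 8)}, E))  # False: (0,3,5,8) has four distinct values
```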
The third key concept is a kind of invariance that is customizable. Its calibrations may determine the level of hardness of the statements obtained. They require identifying some elements of a k-tuple as being integers. Here is an example:
For all sequences of nonnegative rationals, each less than the integer where ,
The case says that for all ,
.
A stronger constraint is
whenever the stated condition holds. Here the designated entries stand for integers that are not necessarily distinct and, like the others, are not required to be in nondescending order. Harvey calls these various notions of being stable.
The final core ingredient is the usual notion of being maximal with respect to some property: S has the property but no proper superset of S does. It will suffice to consider sets obtained by adding one element to S. Now we can exemplify his statements in simple terms:
For every finite set E of k-tuples of rationals there is a set S that is maximal with regard to emulating E and is stable.
There are similar statements for duplication and other variants of emulation. Their definitive treatment will come in papers after #106 and #107 on Harvey’s manuscripts page.
The above concepts are all basic. They figure into many areas of combinatorial mathematics, for instance regarding preferences and rankings. None talks about Gödel numbers or Turing machines or proofs. The goal is to establish and classify instances of this phenomenon:
There are simple true statements involving little more than Q and < that require immensely strong set theories to prove.
Here “true” needs some explanation. If ordinary set theory cannot prove a statement S, with what confidence can we say S is true? The answer is that the aforementioned large cardinal axioms which Harvey uses to prove them are widely accepted. We note this 2014 post by Bill Gasarch on how Fermat’s Last Theorem was originally proved using theory derived from the axiom asserting the existence of a strongly inaccessible cardinal. This is equivalent to having a Grothendieck universe, by which Alexander Grothendieck proved statements used by Andrew Wiles in his proof. Colin McLarty later showed how to remove this dependency and prove Fermat in a fragment of Ernst Zermelo’s set theory—but the relevance of large cardinals to “real mathematics” remains.
So the upshot will be that ZFC plus some large cardinal axiom A will prove S but ZFC alone cannot (unless ZFC is inconsistent). It is possible for ZFC to be consistent whereas ZFC + A is inconsistent. All the known large cardinal axioms A have these relationships to ZFC:
The final—and long—step needed in Harvey’s proofs is to get point 1 with S in place of Con(ZFC). This exemplifies a reversal step in his program of Reverse Mathematics. From this, the above desired conclusion about ZFC not proving S follows. By point 2, ZFC cannot refute S either. Nevertheless, we can accept S as true. This is intended to rest not only on heuristic arguments in favor of the particular axiom A employed, as exemplified for Fermat, but on evidence that can be programmed.
Harvey’s ultimate goal will be realized when the effect of A is shown to be algorithmic. Here is a facile analogy. Suppose we build an infinite set S by a simple greedy algorithm. We get a “simple path” of finite sets S_n that grow successively larger: each S_n has cardinality n. We can view the process as structured by the ordinal ω, which is the limit of n as n grows. The idea is that more-ramified search structures can be “governed” by higher ordinals and patterns of descents from them. The end product will nevertheless always be a countable set of tuples and the evidence of progress will always be a longer finite path. Greedy may give only maximality, whereas the large cardinal axiom implies the existence of an infinite S with all stipulated properties. Though it may provide no guidance on building longer and longer finite paths, it implies their existence so that computers can always find a longer path by a finite search. The concrete import of this and further motivations are described in this Nautilus article from last year.
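The greedy process can be made concrete with a generic loop (a toy of ours, not Harvey's actual construction): scan a universe in order and keep each element that preserves a chosen property; the result is maximal because every rejected element would break the property.

```python
from math import gcd

def greedy_maximal(universe, has_property):
    """Add elements one at a time, keeping those that preserve the property.
    The finite 'path' of partial sets mirrors the S_n in the text."""
    S = []
    for x in universe:
        if has_property(S + [x]):
            S.append(x)
    return S

# Toy property: pairwise coprimality.
pairwise_coprime = lambda T: all(
    gcd(a, b) == 1 for i, a in enumerate(T) for b in T[i + 1:])

S = greedy_maximal(range(2, 30), pairwise_coprime)
print(S)  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
# Maximality: no remaining element of the universe can be added.
assert all(not pairwise_coprime(S + [x]) for x in range(2, 30) if x not in S)
```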
Open problems are most engaging when easy to state: the simple attracts and the complex repels. Simplicity allows them to be stated in the direction that most of us believe, yet we may think they are true but not know how to prove them. The whole point of Harvey’s quest is to create more milestones that are simple but hard-to-prove statements. How much will it help to have such statements?
Update 8/16/18: As noted also above, Harvey has posted a 68-page paper with full details. We will be giving it further coverage with his continued expository assistance.
Isaacs honorary conference
Martin Isaacs is a group theorist emeritus from the University of Wisconsin, Madison. I just picked up a copy of his 2008 book, Finite Group Theory. Simple title; great book.
Today I want to make a few short comments about group theory.
I always hated group theory. Well not really hated it. But I have never been able to get decent intuition about groups.
Ken has a similar issue and puts it this way: We are “spoiled” by first learning about fields and rings with Z and R as all-pervading examples, then Z_p and Z_m for prime p and non-prime m. Each includes an additive group and (for the fields) a multiplicative group but is so much more. They combine to form other spaces such as vector spaces over these fields.
Groups alone seem neither to command such sway as individual mathematical objects nor to combine so prolifically. The analogue of a vector space over a ring rather than a field is called a module and feels like a similar “hole” in our intuition.
Ken remembers Martin, Marty, well having known him during Marty’s visit to Oxford in 1984. Marty was often found in the lobby or the tea room of the old Mathematical Institute animatedly talking about open problems and ideas for teaching. His “group” included Graham Higman, Peter Neumann, Hilary Priestley, Robin Gandy, and anyone who happened to stop by. Ken remembers geniality, openness, and great ways of putting things.
Of course group theory is important in many parts of complexity theory. Here are two important examples of group theory open questions:
The latter is of course asking if a solvable group can be used like a non-abelian simple group in computing via bounded width programs. We have discussed both of these questions before: check out
this and this.
I definitely would suggest you look at Isaacs’s book if you are at all interested in getting some insight into group theory. I have been looking at it and now feel that I have intuition about groups. Not much. But some. The issue is all mine, since the book is so well written.
Isaacs makes a cool remark in his book on group theory. Suppose that G is a group with a non-trivial normal subgroup N. Then often we can conclude that at least one of N or G/N is solvable. This can be done in the case that the orders of N and G/N are co-prime. The proof of this is quite amusing: it depends on two theorems:
Theorem 1 If a and b are co-prime numbers, then at least one of a and b is odd.
Theorem 2 Every group of odd order is solvable.
One of these is very trivial and the other is very non-trivial. I trust you know which is which.
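For the record, the trivial one can even be machine-checked (a throwaway Python check of ours):

```python
from math import gcd

# Two co-prime numbers cannot both be even, since 2 would divide both.
assert all(a % 2 == 1 or b % 2 == 1
           for a in range(1, 200) for b in range(1, 200) if gcd(a, b) == 1)
print("ok")
```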
Group theory uses second order reasoning, even for elementary results, quite often. This sets it apart from other parts of mathematics. A typical group theory result is often proved by the following paradigm:
Let H be a subgroup of G that is largest with regard to some property. Then show that H does not exist, or that H must also have some other property.
When we are talking about finite groups then of course there are only finitely many subgroups and so the notion of “largest” involves nothing subtle. Moreover, one can transfer all this into first-order sentences about their string encodings. But the theory is really naturally second-order. For infinite groups the notion of maximal can be really tricky. For example, the group of all complex roots of unity has no maximal subgroups at all.
Should we teach groups and modules in a richer fashion? Is it really hard to get intuition in group theory? Or is that just an example of why mathematics is hard?
[fixed name in line 1; fixed to “N and G/N” in section 3]
Maruti Ram Murty is a famous number theorist at Queen’s University in Kingston, Canada. He is a prolific author of books. His webpage has thumbnails of over a dozen. He has an Erdős number of 1 from two papers—very impressive.
Today Ken and I want to talk about a not-so-recent result of his that is also a “lower bound” type result.
Murty’s theorem in question is referenced in his 2006 paper with Nithum Thain:
Theorem 1 A ‘Euclidean’ proof exists to show there are infinitely many primes congruent to a modulo n if and only if a² ≡ 1 (mod n).
What he proves essentially is a “lower bound proof.” This theorem gives a limit on the famous Euclidean method of proving that there are infinitely many primes. Let’s take a look at it.
Say that a residue a modulo n is abundant provided there are infinitely many primes congruent to a modulo n. Then a must be relatively prime to n, so at most φ(n) residues can be abundant. Gustav Dirichlet proved in 1837 that all of those are:
Theorem 2 For all n the number of residue classes modulo n that are abundant is φ(n).
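Dirichlet's theorem is easy to witness empirically. A quick sieve (our illustration) tallies primes below 100,000 by residue class modulo 12; only the φ(12) = 4 classes coprime to 12 accumulate primes:

```python
from collections import Counter

def primes_up_to(n):
    """Simple sieve of Eratosthenes returning the list of primes <= n."""
    sieve = bytearray([1]) * (n + 1)
    sieve[:2] = b"\x00\x00"
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = bytearray(len(range(p * p, n + 1, p)))
    return [p for p in range(n + 1) if sieve[p]]

counts = Counter(p % 12 for p in primes_up_to(100_000))
print(sorted(counts.items()))
# Classes 1, 5, 7, 11 each hold roughly a quarter of the 9592 primes;
# classes 2 and 3 hold only the primes 2 and 3 themselves.
```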
So why do we care about “Euclidean” proofs? Dirichlet’s proof uses complex analysis—see this for example. As with the Prime Number Theorem there has long been a feeling that the natural numbers should divulge their secrets by more “elementary” means. Atle Selberg found analysis-free proofs of both theorems—see our discussion here. Opinions remain that the structures involved in these and similar proofs do not shed more light on the structure of the primes.
What is an appropriate level of proof complexity? This is subjective. What we can do is enumerate techniques that everyone agrees convey numerical beauty. Among them are quadratic reciprocity (QR) and cyclotomic theory (CT), albeit they hail from the time of Leonhard Euler and Adrien-Marie Legendre and Carl Gauss rather than Euclid. Indeed, what Gauss considered his nicest proof of QR used CT. With that in mind, the following arguments are considered “Euclidean” (we follow these notes by Keith Conrad):
Theorem 3 For every n, the residue class 1 modulo n is abundant.
Proof Sketch: Take Φ_n to be the n-th cyclotomic polynomial. The CT theorems we lean on are that Φ_n(a) is relatively prime to a for every a, and every prime divisor of Φ_n(a) either divides n or is congruent to 1 modulo n. This first gives us the fact of the class having some prime. Now suppose p_1, …, p_r were all such primes and consider some prime divisor q of Φ_n(N) where N = n·p_1⋯p_r. Then q cannot be a divisor of n nor any of the p_i because Φ_n(N) is relatively prime to N by the first fact. This is the echo of Euclid’s proof. The echo is focused by the second fact giving q ≡ 1 (mod n). So q must be a prime in the residue class other than p_1, …, p_r, which gives us the “Euclidean” conclusion that the class is abundant.
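The sketch is effective in a naive way: with the primes found so far multiplied into N = n·p_1⋯p_r, any prime factor of Φ_n(N) is new and lies in the class 1 mod n. A Python toy for n = 5, where Φ_5(x) = x^4 + x^3 + x^2 + x + 1 (two rounds only, since trial-division factoring is slow):

```python
from math import prod

def phi5(x):
    # The 5th cyclotomic polynomial, (x^5 - 1)/(x - 1).
    return x**4 + x**3 + x**2 + x + 1

def least_prime_factor(m):
    d = 2
    while d * d <= m:
        if m % d == 0:
            return d
        d += 1
    return m

found = []  # primes ≡ 1 (mod 5) produced by the Euclidean recipe
for _ in range(2):
    N = 5 * prod(found)               # prod([]) == 1, so N starts at 5
    q = least_prime_factor(phi5(N))   # every prime factor is ≡ 1 (mod 5)
    found.append(q)

print(found)  # [11, 211]: Φ_5(5) = 781 = 11·71 and Φ_5(55) = 9320081 = 211·44171
```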
Theorem 4 The residue classes 3, 5, and 7 modulo 8 are all abundant.
Proof Sketch: For 5 we note that 5 itself is prime and suppose N is the product of all primes in that class. We use the cyclotomic polynomial Φ_4(x) = x² + 1 with x = 2N, giving 4N² + 1. Divisors of 4N² + 1 are hence either 1 or 5 modulo 8. However, 4N² + 1 itself is 5 modulo 8, so it cannot have all its prime divisors be 1 modulo 8, so some q dividing it must be 5 modulo 8, but it cannot be any of the primes multiplied into N, so again we have the “Euclidean” contradiction to those being all the primes in that class.
For 3 or 7 we note that again the residue class is immediately populated and suppose N is the product of all its prime members. We use the polynomial value N² + 2 or N² − 2 instead. If p divides it then N² ≡ ∓2 (mod p), so ∓2 is a quadratic residue modulo p, and QR theory (noting p is odd and coprime to N) tells us that p must be 1 or 3 modulo 8 in the first case and 1 or 7 modulo 8 in the second. The final observation is that since N is odd, N² ≡ 1 (mod 8), so N² + 2 ≡ 3 and N² − 2 ≡ 7 (mod 8). Thus the value needs to have a prime divisor that is 3 (respectively 7) modulo 8 but it cannot be any of the primes in N. Once again this is the Euclidean contradiction.
That “final observation” is what Murty showed—perhaps surprisingly—to be necessary in order to find a suitable polynomial to begin with. For 24 we get all classes but for 16, for instance, only 7, 9, and 15 (besides 1) obey Murty’s criterion. To prove more classes abundant we want to widen the scope of the proofs.
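Murty's criterion is easy to tabulate (assuming, as above, that the criterion for class a modulo n is a² ≡ 1 (mod n)):

```python
from math import gcd

def murty_classes(n):
    """Residues a admitting a 'Euclidean' proof under the criterion a^2 ≡ 1 (mod n)."""
    return [a for a in range(1, n) if gcd(a, n) == 1 and a * a % n == 1]

print(murty_classes(24))  # [1, 5, 7, 11, 13, 17, 19, 23]: every coprime class
print(murty_classes(16))  # [1, 7, 9, 15]
print(murty_classes(13))  # [1, 12]: just 1 and -1 for a prime modulus
```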
We connect Dirichlet’s Theorem to the famous conjecture by Christian Goldbach that every even number from 4 onward is the sum of two primes. A straightforward sieve argument implies that a positive proportion of the even integers can be written as the sum of two primes, but much stronger is known. This 2007 paper by Andrew Granville cites a claim to show that only a vanishing fraction of the even integers up to x are exceptions, though its author, János Pintz, has only recently released a long proof of a stepping-stone result. Conditioned on the Generalized Riemann Hypothesis, Daniel Goldston improved the classical bound of Godfrey Hardy and John Littlewood.
There are however explicitness issues with the constants involved in all unconditional estimates sharper than x divided by a polynomial in log x. None of the proofs is “short.” Hence we’ll say “most” to mean an intuitively provable bound E(x) on the number of Goldbach exceptions up to x such that E(x)/x goes to 0. Here is our main observation:
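Empirically there are no Goldbach exceptions at all in any range we can check; a brute-force verification (ours) up to 10,000:

```python
def prime_table(n):
    """Bytearray t with t[m] == 1 iff m is prime, for m <= n."""
    t = bytearray([1]) * (n + 1)
    t[:2] = b"\x00\x00"
    for p in range(2, int(n ** 0.5) + 1):
        if t[p]:
            t[p * p :: p] = bytearray(len(range(p * p, n + 1, p)))
    return t

LIMIT = 10_000
is_prime = prime_table(LIMIT)
exceptions = [m for m in range(4, LIMIT + 1, 2)
              if not any(is_prime[p] and is_prime[m - p]
                         for p in range(2, m // 2 + 1))]
print(exceptions)  # []
```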
Suppose that most even numbers are the sum of two primes. Then there is a short proof that, for each modulus n, on the order of √n residues are abundant.
Let’s explain this statement. First it is not exactly a theorem. Why not? Even granting our meaning of (a short proof of) “most” the conclusion is still subjective. But we can turn the subjectivity to advantage by viewing the statement this way:
Any simple proof of “most” for Goldbach must show that many residue classes contain an infinite number of primes.
Our first point is that we get more than the number of abundant classes from Murty’s criterion. The latter is 2^r · c where r is the number of distinct odd prime factors of n and c is 1 or 2 or 4 depending on the power of 2 dividing n—in all events this number is n^{o(1)}. We pay a price of explicitness, however: we may not know which classes are abundant.
Theorem 5 Suppose that most even numbers are the sum of two primes. Let A be the number of residue classes modulo n that are abundant. Then A(A+1)/2 ≥ n/2.
Proof: Suppose for contradiction that A(A+1)/2 < n/2. Then there is some even residue c that is not of the form a + b for abundant residues a and b (a = b included). Let B bound the size of any prime in the non-abundant residues. Then for any even number m ≡ c (mod n) and primes p ≤ q such that m = p + q, at least one of p or q must come from a non-abundant residue class and hence be at most B. Thus as x grows, the set of even m ≤ x with m ≡ c (mod n) has only o(x) members that can be expressed as sums of two primes, since each such m is p + q with p ≤ B and q prime. Hence the density of Goldbach exceptions up to x is bounded below by roughly 1/n, in violation of the most-Goldbach theorem. Thus A(A+1)/2 ≥ n/2.
For n = 4 we get that the minimum is 2 so we get both 1 and 3 without needing QR. For n = 8 and n = 12 however we only get 3, so all but one of the possible residue classes again are abundant. Clearly the bound gets worse as we move along. Oh well. But we can still try to improve it.
We do note, however, that the bound we get from our approach is better than any from Euclidean methods for moduli n that are prime. By Murty’s theorem it follows that for a prime n there are only two residue classes that are provable via the Euclidean method: these are 1 and n − 1. From estimates noted above, our bounds are better as n grows for any kind of modulus.
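For comparison, a few values of the two lower bounds (our computation): the least A with A(A+1)/2 ≥ n/2 versus the Murty count of residues a with a² ≡ 1 (mod n).

```python
from math import gcd

def min_abundant(n):
    """Least A with A(A+1)/2 >= n/2, the pair-counting lower bound."""
    A = 1
    while A * (A + 1) // 2 < n // 2:
        A += 1
    return A

def murty_count(n):
    """Number of residues provable by the Euclidean method (a^2 ≡ 1 mod n)."""
    return sum(1 for a in range(1, n) if gcd(a, n) == 1 and a * a % n == 1)

for n in (4, 8, 12, 24, 101, 1009):
    print(n, min_abundant(n), murty_count(n))
# For prime moduli Murty's count stays at 2 while the pair-counting
# bound grows like the square root of n, e.g. min_abundant(1009) == 32.
```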
In some special cases we can make further inferences from the “most” property of Goldbach. Let’s look at n = 12. Then there are four residue classes but we still only get 3. So we miss one. But we can do better. The residue classes are 1, 5, 7, 11. The only way to get 4 (mod 12) by summing two of them is 5 + 11. The only way to get 8 is summing the other two, 1 + 7. Hence if, say, 5 were not abundant, then only finitely many even numbers congruent to 4 modulo 12 could be sums of two primes, violating the “most Goldbach” property.
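Taking modulus 12 as the example (our choice), this bookkeeping can be verified mechanically: among the coprime residues, the even value 4 arises only as 5 + 11 and 8 only as 1 + 7.

```python
from itertools import combinations_with_replacement
from math import gcd

n = 12
residues = [a for a in range(n) if gcd(a, n) == 1]  # [1, 5, 7, 11]
ways = {}
for a, b in combinations_with_replacement(residues, 2):
    ways.setdefault((a + b) % n, []).append((a, b))

print(ways[4], ways[8])  # [(5, 11)] [(1, 7)]
# If 5 or 11 were not abundant, even numbers ≡ 4 (mod 12) could have only
# finitely many representations as a sum of two primes; similarly for
# 8 (mod 12) with the pair 1, 7.
```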
The case n = 24 uses the symmetry between a and 24 − a. Taking n = 24 gives A ≥ 5, which just satisfies Theorem 5. However, a set of 5 residues will yield two pairs summing to the same value—meaning not enough pairs to cover the congruences of even numbers modulo 24—unless the three excluded values hit each of the following eight equations viewed as triples:
An inspection shows that there is no hitting set of size 3, so we get that at least 6 classes are abundant. We could try further to argue the way we did with n = 12 but it does not work: if the excluded pair is, say, 5 and 19 (or any of the other three pairs summing to 24) then the remaining six residues cover all even-number congruences. Thus getting 6 abundant residues is the most for our style of argument thus far.
Are there possible further improvements? Perhaps a way to leverage the tighter bounds on the density of the Goldbach exception set?
We close by noting one other quirk of Dirichlet’s Theorem. If we are given a residue a modulo n and could always guarantee constructing one prime in some designated residue class of a larger modulus, then we could construct infinitely many in the class of a. Namely, suppose we had constructed p_1, …, p_r so far. Choose c such that c ≡ a (mod n) and, for each i, c ≢ 0 (mod p_i). Now take the modulus n·p_1⋯p_r and the residue c. The prime we construct from our assumption is different from each p_i but still congruent to a modulo n, so we can add it to our set and continue.
Thus if what one might call “Dirichlet-one” is constructive then the full Dirichlet Theorem becomes elementary in a clear sense. Indeed, we get the next prime explicitly, not as an unspecified divisor. This connection raises some hope of sharper estimates that play off certain residues having zero primes rather than finitely many, though they are not immediately to hand.
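As a toy, brute-force search can stand in for the assumed one-prime constructor (the function names and the class 3 mod 10 are our choices); iterating it past the largest prime found so far yields as many primes in the class as we like:

```python
def is_prime(m):
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

def dirichlet_one(a, n, lower):
    """Stand-in oracle: return some prime ≡ a (mod n) exceeding `lower`."""
    m = lower + 1
    while not (m % n == a % n and is_prime(m)):
        m += 1
    return m

found = []
for _ in range(5):
    found.append(dirichlet_one(3, 10, max(found, default=1)))
print(found)  # [3, 13, 23, 43, 53]
```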
Can we improve the above method to prove more than a square-root bound? It would be neat if we could show that more is true. There are further ideas that Ken and I are thinking about—more in the future.