Cropped from src1 & src2
Prasad Tetali and Robin Thomas are mathematicians at Georgia Tech who are organizing the Conference Celebrating the 25th Anniversary of the ACO Program. ACO stands for our multidisciplinary program in Algorithms, Combinatorics and Optimization. The conference is planned for this coming Monday through Wednesday, January 9–11, 2017.
Today I say “planned” because there is some chance that Mother Nature could mess up our plans.
Atlanta is expected to get a “major” snow storm this weekend. Tech was already closed this Friday. It could be that we will still be closed Monday. The storm is expected to drop 1-6 inches of snow and ice. That is not so much for cities like Buffalo in the north, but for us in Atlanta that is really a major issue. Ken once flew here to attend an AMS-sponsored workshop and play chess but the tournament was canceled by the snowfall described here. So we hope that the planned celebration really happens on time.
Attendance is free, so check here for how to register.
The program has a wide array of speakers. There are 25 talks in all including two by László Babai. I apologize for not listing every one. I’ve chosen to highlight the following for a variety of “random” reasons.
László Babai
Graph Isomorphism: The Emergence of the Johnson Graphs
Abstract: One of the fundamental computational problems in the complexity class NP on Karp’s 1973 list, the Graph Isomorphism problem asks to decide whether or not two given graphs are isomorphic. While program packages exist that solve this problem remarkably efficiently in practice (McKay, Piperno, and others), for complexity theorists the problem has been notorious for its unresolved asymptotic worst-case complexity.
In this talk we outline a key combinatorial ingredient of the speaker’s recent algorithm for the problem. A divide-and-conquer approach requires efficient canonical partitioning of graphs and higher-order relational structures. We shall indicate why Johnson graphs are the sole obstructions to this approach. This talk will be purely combinatorial, no familiarity with group theory will be required.
This talk is the keynote of the conference. Hopefully Babai will update us all on the state of this graph isomorphism result. We have discussed here his partial retraction. I am quite interested in seeing what he has to say about the role of Johnson graphs. These were discovered by Selmer Johnson. They are highly special: they are regular, vertex-transitive, distance-transitive, and Hamilton-connected. I find it very interesting that such special graphs seem to be the obstacle to progress on the isomorphism problem.
Petr Hliněný
A Short Proof of Euler-Poincaré Formula
Abstract: We provide a short self-contained inductive proof of the famous Euler–Poincaré Formula for the numbers of faces of a convex polytope in every dimension. Our proof is elementary and does not use shellability of polytopes.
The paper for this talk is remarkably short, only 3 pages. Of course the result has been around since the 1700s and 1800s, and David Eppstein already has a list of 20 proofs of it, so what is the point? It has to do with ways of proving things and the kind of dialogue we can have with ourselves and/or others about what is needed and what won’t work. Imre Lakatos famously codified this process, with this theorem as a running example conjuring up the so-called Lakatosian Monsters. Perhaps the talk will slay the monsters, but it will have to brave some snow and ice first.
Luke Postle
On the List Coloring Version of Reed’s Conjecture
Abstract: In 1998, Reed conjectured that the chromatic number of a graph is at most halfway between its trivial lower bound, the clique number, and its trivial upper bound, the maximum degree plus one. Reed also proved that the chromatic number is at most some convex combination of the two bounds. In 2012, King and Reed gave a short proof of this fact. Last year, Bonamy, Perrett and I proved that a fraction of 1/26 away from the upper bound holds for large enough maximum degree. In this talk, we show using new techniques that the list-coloring versions of these results hold, namely that there is such a convex combination for which the statement holds for the list chromatic number. Furthermore, we show that for large enough maximum degree, a fraction of 1/13 suffices for the list chromatic number, improving also on the bound for ordinary chromatic number. This is joint work with Michelle Delcourt.
Mohit Singh
Nash Social Welfare, Permanents and Inequalities on Stable Polynomials
Abstract: Given a collection of items and agents, the Nash social welfare problem aims to find a fair assignment of these items to agents. The Nash social welfare objective is to maximize the geometric mean of the valuations of the agents in the assignment. In this talk, we will give a new mathematical programming relaxation for the problem and give an approximation algorithm based on a simple randomized algorithm. To analyze the algorithm, we find new connections of the Nash social welfare problem to the problem of computing the permanent of a matrix. A crucial ingredient in this connection will be new inequalities on stable polynomials that generalize the work of Gurvits. Joint work with Nima Anari, Shayan Oveis-Gharan and Amin Saberi.
There are two open problems. One is: will we be snowed in or snowed out this Monday? The other is: can some of the open problems raised by these talks be solved?
Even after today’s retraction of quasi-polynomial time for graph isomorphism
Cropped from source
László Babai is famous for many things, and has made many seminal contributions to complexity theory. Last year he claimed that Graph Isomorphism (GI) is in quasi-polynomial time.
Today Laci posted a retraction of this claim, conceding that the proof has a flaw in the timing analysis, and Ken and I want to make a comment on what is up. Update 1/10: He has posted a 1/9 update reinstating the claim of quasi-polynomial time with a revised algorithm. As we’ve noted, he is currently speaking at Georgia Tech, and we hope to have more information soon.
Laci credits Harald Helfgott with finding the bug after “spending months studying the paper in full detail.” Helfgott’s effort and those by some others have also confirmed the mechanism of Laci’s algorithm and the group-theoretic analysis involved. Only the runtime analysis was wrong.
Helfgott is a number theorist whose 2003 thesis at Princeton was supervised by Henry Iwaniec with input by Peter Sarnak. Two years ago we discussed his claimed proof of the Weak Goldbach Conjecture, which is now widely accepted.
In December 2015, Laci posted to ArXiv an 89-page paper whose title claimed that GI can be solved in quasi-polynomial time. Recall that means that the algorithm runs in time $n^{(\log n)^{c}}$ for some constant $c$. This is an important time bound that is above polynomial time, but seems to be the right time bound for many problems. For example, group isomorphism has long been known to be in quasi-polynomial time. But the case of graphs is much more complex, and this was the reason that Babai's claimed result was so exciting. We covered it here and here plus a followup about the string isomorphism problems that were employed.
He also chose to give a series of talks on his result. Some details of the talks were reported by Jeremy Kun here.
Retracting a claim is one of the hardest things that any researcher can do. It is especially hard to say when to stop looking for a quick-fix and make an announcement. It may not help Laci feel any better, but we note that Andrew Wiles’s original proof of Fermat’s Last Theorem was also incorrect and took 15 months to fix. With help from Richard Taylor he repaired his proof and all is well. We wish Laci the same outcome—and we hope it takes less time.
In particular, his algorithm still runs faster than $2^{n^{\epsilon}}$ for any $\epsilon > 0$ you care to name. For comparison, for more than three decades before this paper, the best worst-case time bound was essentially $\exp(O(\sqrt{n \log n}))$, due to Eugene Luks in 1983. The new bound in full is
$$\exp\bigl(\exp\bigl(c\sqrt{\log n}\,\bigr)\bigr)$$ for some fixed $c$ that will emerge in the revised proof.
The important term is the $\sqrt{\log n}$. The function $\exp(\sqrt{\log n})$ is exponential in $\sqrt{\log n}$. We previously encountered a recursion involving $\sqrt{\log n}$ in the running time of space-conserving algorithms for undirected graph connectivity (see this paper) before Omer Reingold broke through by getting the space down to $O(\log n)$ and (hence) the time down to polynomial. So there is some precedent for improving it.
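As a rough numerical sketch (with arbitrary small constants chosen for illustration; the true constant $c$ is whatever the revised proof yields), here is how the growth regimes separate when we compare the logarithms of the running-time bounds:

```python
import math

# Each function returns log2 of a running-time bound, so the huge
# quantities stay comparable as ordinary floats.

def log2_poly(n, k=3):
    return k * math.log2(n)                 # n^k (polynomial)

def log2_quasipoly(n, c=1):
    return math.log2(n) ** (c + 1)          # n^((log2 n)^c) (quasi-polynomial)

def log2_amended(n, c=1.0):
    # exp(exp(c * sqrt(ln n))): the shape of the amended GI bound
    return math.exp(c * math.sqrt(math.log(n))) / math.log(2)

def log2_twoexp(n, eps=0.1):
    return n ** eps                         # 2^(n^eps)

n = 2 ** 300
vals = [log2_poly(n), log2_quasipoly(n), log2_amended(n), log2_twoexp(n)]
# At this n the four log-costs are already strictly increasing in order:
# polynomial < quasi-polynomial < amended bound < 2^(n^eps).
assert vals == sorted(vals) and len(set(vals)) == 4
```

The amended bound eventually outgrows every quasi-polynomial function yet stays below $2^{n^{\epsilon}}$ for every fixed $\epsilon$, which is exactly the "extended neighborhood" phenomenon discussed next.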
As things stand, however, GI remains in the “extended neighborhood” of exponential time. Here is how to define that concept: Consider numerical functions given by formulas built using the operations $+$, $\times$, and $\exp$. Assign each formula a level by the following rules:
Note that if $f$ has level $\ell$ then so does the power $f^k$ for any fixed $k$, because $f^k$ is a product of $k$ copies of $f$. The functions of the lowest level include not only all the polynomials but also all quasi-polynomial functions and ones such as $\exp\bigl((\log n)^{g(n)}\bigr)$, which is higher than quasi-polynomial when $g(n) \to \infty$.
The amended bound on GI, however, belongs to level $2$, which is what we mean by its staying in the extended neighborhood of exponential time. This is the limit on regarding the amended algorithm as “sub-exponential.”
It also makes us wonder about why it is so difficult to find natural problems with intermediate running times. We can define this notion by expanding the notion of “level” with a new rule for functions that are sufficiently well behaved:
Rule 5 subsumes rules 3 and 4 given that has level and has level . A special case is that when and has level , then has level .
We wonder when and where rule 5 might break down, but we note that careful application of rule 2 for multiplication when expanding a power makes it survive the fact that , , , and so on all have the same level. It enables defining functions of intermediate levels where .
Can the GI algorithm be improved to an intermediate level?
We note one prominent instance of level in lower bounds: Alexander Razborov and Steven Rudich proved unconditionally in their famous “Natural Proofs” paper that no natural proof can show a level higher than for the discrete logarithm problem.
The obvious open problems are dual. Is the amended result fully correct? And can the original quasi-polynomial time be restored in the near future, or at least some intermediate level achieved? We hope so.
[fixed discussion of terms related to , added to the intro an update about the claim being reinstated]
AIP source—see also interview
Robert Marshak was on hand for Trinity, which was the first detonation of a nuclear weapon, ever. The test occurred at 5:29 am on July 16, 1945, as part of the Manhattan Project. Marshak was the son of parents who fled pogroms in Byelorussia. Witnessing the test, hearing of the destruction of Hiroshima and Nagasaki, and knowing his family history led him to become active in advancing peace. He soon co-founded and chaired the Federation of Atomic Scientists and was active in several other organizations promoting scientific co-operation as a vehicle of world peace. In 1992 he won the inaugural award of the American Association for the Advancement of Science for Science Diplomacy.
Today, the fifth day of both Chanukah and Christmas, we reflect on the gift of international scientific community.
International scientific co-operation is a theme of the movie Arrival and the story on which it is based. However, the key plot turn is a personal contact. A new example of the former is a vaccine for the Ebola virus. This item ends with words by Swati Gupta of the Merck pharmaceutical company:
“There’s been a lot of international partners that have come together in a real unprecedented effort.” The magnitude of the outbreak in West Africa, she says, made companies, governments and academic institutions push aside their own research agendas to come together and finish a vaccine.
There are countless other gifts from the former to be thankful for. We however will sing the latter, the personal side, while highlighting the role of shared experience and values in fostering research.
Marshak and his first student, George Sudarshan, worked out the “V−A” (vector minus axial vector) structure needed to describe certain fermion interactions. Recall a fermion is a particle that obeys the Pauli exclusion principle. They published in the proceedings of a 1957 conference in Italy, whereas its second discoverers, Richard Feynman and Murray Gell-Mann, published in a major journal. This bio of Marshak speculates that uncertainty about priority warded off a Nobel Prize, but one can also point to the theory’s incompleteness in describing the weak nuclear force. First, it allowed models that conserve CP. The surprising discovery in 1964 that nature does not conserve CP won a Nobel Prize for James Cronin and Val Fitch. Second, its framework could not adapt to introduce a carrying particle for the weak force, obstructing the renormalization procedure by which predictions at high energies can be calculated. Still, the concept is a standard building block and remains consistent.
Marshak became chair of the University of Rochester physics department where he had started before the war. The same bio credits him with elevating UR to the level of other top-10 physics departments but being unable to land the same caliber of students. Hence he specially reached out to the brightest of India, Pakistan, and Japan in particular. Sudarshan hailed from Kerala, India, and verged on a Nobel later as well.
Marshak was among “approximately six” US scientists who visited the Soviet Union after the death of Josef Stalin in 1953 made contact possible, and he made several return visits in the 1950s. In the 1960s many Rochester colleagues induced him to lead the faculty senate against conservative policies of the university administration. He then became president of CCNY, where he stirred together tuition-free and open-admissions policies, leading to explosive growth and demographic change.
His final years as a professor at Virginia Tech from 1979 were no less active: as president of the American Physical Society he channeled scientific debate about the feasibility of Ronald Reagan’s SDI, and he brokered an exchange agreement with the Chinese Academy of Sciences. He tragically passed away from a swimming accident on December 23, 1992, a day after completing the final corrections of his textbook Conceptual Foundations of Modern Particle Physics.
Sudarshan edited a posthumous book of essays in tribute to Marshak. Its title, A Gift of Prophecy, references both the many correct guesses about foundations of physics that Marshak made and the fruition of international science as he had envisioned it. Its publisher’s blurb begins:
Marshak devoted much of his life to helping other people carry out scientific research and gather to discuss their work.
From having attended many international conferences and workshops, Dick and I know that much else goes on besides “discussing our work.” We will discuss our families, our home towns, our local academic circumstances. We talk about culture and politics, usually treating culture as common and politics as comparative. We even talk about sports. There is time for discussing specific research problems, but conference excursions more often lend themselves to discussing big and general scientific questions. We have and give individual opinions, yes, but what emerges is the realization that we share a common frame of reference.
To say the shared frame is Rationality—versus whatever—would be facile. To me the frame is distinguished most by the absence of negotiations compared to some other kinds of international contacts. Negotiations at their best are non-zero-sum, but at our best the thought of zero-sum never arises. Instead we are all builders, not only of our field but of common understandings.
Within our departments there are negotiations over resources, but they are between subfields not polities, and our students are shielded from them as much as possible. For instance, nothing is vested in whether the Buffalo CSE graduate student association is led by an American or Chinese or Indian or Iranian (or more) and nobody cares because we have built shared experience and know the common work to do. The similarity of academic life in many locales helps us see humanity first and region afterward. Having advanced to the level of international contact makes us de-facto leaders in our fields, and of course we should pursue international initiatives when opportune, but we submit also this thesis:
A robust international “go-alongishness” may prove more enduring and valuable than any one initiative.
I have also been happy to interact some with students in departments abroad, most recently while teaching a short course at the University of Calcutta last August. I was struck by the similarity of the basic outlook also when speaking at a one-day workshop in Pune, India. Is our communality robust enough to stand up to changing political winds? We fervently hope this for the years ahead, as it was Marshak’s hope.
What are your views on the value of community? For one axis of value, almost no conferences still hold face-to-face program committee meetings: the Internet, e-mail, and spreadsheets are so convenient and cost-effective that they outweigh the sometimes-remarked loss of deliberation. But is there any move toward promoting remote participation in the conferences themselves, and what more would be lost in doing so?
We are thankful for our many friends in our community and wish all of you the best in the coming New Year.
A second look at Voronin’s amazing universality theorem
Anatoly Karatsuba and Sergei Voronin wrote a book on Bernhard Riemann’s zeta function. The book was translated into English by Neal Koblitz in 1992. Among its special content is expanded treatment of an amazing universality theorem about the classic zeta function proved by Voronin in 1975. We covered it four years ago.
Today Ken and I take a second look and explore a possible connection to complexity theory.
Although the theorem has been known for over forty years there is more to say about it. Voronin extended it to other Dirichlet $L$-functions in ways improved by Bhaskar Bagchi, who also proved universality for the derivative of zeta. Within the past fifteen years, Ramūnas Garunkštis wrote two papers improving the effectiveness of the bounds in the theorem, and Jörn Steuding wrote two papers bounding the density of real values $t$ such that the shifted function $\zeta(s + it)$ approximates a given complex function $f$. Both were part of a 2010 followup paper. Steuding also wrote a wonderful set of notes on the theorem. A 2011 paper by Chris King with beautiful graphic images expands a short 1994 preprint by Max See Chin Woon on how $\zeta$ encodes universal information, an aspect highlighted in our earlier post and on Matthew Watkins’s nice page on Voronin’s theorem.
I have known about the theorem for much of this time, but just recently thought there might be some natural way to exploit it to get computational complexity information about the zeta function. Is it surprising to think of such a link? Karatsuba himself emblemizes it, because he devised the first algorithm for multiplying two $n$-digit integers that beats the grade-school method. The study of $\zeta$ is so rich that a wider translation of complexity aspects could have profound impact.
Recall the famous zeta function is defined for complex $s$ with real part $> 1$ by the summation:
$$\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^{s}}.$$
And zeta is extended to all complex numbers except $s = 1$, where it has a pole, via the Riemann functional equation.
The zeta function holds the secrets to the global behavior of the primes. Even the seemingly weak statement that it does not vanish on complex $s$ with $\Re(s) = 1$ is enough to prove the Prime Number Theorem. This theorem says that the number of primes less than $x$ is approximately equal to $x/\ln x$.
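As a quick numerical illustration of the Prime Number Theorem’s estimate (a sanity check, not part of any proof), a simple sieve shows that the count of primes below $10^5$ is already within about 10% of $x/\ln x$:

```python
import math

def count_primes_below(x):
    """Count primes < x with a simple sieve of Eratosthenes."""
    sieve = [True] * x
    sieve[0:2] = [False, False]
    for p in range(2, math.isqrt(x) + 1):
        if sieve[p]:
            sieve[p*p::p] = [False] * len(sieve[p*p::p])
    return sum(sieve)

x = 100_000
pi_x = count_primes_below(x)        # the actual prime count pi(10^5)
approx = x / math.log(x)            # the Prime Number Theorem estimate
assert pi_x == 9592                 # known value of pi(10^5)
assert abs(pi_x / approx - 1) < 0.11  # within about 10% already at x = 10^5
```

The ratio $\pi(x) / (x/\ln x)$ approaches $1$ only slowly; using the logarithmic integral instead of $x/\ln x$ gives a much tighter fit.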
The critical section of the zeta function is the strip of $s$ with real part between $0$ and $1$. The critical line is the middle part where $\Re(s) = \tfrac{1}{2}$.
The famous Riemann Hypothesis states that all complex zeroes of the zeta function lie on the critical line. This problem remains wide open, although it has been claimed many times. Even proving that no zero lies in the region with real part in $(1 - \delta, 1)$ for some fixed $\delta > 0$ seems hopeless. But we can all hope for surprises. Perhaps 2017 will be the magical year for a breakthrough, since it is prime and the only prime between 2011 and 2027.
Let’s turn to a formal statement of Voronin’s Theorem (VT):
Theorem 1 Let $0 < r < \tfrac{1}{4}$, and let $f$ be a continuous function on the disc $|z| \le r$ that is non-vanishing there and analytic in its interior. Then for every $\epsilon > 0$ there exists $t$ such that
$$\max_{|z| \le r} \left|\zeta\left(z + \tfrac{3}{4} + it\right) - f(z)\right| < \epsilon.$$
Voronin’s book with Karatsuba starts by proving this with the natural logarithm $\log \zeta$ in place of $\zeta$, taking the branch of $\log \zeta(s)$ that is real for real $s > 1$. Then $f$ is allowed to have zeroes on the disc. Antanas Laurinčikas has extended universality to some other functions besides $\zeta$.
Really Voronin proved even more: for every $\epsilon > 0$:
$$\liminf_{T \to \infty} \frac{1}{T}\,\lambda\Bigl\{\, t \in [0,T] : \max_{|z| \le r} \bigl|\zeta\bigl(z + \tfrac{3}{4} + it\bigr) - f(z)\bigr| < \epsilon \,\Bigr\} > 0,$$
where $\lambda$ is Lebesgue measure, with a similar statement for $\log \zeta$.
§
If the above statements seem a bit technical let’s try a more informal notion of what VT says. Let $x = x_1 x_2 x_3 \cdots$ be some infinite binary string.
We might agree to call it “universal” provided for any finite string $y$ there is a $t$ so that
$$x_{t+1} x_{t+2} \cdots x_{t+|y|} = y.$$
Clearly such universal strings easily exist. Even stronger, we could insist that there not only be one $t$ that makes the above true, but that there are an infinite number of $t$’s. Even stronger, we could insist that the set
$$S_y = \{\, t : x_{t+1} x_{t+2} \cdots x_{t+|y|} = y \,\}$$
has positive lower density. That is, there is some $\delta > 0$ such that for all sufficiently large $T$, the number of $t$ in the set that occur in the interval $[1, T]$ is at least $\delta T$.
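For instance, concatenating the binary expansions of $1, 2, 3, \ldots$ (a Champernowne-style string) gives such a universal string, since every finite bit pattern appears inside infinitely many numbers. A short sketch:

```python
def champernowne_bits(n_numbers):
    """Concatenate the binary expansions of 1..n_numbers.
    The infinite version is 'universal': every finite bit pattern occurs,
    and occurs with positive density."""
    return "".join(format(k, "b") for k in range(1, n_numbers + 1))

x = champernowne_bits(5000)     # a long finite prefix (~55,000 bits)
y = "1011001"                   # an arbitrary 7-bit pattern (binary of 89)

# Find every position t where y occurs starting at index t.
positions = [t for t in range(len(x) - len(y)) if x.startswith(y, t)]

# The pattern occurs, and occurs many times, a crude finite stand-in
# for "infinitely many t's with positive lower density".
assert len(positions) > 10
```

VT replaces this toy picture’s exact substring match with $\epsilon$-approximation of an analytic function, which is what makes the theorem remarkable.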
What VT says is essentially a modification of this simple fact about strings. The infinite string is the values of the zeta function of the form $\zeta(\tfrac{3}{4} + it)$ where $t$ runs over real numbers. The finite string $y$ is replaced by a smooth function $f$ defined on a fixed-size disc. And exact equality becomes approximation up to some $\epsilon > 0$.
What is exciting is not that this is possible but that it happens for a natural and very important particular function: the zeta function. Although universality was long known as an existential phenomenon of power series in analysis, Steuding’s notes remark (emphasis ours):
The Riemann zeta-function and its relatives are so far the only known explicit examples of universal objects.
One easy application of universality is that with positive density in the critical strip one can compute $\zeta$ to within any $\epsilon$ in time $O(1)$. This is simply because one can take $f$ to be the constant function $c$ on the disk, for any fixed nonzero constant $c$.
Well, just not $c = 0$: if $\zeta$ could approximate $0$ arbitrarily closely in a disc centered on the real part being $\tfrac{3}{4}$ then by analyticity the Riemann Hypothesis would be false. One can, however, use a constant $c$ with $|c| < \epsilon$ to stand for $0$ when $\epsilon$ is given, or handle $0$ directly using $\log \zeta$ in the strip. Perhaps what follows is better understood using $\log \zeta$ after all, which is how Karatsuba and Voronin develop the proofs in their book.
For different constants $c$ we get different sets $S_c$ of $t$ such that $\zeta$ approximates $c$ within $\epsilon$. And we get different sets $S_f$ such that $\zeta$ approximates a function $f$ within $\epsilon$. Our first question is:
How do the sets $S_c$ relate to each other? Likewise the sets $S_f$?
Most in particular, can we “add” these sets? What we mean is that there should be an effective procedure such that given specifiers for $f$ and for $g$ it outputs a specifier for $f + g$. The specifiers could just be members of the sets. Using $\log \zeta$ enables us to do this for multiplication too, though the following purpose can work with positive integers only.
Next, can we “multiply”? That is, can we find an effective procedure that given specifiers for $f$ and $g$ outputs a specifier for $f \cdot g$?
Now we can tell where this is going. We will build up arithmetical formulas and circuits for functions $f$ and $g$ using constants and $+$ and $\times$ gates. Given specifiers $t_f$ and $t_g$, we want effective procedures that yield specifiers $t_{f+g}$ and $t_{f \cdot g}$.
We will need to start with some basic functions beginning with the identity function $f(z) = z$. Our functions are formally defined on the complex disc but it is easy to represent functions on $\mathbb{N}$ over this domain. The main limitation noted by Garunkštis is that his effectivizations apply only to discs of small radius $r$, but any radius seems good for encoding discrete functions.
We certainly expect $+$ to be available. The proofs of universality all work by representing $f$ as an infinite series. One can add the terms for $f$ and $g$ to get a series for $f + g$. The multiplication case would then involve products of infinite series but perhaps they are well-behaved. The hardest challenge may be to make this work neatly when the tokens give only partial information about $f$ and $g$. We may also wish to implement some recursion operations besides $+$ and $\times$. But once we have—if we have—a rich enough set of effective operations, we can program the whole panoply of computable functions into $\zeta$. Note that these representations are already known to exist for any computable $f$; we are just talking about how easy it is to program them.
The potential payoff is, perhaps questions about whether certain complex arithmetic functions have small circuits or formulas can be attacked via the resulting constraints on the behavior of $\zeta$. Ultimately what is going on is a kind of discrete programming using the prime numbers that go into product formulas for $\zeta$. For $\log \zeta$ and arguments $s$ with $\Re(s) > 1$ there is the particularly nice summation formula
$$\log \zeta(s) = \sum_{p} \sum_{k=1}^{\infty} \frac{1}{k\, p^{ks}},$$
where $p$ ranges over all primes, which is then continued inside the strip. The point is that $\zeta$ channels a “natural” way to do the programming. Lots of tools of analysis might be available for complexity questions framed this way.
Well, this might be a pipe dream—especially since we can’t yet even tell whether $\log \zeta$ blows up in the kind of discs we are considering—but this is the week when Santa smokes his pipe preparing for a long journey and lots of children have dreams.
Does our “book of zeta” idea seem feasible to compile?
Lessons from the Park that still apply today
Iain Standen is the CEO of the Bletchley Park Trust, which is responsible for the restoration of the Park. After the war, the Park was almost completely destroyed and forgotten—partially at least for security reasons. Luckily it was just barely saved and is now a wonderful place to visit and see how such a small place helped change history.
Today I would like to report on a recent trip to Bletchley Park.
You probably know that Bletchley Park was the place where during World War II the British, with help from the Poles, were able to break the Enigma Machine. This machine, which had various models and versions, was the main one used by the Germans to encrypt and decrypt messages during the war. The ability to read their encrypted messages was invaluable to the war effort, and it is claimed that perhaps millions of lives were saved.
My wife, Kathryn Farley, along with her brother Andrew and I were in London recently. During our stay Kathryn set up a day trip to Bletchley Park, which is over an hour’s car ride from where we were staying in London. This was a tremendous experience and I would definitely suggest getting to the Park if you can.
Bletchley Park consisted of the main house and a number of “huts.” The latter were primitive buildings that were needed as the number of workers grew rapidly during the war. Huts were numbered and their number became strongly associated with the work that was done there.
We will call it the Park, but the actual names used during the war include:
Some of the huts:
I have argued before here that one of the biggest breakthroughs in modern cryptography is the treatment of messages not as a series of letters, but as a whole number. This leap from separate letters to a whole number, I believe, is instrumental in enabling researchers to create modern codes: where would we be if we still thought of messages as series of letters? How could one even think of methods based on Elliptic Curves?
Jumping back over 60 years we see that the Enigma machines indeed viewed messages as series of letters from a 26-letter alphabet. They added simple rules for punctuation: a common rule was that “See you at noon” would become “SEEXYOUXATXNOON”.
Each letter was effectively scrambled by a permutation on 26 letters. Of course the number of such permutations is $26! \approx 4 \times 10^{26}$, which is already a huge number. Yet a code that only used one permutation to encrypt messages would easily be broken. One way to see this is to realize that cryptogram puzzles occur every day in most newspapers, and you are expected to be able to solve them in minutes.
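A tiny sketch of the point: the keyspace $26!$ is astronomically large, yet a single fixed permutation preserves letter repetitions, which is exactly what a cryptogram solver exploits (the plaintext and the X-for-space rule are from the example above; the permutation here is randomly chosen):

```python
import math
import random

# The keyspace of a single-substitution cipher: 26! permutations.
assert math.factorial(26) == 403291461126605635584000000

def substitution_encrypt(msg, perm):
    """Encrypt A..Z text with one fixed permutation of the alphabet."""
    table = {chr(ord('A') + i): perm[i] for i in range(26)}
    return "".join(table[c] for c in msg)

perm = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
random.seed(1)
random.shuffle(perm)                     # a random substitution key

pt = "SEEXYOUXATXNOON"
ct = substitution_encrypt(pt, perm)

# Every occurrence of a plaintext letter maps to the SAME ciphertext
# letter, so letter-frequency structure survives encryption intact.
assert ct.count(perm[ord('O') - ord('A')]) == pt.count("O")
```

The huge keyspace is irrelevant because frequency analysis never searches the keyspace at all.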
The reason for their weakness is that messages are usually from a natural language, such as German in WWII, whose tremendous redundancy makes a single substitution code easily breakable. What the Enigma did was change the permutation used from letter to letter in a wider-ranging manner than any poly-substitution cipher had done before: The first letter used some permutation, this then was changed to a new permutation for the second letter, and so on. The actual way the Enigma machines moved from one permutation to the next was based on a clever use of mechanical wheels, called rotors. How these rotors changed the permutations is the key—bad pun—to why Enigma machines were hard to break. The Germans thought their complex motion made the machines unbreakable, but the work at Bletchley Park proved them wrong.
Here is a schematic of how current flowed through the rotors and could be changed by a key-press to turn on a light. The lighted letter was the encryption of the pressed letter. For more details see this:
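Here is a deliberately simplified single-rotor sketch of the stepping idea; it is not a faithful Enigma simulator (it omits the reflector, plugboard, and the second and third rotors), though the wiring string is that of the historical rotor I:

```python
# Wiring of historical Enigma rotor I: contact A maps to E, B to K, etc.
ROTOR = "EKMFLGDQVZNTOWYHXUSPAIBRCJ"

def encrypt(msg):
    """Single-rotor toy cipher: the rotor steps once per keypress, so the
    effective letter-permutation changes with every letter typed."""
    out = []
    for i, c in enumerate(msg):
        offset = (i + 1) % 26                         # rotor stepped i+1 times
        x = (ord(c) - ord('A') + offset) % 26         # enter at shifted contact
        y = (ord(ROTOR[x]) - ord('A') - offset) % 26  # exit, un-shifted
        out.append(chr(ord('A') + y))
    return "".join(out)

# The same plaintext letter need not encrypt the same way twice:
ct = encrypt("AAAA")
assert len(set(ct)) > 1
```

With three rotors stepping odometer-style, the machine cycles through $26^3 = 17{,}576$ distinct permutations before repeating, which is what defeated classical frequency analysis.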
The fundamental reason I think the breaking of Enigma is still interesting today, over 70 years later, is that it contains simple lessons even for modern “unbreakable” codes. Here are some of the lessons:
Key Size: The Enigma machines had a huge-size key, since the key included the choice of rotors, the rotors’ positions, the plugboard settings, and more. Any attack that was brute-force or even near brute-force was hence doomed, and even today it would fail. But of course the key size means nothing if there is an attack that avoids trying all the keys.
Operator Error: The Enigma machines were often misused in practice. The operators often violated simple rules in a way that made the security vastly lower. Examples include re-sending messages with the same key—called a depth—and using shortcuts in selecting settings of the rotors. For instance, often the rotor settings were only slightly changed from one day to the next. Note, one could argue that operator error is still happening today. It also includes poor implementations of the codes: there are attacks on “unbreakable” modern codes that work because of bad implementations, even with RSA. One of my favorites is the attack that is sometimes called The Bellcore Attack. This exploits errors in the execution of a code. Okay, I was involved in creating this attack along with Dan Boneh and Rich DeMillo.
Hidden Design Flaws: The Enigma machines had a fundamental design flaw. This flaw leaked a fair amount of information that could be, and was, exploited in the attacks. The flaw was this property of the Enigma machines:
If the letter $x$ was encoded into $y$, it is always the case that $x$ and $y$ are distinct letters.
Put another way: no letter ever encrypts to itself. This was a tremendous mistake that helped break the whole complex system. One famous example was the following: It was noted that an encrypted message had a long run that had no occurrence of the letter “L.” The only way this could have happened with any reasonable probability is if the message was a series of “L”’s. This enabled a break into that key. The operator had been testing the system and just kept repeating the same letter “L” over and over.
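This property can be checked directly. The reflector (here the historical UKW-B pairing) is a fixed-point-free involution, and routing the signal through the rotors as $E = R^{-1} \circ U \circ R$ means $E(x) = x$ would force the reflector $U$ to fix $R(x)$, which it never does. A small sketch with random rotor states:

```python
import random

# Historical reflector B: a fixed-point-free involution on A..Z.
REFLECTOR_B = "YRUHQSLDPXNGOKMIEBFZCWVJAT"

random.seed(0)
for _ in range(100):
    rotor = list(range(26))
    random.shuffle(rotor)            # a random combined rotor permutation R
    inv = [0] * 26
    for i, r in enumerate(rotor):
        inv[r] = i                   # R^-1

    for x in range(26):
        u = ord(REFLECTOR_B[rotor[x]]) - ord('A')   # U(R(x))
        e = inv[u]                                  # E(x) = R^-1(U(R(x)))
        assert e != x                # no letter ever encrypts to itself
```

The assertion never fires for any rotor state, because a fixed point of $E$ would be a fixed point of the reflector, and the reflector pairs each letter with a different one.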
Even apart from such mistakes, the system arguably leaked a small constant number of bits of information per letter. To find a key needing $b$ bits to specify, as little as $O(b)$ letters of ciphertext could suffice to determine it. Of course it still took incredible work to extract the key, but Alan Turing’s stroke of genius was how to automate the kind of “puzzle-solving” needed for such tasks.
New Technology: The Enigma machines were actually fairly secure, but the Germans did not envision that the attackers would use a machine to break their machines. These machines, called “bombes,” were critical to the success at the Park. Today the analogy might be that someone already has a working quantum computer and can break any code that depends on discrete log or on factoring. How do we know that our “unbreakable codes” have not already been cracked? Indeed.
A picture of a “bombe” at the Park. These were created to break the Enigma machines:
A curiosity: it is believed that after the war all of the bombes at the Park were destroyed. The ones on display are reproductions. Our tour guide told us that there is a folk belief that perhaps there is an original bombe hidden somewhere on the grounds of the Park. Perhaps during the on-going renovations a bombe will be discovered hidden under a floor or in a wall. Who knows—the Park may have more secrets yet to be uncovered.
I have wondered if our current codes will be looked back on one day and seen to be easy to break. What do you think?
Baku Olympiad source—note similarity to this |
Magnus Carlsen last week retained his title of World Chess Champion. His match against challenger Sergey Karjakin had finished 6–6 after twelve games at “Standard” time controls, but he prevailed 3–1 in a four-game tiebreak series at “Rapid” time controls. Each game took an hour or hour-plus under a budget of 25 minutes plus 10 extra seconds for each move played.
Today we congratulate Carlsen and give the second half of our post on large data being anomalous.
According to my “Intrinsic Performance Ratings” (IPRs), Carlsen played the tiebreak games as trenchantly as he played the standard games. I measure his IPR for them at 2835, though with wider two-sigma error bars of ±250 than the 2835 ± 135 which I measured for the twelve standard games. Karjakin, however, played the rapid games at a clip of 2315 ± 340, significantly below his mark of 2890 ± 125 for the regular match. The combined mark was 2575 ± 215, against 2865 ± 90 for the match. It must be said that of course faster chess should register lower IPR values. My preliminary study of the famous Melody Amber tournaments, whose Rapid sections had closely similar time controls, finds an overall dropoff slightly over 200 Elo points. Thus the combined mark was close to the expected 2610 based on the average of Carlsen’s 2853 rating and Karjakin’s 2772. That Carlsen beat his 2650 expectation, modulo the error bars, remains the story.
Carlsen finished the last rapid game in style. See if you can find White’s winning move—which is in fact the only move that avoids losing:
The win that mattered most, though, was on Thanksgiving Day when Carlsen tied up the standard match 5–5 with a 75-move war of attrition. The ChessGames.com site has named it the “Turkey Grinder” game. On this note we resume talking about some bones to pick over “Big Data”—via large data taken using the University at Buffalo Center for Computational Research (CCR).
If you viewed the match on the official Agon match website, you saw a slider bar giving the probability for one side or the other to win. Or rather—since draws were factored in—the number stands for the points expectation E, which is the probability of winning plus half the probability of drawing. This is computed as a function E(v) of the value v of the position from the player’s side. The beautiful fact—which we have discussed before in connection with a 2012 paper by Amir Ban—is that E(v) is an almost perfect logistic curve. Here is the plot for all available (AA) games at standard time controls in the years 2006–2015 with both players within 10 Elo points of the Elo 2000 level:
The “SF7d00” means that the chess program Stockfish 7 was run in Multi-PV mode to a variable depth between 20 and 30 ply. My scripts now balance the total number of positions searched so that endgame positions with fewer pieces are searched deeper. “LREG2” means the generalized logistic curve with two parameters. Using Wikipedia’s notation, I start with

E(v) = A + (K − A)/(1 + e^{−B·v})

and fix K = 1 − A to symmetrize. Then A is basically the chance of throwing away a completely winning game—and by symmetry, of winning a desperately lost game.
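Under these conventions the regression has just two free parameters, the asymptote A and the slope B. Here is a minimal pure-Python sketch of such a fit by grid search—the bucket data is synthetic, generated from chosen parameters rather than taken from my SF7d00 runs, and a real fit would use a proper optimizer:

```python
import math

def expectation(v, A, B):
    """Symmetrized two-parameter logistic: lower asymptote A,
    upper asymptote K = 1 - A, slope parameter B."""
    return A + (1 - 2 * A) / (1 + math.exp(-B * v))

def fit(buckets):
    """Crude weighted least-squares by grid search over (A, B);
    each bucket is a (value, y, weight) triple."""
    best = None
    for Ai in range(0, 51):            # A from 0.000 to 0.050
        A = Ai / 1000.0
        for Bi in range(1, 81):        # B from 0.05 to 4.00
            B = Bi / 20.0
            sse = sum(w * (y - expectation(v, A, B)) ** 2
                      for v, y, w in buckets)
            if best is None or sse < best[0]:
                best = (sse, A, B)
    return best[1], best[2]

# Synthetic noise-free buckets generated from A = 0.02, B = 1.5:
data = [(v / 10.0, expectation(v / 10.0, 0.02, 1.5), 100)
        for v in range(-30, 31)]
A, B = fit(data)
print(A, B)  # recovers A = 0.02, B = 1.5 exactly on this grid
```

With noise-free data the grid search lands exactly on the generating parameters; with real bucket data the weighting and the residual structure are what the sk option controls.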
Chess programs—commonly called engines—output values in discrete units of 0.01 called centipawns (cp). Internally they may have higher precision, but their outputs under the standard UCI protocol are always whole numbers of cp, which are converted to decimal for display. Engines have used various large sentinel values to stand for checkmate, though a standard convention has since emerged. I still use fixed cutoffs and divide the value axis into “slots.”
Positions of value beyond the cutoff belong to the end slots. Under a symmetry option, a position of value v goes into both the slot for v for the player to move and the slot for −v for the opponent. This is used to counteract the “drift” phenomenon discovered in this paper with my students: the player to move has a 2–3% lower expectation across all values—evidently because that player has the first opportunity to commit a game-chilling blunder.
The “b100” means that adjacent slots with fewer than 100 moves are grouped together into one “bucket” whose value is the weighted average of those slots. Larger slots are single buckets rather than divided into buckets of 100. The end slots and zero (when included) are single buckets regardless of size. Finally, the number after “sk” for “skedasticity” determines how buckets are weighted in the regression as I discuss further on.
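A simplified sketch of the b100 bucketing rule—this version sweeps left to right and ignores the special handling of the end slots and the 0.00 slot:

```python
def make_buckets(slots, min_size=100):
    """Group adjacent (value, count) slots into buckets of at least
    `min_size` moves; a bucket's value is the count-weighted average
    of its slots. Leftover slots fold into the last bucket."""
    buckets, cur = [], []
    for value, count in slots:
        cur.append((value, count))
        total = sum(c for _, c in cur)
        if total >= min_size:
            avg = sum(v * c for v, c in cur) / total
            buckets.append((avg, total))
            cur = []
    if cur and buckets:
        avg, total = buckets.pop()
        total2 = total + sum(c for _, c in cur)
        avg2 = (avg * total + sum(v * c for v, c in cur)) / total2
        buckets.append((avg2, total2))
    return buckets

# Made-up slot counts: two small slots merge, a large slot stands alone,
# and the trailing small slot folds into it.
slots = [(0.01, 40), (0.02, 70), (0.03, 150), (0.04, 30)]
print(make_buckets(slots))  # two buckets: ~(0.016, 110) and ~(0.032, 180)
```

The real runs then weight these buckets in the regression according to the sk setting.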
The y-value of a bucket is the sum of wins plus half of draws by the player enjoying the value (whose turn it might or might not be to move), divided by the size of the bucket. This is regressed to find the parameters A and B most closely giving y ≈ E(v). The slope at zero is (1 − 2A)·B/4, essentially B/4. The quantity E(1.00) gives the expectation when a pawn ahead—figuratively the handicap at odds of a pawn. Note how close this is to 70% for players rated 2000.
The fit is amazingly good—even after allowing that the R² value, so astronomically close to 1, is benefiting from the correlation between positions from the same game, many having similar values. Not only does it give the logistic relationship the status of a natural law (along lines we have discussed), but also Ban argues that chess programs must conform to it in order to maximize the predictive power of the values they output, which transmutes into playing strength. The robustness of this law is shown by this figure from the above-linked paper—being rated higher or lower than one’s opponent simply shifts the curve left or right:
This is one of several reasons why my main training set controls by limiting to games between evenly-rated players. (The plots are asymmetric in the tail because they grouped buckets going up from the low end rather than coming in from both ends as the present ones do.)
Most narrowly to our goal, the B value determines the scale by which increases in value translate into greater expectation, more directly than derived quantities like A or E(1.00). Put simplistically, if a program values a queen at 10 rather than 9, one might expect its B to adjust by a factor of 9/10. Early versions of Stockfish were notorious for their inflated scale. The goal is to put all chess programs on a common scale by mapping all their values to points expectations—and Ban’s dictum says this should be possible. By putting sundry versions of Stockfish and Komodo and Houdini (which placed 2nd to Stockfish in the just-concluded ninth TCEC championship) on the same scale as my earlier base program Rybka 3, I should be able to carry over my model’s trained equations to them in a simple and direct manner. Here is the plot for Komodo 10’s evaluations of the same 100,000+ game positions:
The fit is just as fine. The A values are small and equal within the error bars, so they can be dismissed. The B values for Komodo and Stockfish give a ratio of about 1.046. The evaluations for 70% expectation for Komodo and for Stockfish have almost the same ratio to three decimal places. So we should be able to multiply Komodo’s values by 1.046 and plug them into statistical tests derived using Stockfish, right?
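If the engines’ curves really did differ only by a rescaling of the value axis, the conversion would be trivial: multiplying the value by the ratio is the same as fitting with a proportionally larger B. A sketch using the 1.046 figure from above—the A and B numbers themselves are made up for illustration:

```python
import math

def expectation(v, A, B):
    """Symmetrized two-parameter logistic curve."""
    return A + (1 - 2 * A) / (1 + math.exp(-B * v))

RATIO = 1.046   # Komodo-to-Stockfish scale factor from the text
B_SF = 1.50     # hypothetical Stockfish slope parameter
A = 0.02        # hypothetical shared asymptote

def komodo_to_stockfish(v):
    """Convert a Komodo evaluation onto the Stockfish value scale."""
    return v * RATIO

v_k = 0.75  # a Komodo evaluation of three-quarters of a pawn
print(expectation(komodo_to_stockfish(v_k), A, B_SF))
# Equivalently, Komodo's own curve would have B_K = B_SF * RATIO:
print(expectation(v_k, A, B_SF * RATIO))
```

The two prints agree to rounding, which is exactly the “plug one into the other” hope that the later plots upset.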
The error bars on Komodo’s B, which are two-sigma (a little north of “95% confidence”), give some pause because they allow about 2% wiggle. This may seem small, but recall the also-great fit of the linear regression from (scaled) player error to Elo rating in the previous post. Under that correspondence, 2% error translates to 2 Elo points for every 100 below perfection—call that 3400. For Carlsen and Karjakin flanking 2800 that means only about 12 Elo but grows to 28 for 2000-level players. Here is a footnote on how the “bootstrap” results corroborate these error bars and another data pitfall they helped avoid.
But wait a second. This error-bar caveat is treating Komodo’s B as independent from Stockfish’s B. Surely they are completely systematically related. Thus one should just be able to plug one into the other with the conversion factor and get the same proportions everywhere, right? The data is huge, and both the logistic and ASD-to-Elo regressions this touches on have near-perfect R² and the force of natural law. At least the “wiggle” can’t possibly be worse than these error bars say, can it?
Here are side-by-side comparison graphs with Stockfish and Komodo on the same set of positions played by players within 10 Elo points of 1750.
Now the Komodo B is lower. Here is a plot of the B-values for Komodo and Stockfish over all rating levels, together with the Komodo/Stockfish ratio:
The ratio waddles between 0.96 and 1.06 with a quick jag back to parity for the 2700+ elite players. This uncertainty speaks a gap of 5 Elo points for every 100 under perfection, which makes a considerable 70-point difference for Elo-2000 players.
Well, we can try clumping the data into huger piles. I threw out data below 1600 and the 2800 endpoint—which has lots of Carlsen but currently excludes Karjakin since his 2772 is below 2780. I combined blocks of four levels at 1600–1750, 1800–1950, up to 2600–2750, and quadrupled the bucket size to match. Here is the plot for 2200–2350, with a move-weighted average of 2268:
With over 500,000 data points, mirrored to over a million, can one imagine a more perfect fit to a logistic curve? With Stockfish the R² value even prints as unity. And yet, this is arguably the worst offender in the plot of B-values over these six piles:
The point for 2600–2750 goes down. It is plotted at 2645 since there are far more 2600s than 2700s players, and it must be said that the 2400–2550 pile has its center 2488 north of 2475 because 2550 included all years whereas the 2000–2500 range starts in the year 2006. But the data point for 2200–2350 is smack in the middle of this range. Why is it so askew that neither regression line comes anywhere near the error bars for the data taken with the respective engine?
Getting a fixed value for the B ratio is vital to putting engines on a common scale that works for all players. The above is anything but—and I haven’t even told what happens when Rybka and Houdini enter the picture. It feels like the engines diverge not based on their evaluation scales alone but on differences in their values for the inferior moves that human players tend to make, differences that per the part-I post correspond almost perfectly to rating. Given Amir Ban’s stated imperative to conform any program’s values to a logistic scale in order to maximize its playing strength, and the incredible fit of such a scale at all individual rating levels, how can this be?
I get similar wonkiness when I try to tune the ratio internally in my model, for instance to equalize IPRs produced with Komodo and Stockfish versions to those based on Rybka 3. There is also an imperative to corroborate results obtained via one engine in my cheating tests by executing the same process with test data from a different engine. This has been analogized to the ‘A’ and ‘B’ samples in doping tests for cycling, though those are taken at the same time and processed with the same “lab engine.”
I had hoped—indeed expected—that a stable conversion factor would enable the desirable goal of using the same model equations for both tests. I’ve become convinced this year that instead it will need voluminous separate training on separate data for each engine and engine version. A hint of why comes from just looking at the last pair of Komodo and Stockfish plots. All runs skip the bucket for an exact 0.00 value, which by symmetry always maps to 0.50. Its absence leaves a gap in Komodo’s plot, meaning that Komodo’s neighboring nonzero values carry more weight of imbalance in the players’ prospects than do 0.01 or -0.02 etc. coming from Stockfish. The data has 48,693 values of 0.00 given by Komodo 10 against only 43,176 given by Stockfish 7, whereas Komodo has only 42,350 values in the adjacent ranges -0.10 to -0.01 and +0.01 to +0.10 (before symmetrizing) against 47,768 by Stockfish. The divergence in plot results may be amplified by the “firewall at zero” phenomenon I observed last January. The logistic curves are dandy but don’t show the cardinalities of buckets, nor other higher-moment effects.
In the meantime I’ve been using conservative ratios for the other engines relative to Rybka. For example, my IPRs computed in such manner with Komodo 10 are:
These are all 70–100 points lower than the values I gave using Rybka. Critics of the regular match games in particular might agree more with these than with my higher official numbers, but this needs to be said: when I computed the Rybka-based IPR for the aggregate of moves in all world championship matches since FIDE’s adoption of Elo ratings in 1971, and compared it with the move-weighted average of the Elo ratings of the players at the time of each match, the two figures agreed within 2 Elo points. Similarly weighting the IPRs for each match in my compendium gives almost the same accuracy.
That buttresses my particular model, but the present trouble happens before the data even gets to my model. Not even the scaling stage discussed in the last post is involved here. This throws up a raw existential question.
Much of data analytics is about “extracting the signal from the noise” when there is initially a lot of noise. Multiple layers of standard filters are applied to isolate phenomena. But here we are talking about raw data—no filters. All we have observed are the smooth linear correspondence between chess rating and average loss of position value, and the even more perfect logistic relation between position value and win/draw/loss frequency. All we did was combine these two relations. The question is:
How did I manage to extract so much noise from such nearly-perfect signals?
Can you see an explanation for this wonkiness in my large data? What caveats for big-data analytics does it speak?
The chess answer is that Carlsen played 50.Qh6+!! and Karjakin instantly resigned, seeing Kxh6 51. Rh8 mate, and that after 50…gxh6 the other Rook drives home with 51. Rxf7 mate.
Update 12/11/16: Here is a note showing what happens when all drawn games are removed. The data point for 2200–2350 is even more rogue…
[fixed point placement in last figure, added “Baku Olympiad” to first caption, some word changes, added update and acknowledgment]
Slate source |
Benjamin Franklin was the first American scientist and was sometimes called “The First American.” He also admired the American turkey, counter to our connotation of “turkey” as an awkward failure.
Today I wonder what advice Ben would give on an awkward, “frankly shocking,” situation with my large-scale chess data. This post is in two parts.
A common myth holds that Franklin advocated the turkey instead of the bald eagle for the Great Seal of the United States. In 1784, two years after the Great Seal design was approved over designs that included a bird-free one from Franklin, he wrote a letter to his daughter saying he was happy that the eagle on an emblem for Revolutionary War officers looked like a turkey. Whereas the eagle “is a Bird of bad moral Character [who] does not get his Living honestly” and “a rank coward,” the turkey is “in Comparison a much more respectable Bird, […and] though a little vain & silly, a Bird of Courage.” The Tony-winning 1969 musical 1776 cemented the myth by moving Franklin’s thoughts up eight years.
More to my point is a short letter Franklin wrote in 1747 at the height of his investigations into electricity. In his article “How Practical Was Benjamin Franklin’s Science?,” I. Bernard Cohen summarizes it as admitting “that new experimental data seemed not to accord with his principles” and quotes (with Franklin’s emphasis):
“In going on with these Experiments, how many pretty systems do we build, which we soon find ourselves oblig’d to destroy! If there is no other Use discovered of Electricity, this, however, is something considerable, that it may help to make a vain Man humble.”
My problem, however, is that the humbling from data comes before stages of constructing my system. Cohen moves on to the conclusion of a 1749 followup letter and ascribes it to Franklin’s self-deprecating humor:
“Chagrined a little that we have been hitherto able to produce nothing in this way of use to mankind; [yet in prospect:] A turkey is to be killed for our dinner by the electrical shock, and roasted by the electrical jack, before a fire kindled by the electrified bottle: when the healths of all the famous electricians in England, Holland, France, and Germany are to be drank in electrified bumpers, under the discharge of guns from the electrical battery. ”
My backbone data set comprises all available games compiled by the ChessBase company in which both players were within 10 points of the same century or half-century mark in the Elo rating system. Elo ratings, as maintained by the World Chess Federation (FIDE) since 1971, range from Magnus Carlsen’s current 2853 down to 1000 which is typical of novice tournament players. National federations including the USCF track ratings below 1000 and may have their own scales. The rating depends only on results of games and so can be applied to any sport; the FiveThirtyEight website is currently using Elo ratings to predict NFL football games. Prediction depends only on the difference in ratings, not the absolute numbers—thus FiveThirtyEight’s use of a range centered on 1500 does not mean NFL teams are inferior to chess players. The linchpin is that a difference of 200 confers about a 75% expectation for the stronger player.
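That linchpin follows from the standard logistic Elo formula, which depends only on the rating difference; a quick sketch:

```python
def elo_expectation(diff):
    """Standard Elo formula: points expectation for a player rated
    `diff` points above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

print(round(elo_expectation(200), 3))  # ~0.76, the "75%" linchpin
print(round(elo_expectation(81), 3))   # ~0.61, Carlsen's edge over Karjakin
```

The same formula applies unchanged whether the ratings are centered near 1500 for NFL teams or near 2800 for world championship contenders.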
The difference of 81 to Sergey Karjakin’s 2772 gave Carlsen about a 61% points expectation, which FiveThirtyEight translated into an 88% chance of winning the match. This assumed that tiebreaks after a 6-6 tie—the situation we have today—would be a coinflip.
A chief goal of my work—besides testing allegations of human players cheating with computers during games—is to measure skill by analyzing the quality of a player’s moves directly rather than only by results of games. A top-level player may play only 100 games in a given year, a tiny sample, but those games will furnish on the order of 3,000 moves—excepting early “book” opening moves and positions where the game is all-but-over—which is a good sample. The 12 match games gave me an even better ratio since several games were long and tough: 517 moves for each player. My current model assesses Karjakin’s level of play in these games at 2890 ± 125, Carlsen’s at 2835 ± 135, with a combined level of 2865 ± 90 over 1,034 moves. The two-sigma error bars ward against concluding that Karjakin has outplayed Carlsen, but they do allow that Karjakin brought his “A-game” to New York and has played tough despite being on the ropes in games 3 and 4. No prediction for today’s faster-paced tiebreak games is ventured. (As we post, Carlsen missed wins in the second of four “Rapid” paced games; they are playing the third now still all-square.)
These figures are based on my earlier training sets from the years 2006–2013 on Elo century points 2000 through 2700, in which I analyzed positions using the former-champion Rybka 3 chess program. Rybka 3 is now far excelled by today’s two champion programs, called Komodo and Stockfish. I have more than quadrupled the data by adding the half-century marks and using all years since 1971, except that the range 2000-to-2500, with by far the most published games, uses the years 2006–2015. In all it has 2,926,802 positions over 48,416 games. The milepost radius is widened from 10 to 15 Elo points for the levels 1500–1750 and 2750, to 20 for 1400–1450 and 2800, and to 25 for 1050–1350. All levels have at least 20,000 positions except 1050–1150, while 2050–2300 and 2400 have over 100,000 positions each and 2550 (which was extended over all years) has 203,425. All data was taken using the University at Buffalo Center for Computational Research (CCR).
One factor that goes into my “Intrinsic Performance Ratings” is the aggregate error from moves the computer judges were inferior. All major programs—called engines—output values in discrete units of 0.01 called centipawns. For instance, a move value of +0.48 leaves the player figuratively almost half a pawn ahead, while -0.27 means a slight disadvantage. If the former move is optimal but the player makes the latter move, the raw difference is 0.75. Different engines have their own scales—even human chess authorities differ on whether to count a Queen as 9 or 10—and the problem of finding a common scale is the heart of my turkey.
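The aggregation itself is simple; a sketch with made-up move values (only the 0.48/−0.27 pair comes from the example above):

```python
def average_raw_difference(moves):
    """Average of (best value - played value) over analyzed moves.
    Each move is (value_of_best_move, value_of_played_move), both in
    pawns from the player's side. The data here is illustrative."""
    diffs = [best - played for best, played in moves]
    return sum(diffs) / len(diffs)

moves = [(0.48, -0.27),  # the example from the text: raw difference 0.75
         (0.10, 0.10),   # played the optimal move: difference 0.00
         (1.25, 1.05)]   # a small inaccuracy: difference 0.20
print(average_raw_difference(moves))
```

The catch, as the next plots show, is that this raw average is only meaningful relative to one engine’s value scale.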
Here are my plots of average raw difference (AD) over all of my thirty-six rating mileposts with the official Komodo 10.0 and Stockfish 7 versions. Linear regression, weighted by the number of moves for each milepost, was done from AD to Elo, so that the rating of zero error shows as the y-intercept.
The R² values mean these are fantastic fits—although some “noise” is evident below the 1900 level and more below 1600, it straddles the fit line well until the bottom. Although the range for Elo 2000 through 2500 is limited to years after 2006, there is no discontinuity with neighboring levels which include all years. This adds to my other evidence against any significant “rating inflation”—apart from a small effect explainable by faster “standard” time controls since the mid-1990s, the quotient from rating to intrinsic quality of play has remained remarkably stable.
The scales between Komodo and Stockfish are also quite close. I am using Stockfish as baseline since it is open-source; to bring Komodo onto its scale here suggests multiplying its values by a fixed factor close to unity. The first portent of trouble from higher moments, however, comes from the error-bar ratio being tangibly different.
Komodo 10 and Stockfish 7 agree within their error bars on the -intercept, but both place the “rating of perfect play” at most 3200. This is markedly below their published ratings of 3337 and 3354 on the CCRL rating list for the 64-bit single-core versions which I used. This is for “semi-rapid” chess but is meant to carry over to standard time controls. The ratings of slightly later versions on TCEC for standard time controls are both about 3230. This is the source of my quip on the Game 7 broadcast about the appearance of “computers being rated higher than ‘God’.”
A second issue is shown by graphing the average raw error as a function of the overall position value (i.e., the value of an optimal move) judged by the program, for players at any one rating level. Here they are with Stockfish 7 for the levels 1400, 1800, 2200, and 2600 (a few low-weight high outliers near the -4.0 limit have been scrubbed):
If taken at face value, this would say e.g. that 2600-level players, strong grandmasters, play twice as badly (0.12 error) when they are 0.75 ahead as when the game is even (0.06). Tamal Biswas and I found evidence against a claim that this effect is rational. Hence it is a second problem.
What has most immediately distinguished mine from others’ work since 2008 is that I correct for this effect by scaling the raw errors. My scaling function applies locally to each move, using only its value and the overall position value of the best move. I regard it as important that the function is “oblivious” to any regression information hinting at the overall level of play. Here are the results of applying it for Stockfish at the Elo 1800 and 2200 levels:
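My actual scaling function is specified in my papers; purely to illustrate the flavor, here is a hypothetical damping metric—not the published function—that integrates 1/(1+|z|) between the two values, so that a fixed raw error counts for less the further the position is from equality while errors in equal positions keep their raw size:

```python
import math

def scaled_difference(best, played):
    """Illustrative damping metric, NOT the published scaling function:
    integrate 1/(1+|z|) from the played value to the best value. At
    equality the integrand is 1, so the unit is 'pawns in equal
    positions' in spirit."""
    def F(v):
        # antiderivative of 1/(1+|z|), odd in v
        return math.copysign(math.log1p(abs(v)), v)
    return abs(F(best) - F(played))

print(scaled_difference(0.10, 0.00))  # near equality: close to the raw 0.10
print(scaled_difference(2.10, 2.00))  # same raw error while well ahead: much smaller
```

The design point is locality: the metric sees only the two values for this one move, nothing about the player or the rest of the game.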
The plot for Elo 2050 is almost perfect, while the flattening remains good, especially on the positive side, throughout the range 1600–2500 which has almost all the data. I call the modified error metric ASD for average scaled difference, which is in units of “PEPs” for “pawns in equal positions,” since the correction metric has value 1 at evaluation 0.00. Details are as in my papers, except that upon seeing a “firewall” effect become more glaring with larger data, I altered the function’s coefficients between the positive and negative cases. Here are the resulting plots of ASD versus Elo with Komodo 10 and Stockfish 7:
The fits are even better, with the previous “noise” at lower Elo levels optically much reduced. The y-intercepts are now near 3400. This removes the previous conflict with ratings of computers but still leaves little headroom for improving them—an issue I discussed a year ago.
Most to the point, however, is that choosing to do scaling made a hugely significant difference in the intercept. The scaling is well-motivated, but the AD-to-Elo fit was already great without it. I could have stopped and said my evidence—from very large data—pegs perfection at 3200. This prompts us to ask:
How often does it happen that data-driven results have such degree of hidden arbitrariness?
I am sure Ben Franklin would have had some wise words about this. But we haven’t even gotten to the real issue yet.
What lessons for “Big Data” are developing here? To be continued after the match playoff ends…
[made figure labels more consistent with text, updated data size figures, added acknowledgment]
src |
Ken and I wish to thank all who read and follow us. May you have a wonderful day today all day.
But we would like to pose a basic question about teaching complexity theory: Theorems vs. Proofs.
Because today is a national holiday in the US, I am not teaching my class on complexity theory, nor is Ken teaching his. I like the class, but I do enjoy the time off from lecturing. Still it seems like a time to reflect on a simple question about teaching.
Today is, of course in the US, Thanksgiving Day. We watch parades, really mainly the Macy’s Thanksgiving Day Parade; we watch football, that is NFL style football; and we watch our waist-lines expand as we eat too much wonderful food—my favorite is the turkey, covered in gravy and served with mashed potatoes.
So while you are enjoying your day let Ken and me ask you a simple question.
What we are interested in is this: Is it as important to know the statement of a theorem as it is to know the proof of the theorem?
I think when teaching we almost always follow the same paradigm: state the theorem, then present its proof, and then perhaps give applications.
Thus our question is: Can we skip presenting the proof? Do students still learn something important if they know the statement only of a theorem, but do not learn the proof—or even an outline of a proof? I have wondered over the years of teaching, especially a course like complexity theory, whether we must give both theorem statements and proofs.
There are of course many situations in math where we know the theorem but not the proof. Perhaps the most famous example is the classification of finite simple groups. This theorem gets used by theory papers, but I believe that almost no one applying it knows the proof. You could argue that this is an extreme example, but there are many others that come to mind: the famous regularity theorem of Endre Szemerédi can, I believe, be used without knowing the proof. As an extreme measure I have wondered whether it would be worth it to increase the material I present in class by proving only a small subset of the theorems.
I (Ken) am teaching our graduate theory of computation course. This course was until recently required of all PhD students. I still teach it for non-specialists and with emphasis on how to craft a technical argument and write an essay answer—skills for thesis writing in general.
I present some proofs in full and skip or “handwave” others. My full proofs highlight algorithmic ideas and logical structure. For instance, I explain how the proof that nondeterministic space is contained in deterministic time embodies breadth-first search, while nondeterministic time being in deterministic space can be treated as depth-first search. I fold together the proofs of the deterministic space and time hierarchy theorems while diagramming the offline universal simulation they embody. In proving the PSPACE-completeness of TQBF I highlight how re-using variables makes a double-branch recursion into a single branch, and state what I call a “modified proverb of Lao-zi”:
A journey of a thousand miles has a step that is exactly 500 miles from the beginning and 500 miles from the end.
I skip, however, most of the proof of the simulation of a k-tape Turing machine M by a two-tape oblivious Turing machine M'. What I show is the division of the first tape into blocks of cells and the following sequence of “jags”: each “jag” for a number j begins at cell 0, goes to cell +j, then crosses 0 on the way to cell −j, and returns to 0. I explain that each jag simulates one step of M, and finally show or state that the total number of steps by M' up through the t-th jag is O(t log t).
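One standard way to realize such a schedule of jags—the exact bookkeeping in the full oblivious simulation differs, so take this only as showing the schedule’s shape—is the “ruler” sequence, where the t-th jag has length equal to the largest power of 2 dividing t:

```python
def jag_length(t):
    """Length of the t-th jag (t >= 1): the largest power of 2 dividing t,
    giving the 'ruler' sequence 1, 2, 1, 4, 1, 2, 1, 8, ..."""
    return t & -t  # lowest set bit of t

def total_steps(t, c=4):
    """A jag of length j costs about c*j head moves (0 -> +j -> -j -> 0),
    so c = 4 is the natural constant for the sweep itself."""
    return sum(c * jag_length(i) for i in range(1, t + 1))

print([jag_length(i) for i in range(1, 9)])  # [1, 2, 1, 4, 1, 2, 1, 8]
# Total cost grows as O(t log t): per step, about (c/2) * log2(t).
print(total_steps(1024) / (1024 * 10))  # 2.4, with log2(1024) = 10
```

Half the jags have length 1, a quarter have length 2, and so on, which is exactly what makes the per-step overhead logarithmic.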
I prove the theorem of Walter Savitch that nondeterministic space s(n) is contained in deterministic space O(s(n)^2), but only state the Immerman–Szelepcsényi theorem that nondeterministic space is closed under complements. That proof I would reserve for an advanced graduate course. Overall I like to highlight a “message” in each proof, such as “software can be efficiently burned into hardware” for the simulation of Turing machines by circuits. This sets up the circuit-based version of the NP-completeness of SAT, which illustrates formal verification of hardware, and subsequent NP-completeness theorems as showing how many combinatorial mechanisms embody formal logic in turn.
Enjoy today. If you have a moment between watching the games and eating and other activities please let us know about your thoughts on theorems vs. proofs.
Dick and I will be on Sunday’s game telecast
Business Insider source |
Magnus Carlsen of Norway and Sergey Karjakin of Russia are midway through their world championship match in New York City. The match is organized by Agon Limited in partnership with the World Chess Federation (FIDE).
Tomorrow, Sunday—early today as I post—at 2pm ET is Game 7 with the match all square after six hard-fought draws. Dick and I are in New York City and will be on the telecast streamed by the sponsoring website, WorldChess.com. A one-time $15 charge brings access to that and all remaining games.
The match is being covered by major media. The movie documentary “Carlsen” opened yesterday. I was also struck by game-by-game coverage on the FiveThirtyEight website, including a post Saturday titled, “Are Computers Draining the Beauty Out Of Chess?”
With us on the gamecast will be Murray Campbell of IBM Watson. He was one of the creators of the machine Deep Blue, which famously defeated Garry Kasparov in early 1997. Since then no human player has battled a computer on even terms, while both software and hardware have improved to the point that Kasparov would probably lose to his phone. That is why I have helped draft rules against smartphones in tournament halls and much else as an official consultant of a FIDE Commission to combat cheating, whose chair Israel Gelfer shared lunch with Dick and Kathryn Farley and me earlier today. I will be wearing my deep-blue dress shirt in tribute.
The players occupy a cubicle behind a partition from the main audience. Ever since the 2006 championship reunification match in which Veselin Topalov accused Vladimir Kramnik of getting computer help, and mindful of past whispers about signals, FIDE has reserved the option of forestalling any possible audience input. Cameras show the on-board action. Expert commentators give running analysis for those onsite and the Internet audience. The broadcast team is anchored by Judit Polgar, who in 2005 was the first woman to compete in a round-robin tournament for the FIDE world title, and television journalist Kaja Snare, who previously worked for Norway’s TV2 network.
The games start at 2pm. Each player has a budget of 100 minutes for the first 40 moves plus 30 seconds “increment” after each move played, so four hours may elapse before the game reaches move 40. Then 50 minutes plus the increment are allotted until move 60, then a final 15 minutes plus the increment for the rest of the game. Although 40 is a typical game length, the six draws have averaged 55 moves per game. Games 3 and 4 saw Karjakin hold out for 78 and 92 moves in positions that at times were desperate. Those games were said to have kept Norwegian government ministers up until 3am and slowed the country. Friday’s game, however, finished early—and it must be said as caveat that a short game could cut the time for any of us on the broadcast.
Carlsen is rated 2853 on the Elo rating system, which is 2 points above the record high previously held by Kasparov but about 30 below Carlsen’s own peak. Karjakin is at 2772, which makes him a slight but definite underdog. Arpad Elo designed his rating system in 1960 for the United States Chess Federation and it was adopted by FIDE in 1970. Only relative numbers matter: a linchpin is that a 200-point difference reflects and predicts the stronger player taking about 75% of the points.
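That linchpin can be checked directly. In the logistic form of the Elo expectation (the function name here is mine, for illustration), a d-point rating edge translates into an expected fraction of the points of 1/(1 + 10^(−d/400)):

```python
def elo_expected(d: float) -> float:
    """Expected score for the player with a d-point rating edge,
    using the logistic Elo curve."""
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

print(round(elo_expected(200), 3))  # a 200-point gap -> 0.76
print(round(elo_expected(2853 - 2772), 3))  # Carlsen's 81-point edge over Karjakin
```

Carlsen’s 81-point edge predicts him scoring a bit over 61% of the points in a long series, which squares with “slight but definite underdog.”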
The change in one’s Elo rating after a tournament or match depends only on one’s win-draw-loss record and the ratings of one’s opponents. This simplicity makes it easily adaptable to other sports, and FiveThirtyEight uses Elo for their in-house predictions of football games and baseball series among other games. My own work, however, gauges a player’s performance on the Elo scale directly by analysis of the moves he or she played—within a deeper analysis of the moves not played. On that scale I have Carlsen and Karjakin playing dead-even at a very high level, though with considerable 95% confidence error bars:
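The update itself fits in a few lines (a sketch; the helper name is mine, and K = 10 is FIDE’s K-factor for elite players):

```python
def elo_update(rating: float, opp: float, score: float, k: float = 10.0) -> float:
    """New rating after scoring `score` points (1 win, 0.5 draw, 0 loss)
    against an opponent rated `opp`; depends only on score vs. expectation."""
    expected = 1.0 / (1.0 + 10.0 ** ((opp - rating) / 400.0))
    return rating + k * (score - expected)

# A draw between equally rated players changes nothing:
print(elo_update(2850.0, 2850.0, 0.5))  # -> 2850.0
```

Because the formula never looks at which moves were played, only results, it ports cleanly to football, baseball, or anything else with head-to-head outcomes.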
Carlsen 2880 +- 165; Karjakin 2875 +- 170.
This is reflected also in a less-intensive “screening run” I have devised for quick assessment of large tournaments. It produces a value I call ROI for “Raw Outlier Index” on a 0–100 scale, where 50 is the expected agreement with a particular computer program given one’s rating. My tests using the Stockfish 7 and Komodo 10.2 programs both give the players a combined ROI of 51, with Stockfish giving them 51 apiece. I look forward to explaining how one can design a model that gets things yea-close.
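My actual ROI computation is more involved, but the idea can be conveyed by a toy version (everything here, from the function to the binomial error bar to the 10-points-per-sigma scaling, is a stand-in for illustration, not the real formula): standardize the raw engine-matching rate against the rate expected for the player’s rating, and center the result at 50:

```python
from math import sqrt

def roi_sketch(matches: int, moves: int, expected_rate: float) -> float:
    """Hypothetical outlier index: 50 means matching the engine exactly as
    often as a player of one's rating is expected to; each standard
    deviation of difference moves the index by 10 points (an assumed scale)."""
    observed = matches / moves
    sigma = sqrt(expected_rate * (1 - expected_rate) / moves)  # binomial s.d.
    z = (observed - expected_rate) / sigma
    return 50 + 10 * z

# Matching 290 of 500 moves when 56% agreement is expected for the rating:
print(round(roi_sketch(290, 500, 0.56), 1))  # -> 59.0
```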
Who will win? Will either one win tomorrow’s game? We welcome you to catch the action.
Update 11/21: As it happened, the game was drawn 15 minutes into the segment where I was appearing—just when I was affirming an opinion by Judit Polgar about human-computer teamwork by pointing to my joint results on “Freestyle” chess. Dick and Murray did not appear. It was still a great experience.
Update 11/22: Karjakin sensationally won yesterday’s game to take a 4.5–3.5 lead with 4 games to play. Updated IPR figures: Karjakin 2845 +- 160, Carlsen 2760 +- 180.
Update 11/25: Carlsen evened the match to stand 5-5 after 10 games. IPRs are now Carlsen 2825 +- 145, Karjakin 2875 +- 135, combined 2850 +- 100.
Nate Silver has gone out on a limb. Four years ago we posted on how the forecast of his team at FiveThirtyEight jibed with polls and forecasts by other poll aggregators. This year there is no jibe.
Today, Election Day in the USA, we discuss the state of those stating the state of the election.
FiveThirtyEight has the election much closer than most of the other forecasters do. But Silver is no “nut”—last election, in 2012, he was right about the winner of all 50 states and the District of Columbia.
As of their Tuesday morning update, they gave Donald Trump almost a 30% chance of winning, against 70% for Hillary Clinton. For contrast, the Princeton Election Consortium site of Sam Wang and Julian Zelizer has had Clinton over 99% probability in both its “random drift” and “Bayesian” measures, and the Huffington Post gave her 98.2%. Nate Cohn’s New York Times Upshot model put Trump with a 16% chance, but that is still only half what FiveThirtyEight has been giving him. The next-higher numbers in forecasts compared here gave Trump 12% and 11%. Senate forecasts have had similar disparity.
This past weekend, Silver was called out by Ryan Grim in a Huffington Post article titled, “Nate Silver Is Unskewing Polls—All Of Them—In Trump’s Direction.” The term “unskewing polls” means altering assumptions about the makeup of polling samples to correct perceived bias. In 2012 the complaints of bias in the data used by Silver came mainly from the Republican side and were proved wrong by the results. This year the thunder about numbers seems all on the left.
The main difference cited by Silver is the higher number of voters telling pollsters they are undecided or supporting third-party candidates compared to 2012. There is also greater uncertainty about the effects of news developments such as releases by Wikileaks, the FBI investigation into Clinton’s e-mail server, Obamacare premium hikes, and scandalous past behavior by various people.
There have also been greater movements in polls. Here is the graph of Silver’s forecasts from 2012, when FiveThirtyEight was a blog of the New York Times:
The one counter-trend came after Barack Obama’s poor performance in the first debate with Mitt Romney. There is no evidence that Hurricane Sandy had any effect at the end of October 2012. Now here is the current graph of FiveThirtyEight’s odds over the past few months:
The first sharp movement was registered the week after FBI Director James Comey’s July 5 press conference characterizing Clinton’s e-mail use as “reckless” but not indictable. That brought FiveThirtyEight’s model to parity on July 30, two days after the end of the Democratic convention, but polls completed the next week shot back and continued amid Trump’s unseemly tangling with Khizr and Ghazala Khan. A long trend back to parity, perhaps accelerated by Clinton’s “bad weekend” of Sept. 9–11, bounced again following the first debate on Sept. 26th. The past four weeks have seen a rounding turn into a slide correlated with the Oct. 28 FBI letter re-opening the e-mail investigation of Clinton, and just in the past two days a 7-point jag. The New York Times shows similar movements but not as sharp:
Others have similar graphs. The inputs to these aggregate models are the polls, and by and large the polls have shown similar movements. Hence I think the key this time is not unskewing the polls but rather unskewing the electorate.
I’ve been musing on the possible relevance of freighted phenomena I’ve found while extending my chess model since spring. Heretofore I’ve focused on projecting the best moves; now I want to refine accurate projections for all the moves in a given position. Doing so will confer authority on statistical tests for whole categories of moves—such as captures, moves with Knights, moves that advance or retreat, and moves within a given range of inferiority.
A year ago I reported on work with my student Tamal Biswas, who is now on the University of Rochester faculty after defending his dissertation in July, on implementing a parameter for “depth of thinking.” Computer chess programs all work in rounds of increasing depth of search, and this furnishes an axis of time for human players thinking in the same positions.
Our papers linked from that post show that swings in a program’s value for a given move as the search progresses correlate mightily with the frequency of the human players choosing (or having chosen) that move. For instance, we noted that even for the world’s best players, the frequency can range from 30% to 70% depending only on a numerical measure of the swing formulated by Tamal, with the ultimate value of the move in relation to values of alternative moves being held equal. The swing measure also perfectly numerically explains a puzzling “law” which I posted about four years ago.
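Tamal’s swing measure is defined precisely in those papers; as a rough stand-in for readers (the averaging below is only an illustration, not his formula), one can take the values a program assigns a move at successive search depths and average their excess over the final, deepest value:

```python
def swing_sketch(evals_by_depth):
    """Illustrative swing statistic: average amount by which a move's
    value at intermediate search depths exceeds its final (deepest) value.
    Positive swing = the move looked better early than it proved to be."""
    final = evals_by_depth[-1]
    return sum(v - final for v in evals_by_depth[:-1]) / (len(evals_by_depth) - 1)

# A move (values in pawns) that spiked in value mid-search, then fell back:
print(swing_sketch([0.10, 0.40, 0.55, 0.30, 0.20]))  # positive swing
```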
Last year’s post, however, also reported extreme difficulties with modeling a depth-of-thinking parameter directly. Hence we’re trying a simpler tack of fitting a multiplier h on the swing quantity. The ‘h’ is for “heave” by analogy with a ship riding above or below the water line. My usage is not quite “nautically correct”: a ship will heave to for stability in wavy seas, whereas my h measures the tendency to be carried away by them. But my modeling supports the following interpretation:
A value h > 1 means that the player(s) are influenced more strongly by swings in values than by the ultimate objective values themselves.
Where previously I had a term relating the difference in value between a move and the machine’s best move to my model’s “sensitivity” parameter s, now I have analogous terms involving the heave multiplier h as well. The swing measure is formulated as an average of values over all depths of search, so I am confident that its units support the interpretation. There are further wrinkles according to whether the overall position value and/or the swing values are negative, and they are all immersed in only-halfway-better forms of the above-mentioned fitting difficulties, so anything I say now is preliminary. But what I am seeing seems consistent enough to report the following:
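To convey the flavor in code (a schematic stand-in, not my fitted model: the functional form, parameter values, and names are simplifications for this post), here is a share model in which each move’s inferiority delta is offset by h times its swing, so that with h > 1 a high-swing but objectively inferior move can pull nearly even with the engine’s best:

```python
from math import exp

def move_shares(deltas, swings, s=0.08, c=2.0, h=1.7):
    """Schematic stand-in (NOT the fitted model): weight each move by
    exp(-(max(delta - h*swing, 0)/s)**c), where delta is the move's
    inferiority to the engine's best move and swing is its upward
    value-swing during the search; h > 1 means swings can outweigh
    the objective values themselves."""
    w = [exp(-(max(d - h * sw, 0.0) / s) ** c) for d, sw in zip(deltas, swings)]
    total = sum(w)
    return [x / total for x in w]

# Engine's best move (delta 0, no swing) vs. a move 0.15 pawns inferior
# whose value swung upward by 0.08 during the search:
print([round(p, 2) for p in move_shares([0.00, 0.15], [0.00, 0.08], h=1.0)])  # -> [0.68, 0.32]
print([round(p, 2) for p in move_shares([0.00, 0.15], [0.00, 0.08], h=1.7)])  # -> [0.51, 0.49]
```

With h = 1.0 the objective gap dominates; with h = 1.7 the swing nearly erases it, which is the “carried away by the waves” effect.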
For chess players of all Elo ratings from novice levels 1050, 1100, 1150, 1200, … to the world championship standard of 2800, the h values are by-and-large all in the range 1.3 to 2.3, and concentrated in 1.5 to 2.0.
I can’t even yet say that I have a regular progression by rating, even though outside the levels 2000 through 2500 (which are most heavily populated among the millions of anthologized games), my training sets have all available games between players at each level (within 10 to 25 Elo points, depending on the level). These give tens to hundreds of thousands of data points for each level, all taken using the University at Buffalo Center for Computational Research (CCR).
My original model has neatly linear progressions in the sensitivity parameter s and in a second “consistency” parameter c. A second indication that the “high-heave” phenomenon is real is that the three-parameter fits which I obtained in August make the s progression steeper and throw the c progression into retrograde as a damper. This unwelcome latter fact is a prime reason for tinkering further, besides the fitting landscape being no longer benign.
Thus I believe my model is currently being mathematically inconvenienced by people’s tendency to play moves on impulse and react to (changes in) trend. The swing measure ticks up when a move suddenly looks better at one search depth than it did at the depth before. Results in the papers with Tamal so far support the idea that humans considering such moves experience a corresponding uptick in their estimation. From my own games I recall times I’ve played a move when it suddenly “improved,” then regretted not thinking more on whether it was really better than alternatives.
To repeat, the chess work has not yet reached the point of fully substantiating the effect of swings in value. It is however enough to make me wonder when I see things like FiveThirtyEight’s graph of the race for party control of the Senate:
Are respondents being influenced more strongly by “political weather” than by a prior valuation of their candidates? Note especially the inflection after Comey’s Oct. 28 letter.
The polls are still open in many places as we post, and we have much less idea than we thought four and eight years ago of how things will shake out. Even after all votes are counted it may be hard to tell whether Silver was closer than the others. A strong Clinton win could be carried by the last-day upswing noted in FiveThirtyEight’s graph above, noting also its absence in the Senate graph. Quite apart from the possibility that the election might not even be decided by tomorrow, to judge by the squeaker in 2000, it will certainly take a long time to parse and “unskew” the election results.
How will we analyze the results of this election? And of course, who will win?
Update 11/9: As it shook out, Silver was merely the least wrong. The USC Dornsife / LA Times poll was distinctive in showing Trump ahead most of the time:
Likewise the Investor’s Business Daily / TechnoMetrica Market Intelligence poll. But even these need to be squared with Clinton’s evidently winning the popular vote. Update 11/10: Silver has a new article showing the effect of a 2% swing, meaning Trump’s share down 1% and Clinton’s up 1%.