Oded Goldreich is one of the top researchers in cryptography, randomness, and complexity theory.
Today Ken and I wish to thank the Knuth Prize Committee for selecting Oded as the winner of the 2017 Knuth Prize.
It is no doubt a wonderful choice, a choice that rewards many great results, and a choice that is terrific. Congrats to Oded. This year the choice was only announced to the general public at the last minute. Ken and I at GLL got an encrypted message that allowed us to figure it out ahead of time. The message was: YXWX APRN LKW CRTLK DHPFW. The encryption method is based on a code with a vast number of possible keys, and so was almost unbreakable. But we did it.
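For fun, the decoding can be scripted. This is our own sketch, assuming a simple monoalphabetic substitution; the partial key below is one mapping consistent with the ciphertext, not anything supplied with the message.

```python
import math

# Partial key for an assumed monoalphabetic substitution cipher. A full key
# would be a permutation of all 26 letters, so there are 26! ~ 4e26 keys.
KEY = {'Y': 'O', 'X': 'D', 'W': 'E', 'A': 'W', 'P': 'I', 'R': 'N',
       'N': 'S', 'L': 'T', 'K': 'H', 'C': 'K', 'T': 'U', 'D': 'P',
       'H': 'R', 'F': 'Z'}

def decrypt(ciphertext: str) -> str:
    # Letters outside the partial key (and spaces) pass through unchanged.
    return ''.join(KEY.get(ch, ch) for ch in ciphertext)

print(decrypt("YXWX APRN LKW CRTLK DHPFW"))  # ODED WINS THE KNUTH PRIZE
print(f"{math.factorial(26):.2e} possible keys")
```

Brute-forcing all keys is hopeless, which is why such ciphers are broken by frequency analysis and crib words instead.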
Oded gave his talk last night to a filled ballroom: one of the perks of winning the Knuth Prize. I had sent him congrats as soon as I heard he had won and added that I looked forward to his talk. He answered essentially “thanks for increasing the pressure on me.” I know he was kidding since he always gives great talks.
I just heard the talk, and he delivered, with his usual mixture of fun and seriousness. The talk had two parts. The first started with some apologies.
He added some wonderful comments like: “I had some jokes but I forgot them.” This brought the house down—we theory people just love diagonal arguments.
This part continued with some interesting comments on the nature of Theory. Some of it was advice to junior members and some advice to senior members. My favorites were:
I like these suggestions very much. I have more than once been on the receiving end of “but it is so simple.” I would like to think that I rarely have said that to someone else.
Oded then moved on to the technical part of his talk. I personally liked the first part very much and would have loved to hear more of his comments of this nature.
But Oded wanted to use this talk to also highlight some very interesting new results on proof systems. Here he spoke about On Doubly-Efficient Interactive Proof Systems. He introduced the idea by using the movie When Night Is Falling. It is a Canadian film from 1995 involving Petra and Camille. My wife, Kathryn Farley, who was sitting next to me during the talk, immediately whispered to me: “what a wonderful movie” as soon as Oded put a picture of Petra and Camille on the screen. We all have our own expertise.
A proof system is called doubly-efficient if the prescribed prover strategy can be implemented in polynomial-time and the verifier’s strategy can be implemented in almost-linear-time. See here for a paper on the subject, joint with Guy Rothblum. I think we will report on this material in more detail in the future, but here is part of their abstract:
A proof system is called doubly-efficient if the prescribed prover strategy can be implemented in polynomial-time and the verifier’s strategy can be implemented in almost-linear-time. We present direct constructions of doubly-efficient interactive proof systems for problems in P that are believed to have relatively high complexity. Specifically, such constructions are presented for t-CLIQUE and t-SUM. In addition, we present a generic construction of such proof systems for a natural class that contains both problems and is in NC (and also in SC).
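The constructions in the paper are beyond a quick snippet, but Freivalds’ classic randomized check of matrix multiplication—an older and different technique—conveys the doubly-efficient flavor: the prover does the heavy computation, and the verifier checks the claim in time near-linear in the input.

```python
import random

# Freivalds' check: the "prover" spends cubic time computing C = A*B, while
# the verifier spends only O(n^2) time per round checking the claim A*B = C.
random.seed(42)

def mat_vec(M, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

def freivalds_verify(A, B, C, rounds=20):
    n = len(A)
    for _ in range(rounds):
        r = [random.randint(0, 1) for _ in range(n)]
        # Check A(Br) == Cr using only matrix-vector products.
        if mat_vec(A, mat_vec(B, r)) != mat_vec(C, r):
            return False          # caught a false claim
    return True                   # accepts a wrong C with prob. <= 2**-rounds

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(freivalds_verify(A, B, [[19, 22], [43, 50]]))  # True: correct product
print(freivalds_verify(A, B, [[19, 22], [43, 51]]))  # False (w.h.p.)
```

The asymmetry—cubic work to produce the claim, quadratic work to verify it—is the shape that doubly-efficient interactive proofs generalize to problems in P.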
Again congrats to Oded. Any thoughts on how the message to Ken and me was encoded?
Results of the panel at the Theory Fest
Géraud Sénizergues proved in 1997 that equivalence of deterministic pushdown automata (DPDAs) is decidable. Solving this decades-open problem won him the 2002 Gödel Prize.
Today Ken and I want to ponder how theory of computing (TOC) has changed over the years and where it is headed.
Of course we have some idea of how it has changed over the years, since we both have worked in TOC for decades, but the future is a bit more difficult to tell. Actually the future is also safer: people may feel left out and disagree about the past, but the future is yet to happen so who could be left out?
For example, we might represent the past by the following table of basic decision problems involving automata such as one might teach in an intro theory course. The result by Sénizergues filled in what had been the last unknown box:
Problem/machine | DFA | NFA | DPDA | NPDA | DLBA | DTM
Does $M$ accept $x$? | In P | In P | In P | In P | PSPC | Undec.
Is $L(M) = \emptyset$? | In P | In P | In P | In P | Undec. | Undec.
Is $L(M_1) \cap L(M_2) = \emptyset$? | In P | In P | Undec. | Undec. | Undec. | Undec.
Is $L(M) = \Sigma^*$? | In P | PSPC | In P | Undec. | Undec. | Undec.
Is $L(M_1) = L(M_2)$? | In P | PSPC | Decidable | Undec. | Undec. | Undec.
Here ‘PSPC’ means $\mathsf{PSPACE}$-complete. This table is central but leaves out whole fields of important theory.
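To see why an “In P” entry such as DFA emptiness is easy, here is a sketch of the textbook argument: $L(M)$ is nonempty exactly when some accepting state is reachable from the start state, which is a plain graph search. The dictionary encoding of the transition function is our own illustrative choice.

```python
from collections import deque

def dfa_is_empty(delta, start, accepting):
    """Breadth-first search: L(M) is nonempty iff an accepting state
    is reachable from the start state via transitions in delta."""
    seen, frontier = {start}, deque([start])
    while frontier:
        q = frontier.popleft()
        if q in accepting:
            return False                      # reachable accepting state
        for (p, _symbol), r in delta.items():
            if p == q and r not in seen:
                seen.add(r)
                frontier.append(r)
    return True                               # no accepting state reachable

# DFA over {a,b} accepting strings that end in "ab" -- clearly nonempty.
delta = {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 1, (1, 'b'): 2,
         (2, 'a'): 1, (2, 'b'): 0}
print(dfa_is_empty(delta, 0, {2}))   # False
```

The “Undec.” entries are exactly where no such finite search can exist, by reductions from the halting problem.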
At the Theory Fest this June—which we mentioned here—there will be a panel on the future of TOC. We will try to guess what they will say.
Of course we don’t know what the panel will say. They don’t necessarily give statements ahead of time like in some Senate hearings. But we can get a hint from the subjects and titles of some of the invited plenary talks, which are the last afternoon session each day:
We salute Ken’s colleague Atri among the speakers. There is also a keynote by Orna Kupferman titled, “Examining classical graph-theory problems from the viewpoint of formal-verification methods.” And there is one by Avi Wigderson titled “On the Nature and Future of ToC”—which is the subject of this post.
We can get a fix on the present by looking at the regular papers in the conference program. But like Avi we want to try to gauge the future. One clear message of the above range of talks is that it will be diverse. But to say more about how theory is changing we take another look at the past.
We can divide the changes in TOC into two parts. One is the change in which questions we ask, and the other is the change in what we accept as answers.
Years ago, most of the questions we considered were basic questions about strings and other fundamental objects of computing. A classic example was one of the favorite problems of Zeke Zalcstein, my mentor: the star height problem. You probably do not know this—I knew it once and still had to look it up. Here is a definition: the star height of a regular expression is the maximum nesting depth of its Kleene stars, and the star height of a regular language is the minimum star height over all regular expressions denoting it.
Lawrence Eggan seems to have been the first to raise the following questions formally, in 1963:
Regarding the first question, at first it wasn’t even known whether star height ever needed to be greater than $1$. There are contexts in which one level of nesting suffices, most notably the theorem that one while-loop suffices for any program. Eggan proved however that star height is unbounded, and in 1966, François Dejean and Marcel Schützenberger showed this even for languages over a binary alphabet.
The second question became a noted open problem until Kosaburo Hashiguchi proved it decidable in 1988. His algorithm was not even elementary—that is, its time was not bounded by any fixed stack of exponentials in the size of the input—but Daniel Kirsten in 2005 improved it to double-exponential space, hence at worst triple-exponential time. It is known to be $\mathsf{PSPACE}$-hard, so we might hope only faintly for a runnable algorithm, but special cases (including ones involving groups that interested Zeke) may be tractable. Narrowing the gap is open and interesting but likely to be difficult.
Do you wish you could travel back to the early 1960s to work on the original problems? Well, basically you can: Just add a complementation operator and define it to leave star-height unchanged. Then the resulting generalized star-height problem is wide open, even regarding whether height $1$ always suffices. To see why it is trickier, note that over the alphabet $\{a,b\}$, both $\Sigma^*$ and $(ab)^*$ can be written with complements but no stars at all,
so those languages have generalized star-height $0$. Whereas, $(aa)^*$ does not—it needs the one star. See this 1992 paper and these recent slides for more.
Diversifying areas are certainly giving us new domains of questions to attack. Often the new problem is an old problem with a new application. For instance, Google’s PageRank algorithm derives from the theory of random walks on graphs, as we noted here.
The novelty we find it most fruitful to realize, however, comes from changes in what we regard as solutions—the second point at the head of the last section. We used to demand exact solutions and measure worst-case complexity. Now we allow various grades of approximation. Answers may be contingent on conjectures. For example, edit distance requires quadratic time unless the Strong Exponential Time Hypothesis is false—but some approximations to it run in nearly linear time. We have talked at length about such contingencies in crypto.
A nice survey in Nature by Ashley Montanaro shows the progression within the limited field of quantum algorithms. In the classical worst-case sense, it is said that there aren’t many quantum algorithms. For a long time the “big three” were the algorithms by Peter Shor and Lov Grover and the ability of quantum computers to simulate quantum $n$-body problems and quantum physics in general. Quantum walks became a fourth and linear algebra a fifth, but as Montanaro notes, the latter needs changing what we consider a solution to a linear system $Ax = b$ where $A$ is $N \times N$. You don’t get $x$; rather you get a quantum state that approximates $x/\|x\|$ over a space of $\log N$ qubits. The approximation is good enough to answer some predicates with high probability, such as whether the same $x$ approximately solves another system $A'x = b'$. You lose exactness but what you gain is running time that is polynomial in $\log N$ rather than in $N$. A big gain is that $N$ is now allowed to be huge.
The survey goes on to problems with restricted numbers of true qubits, even zero. These problems seem important today because it has been so hard to build real quantum computers with more than a handful of qubits. Beyond the survey there are quantum versions of online algorithms and approximations of those.
If we are willing to change what we consider to be an answer, it follows that we are primed to handle fuzzier questions and goals. Online auctions are a major recent topic, and we have talked about them a little. There are many design goals: fairness, truthfulness, minimizing regret, profitability for one side or the other. Again we note that old classic problems are often best adaptable to the new contexts, such as stable-marriage graph problems with various new types of constraints.
The old classic problems never go away. What may determine how much they are worked on, however, is how well we can modify what counts as a solution or at least some progress. It seems hard to imagine partial or approximate answers to questions such as, “is logspace equal to polynomial time?”
The problem we began with about equivalence of DPDAs may be a good test case. Sénizergues gave a simple yes-answer to a definite question, but as with star-height, his algorithm is completely hopeless. Now (D)PDAs and grammars have become integral to compression schemes and their analysis—see this or this, for instance. Will that lead to important new cases and forms of the classic problems we started with? See also this 2014 paper for PDA problem refinements and algorithms.
What are your senses of the future of ToC?
The problem of mining text for implications
2016 RSA Conference bio, speech |
Michael Rogers, the head of the National Security Agency, testified before the Senate Intelligence Committee the other day about President Donald Trump. He was joined by the heads of other intelligence agencies, who also testified. Their comments were, as one would expect, widely reported.
In real time, I heard Admiral Rogers’s comments. Then I heard and read the reports about them. I am at best puzzled about what happened.
The various reports all were similar to this:
Adm. Michael S. Rogers, the head of the National Security Agency, also declined to comment on earlier reports that Mr. Trump had asked him to deny possible evidence of collusion between Mr. Trump’s associates and Russian officials. He said he would not discuss specific interactions with the president.
The above quote is accurate—Adm. Rogers did not discuss specific interactions with the president. But I have trouble with this statement. The problem I have is this:
Are statements made in a Senate hearing subject to the basic rules of logic?
For example, if a person says $\forall x\, P(x)$ and later acknowledges that $a$ is in the range of $x$, can we conclude that he or she has effectively said $P(a)$?
Let’s look at the testimony of Adm Rogers. He insisted that he could not recall being pressured to act inappropriately in his almost three years in the post. “I have never been directed to do anything I believe to be illegal, immoral, unethical or inappropriate,” he said.
During his three years as head of the NSA he worked under President Obama and now President Trump. So I see the following logical argument. Since he has never been asked to do anything wrong during that period, it follows that Trump never asked him to do anything wrong.
This follows from the rule called universal specification or universal elimination. If $\forall x\, P(x)$ is true, then for any $a$ in the set it must follow that $P(a)$ is true.
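The rule over a finite domain can be sketched in a few lines. The names below are our own illustration, not a formal-logic library.

```python
# Universal elimination over a finite domain: once "forall x in S: P(x)"
# is granted, P(a) follows for any particular a in S.
def forall(S, P):
    """The universal claim 'forall x in S: P(x)' as a bool."""
    return all(P(x) for x in S)

def instantiate(universal_holds, a, S):
    """From 'forall x in S: P(x)' and a in S, conclude P(a)."""
    return universal_holds and a in S

S = {"Obama", "Trump"}       # the presidents in the three-year span
P = lambda x: True           # "never directed to do wrong" under x, as claimed

print(instantiate(forall(S, P), "Trump", S))  # True: the specific conclusion
```

The inference is trivial to a machine; the interest of the hearing example is that human discourse resists it.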
What is going on here? The reports that he refused to answer “is $P(a)$ true?” are correct. But to my mind he said a stronger statement: that $\forall x\, P(x)$ is true. Is it misleading reporting? Or do the rules of logic not apply to testimony before Senate committees? Which is a stronger statement:

$P(a)$

or

$\forall x \in S\; P(x),$

where $a$ is an element of $S$?

In mathematics the latter statement is stronger, but it appears not to be so in the real world. The statement $P(a)$ is more direct. What does this say about logic and its role in human discourse?
Ken recalls a course he took in 1979 from the late Manfred Halpern, a professor of politics at Princeton. Titled “Personal and Political Transformation,” the course used a set of notes that became Halpern’s posthumous magnum opus.
The notes asserted that components of human relationships can be classed into eight basic modalities, the first three being paradigms for life: emanation, incoherence, transformation, isolation, subjection, direct bargaining, boundary management, and buffering. The first three form a progression exemplified by Dorothy and the wizard vis-à-vis Glinda and the ruby slippers in The Wizard of Oz; later he added deformation as a ninth mode and fourth paradigm and second progression endpoint. It particularly struck Ken that presenting mathematical proofs is classed as a form of subjection: You can’t argue or bargain with a proof or counterexample.
Buffering made his list and remained in it. He showed how each member is archetypal in human history and depth psychology. So Ken’s answer is that the one-step-remove of saying “$\forall x\, P(x)$” rather than “$P(a)$” is a deeply rooted difference. It makes wiggle-room that a jury of peers might credit in a pinch.
Psychology aside, the mining of logical inferences is a major application area. Sometimes the inference is outside the text being analyzed, such as when “chatter” is evaluated to tell how far it may imply terrorist threats. We are interested in cases where the deduction is more inside. For instance, consider this example in a 2016 article on the work of Douglas Lenat:
A bat has wings. Because it has wings, a bat can fly. Because a bat can fly, it can travel from place to place.
One might say that underlying this is the logical rule $(\forall x)[\mathit{HasWings}(x) \Rightarrow \mathit{CanFly}(x)]$.
One of the problems, however, is that even if we limit the set to animals, the rule is false—there are many flightless birds. This leads into the whole area of non-monotonic logic, which is a topic for another day—but good to bear in mind when revelations from hearings revise previously held beliefs.
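A toy version of the non-monotonic idea can be sketched as default rules with exceptions. The rule names and exception list here are illustrative, not a real knowledge base.

```python
# Default rules: "has wings -> can fly" and "can fly -> can travel" hold
# unless a known exception for the entity retracts the conclusion.
DEFAULTS = {"has_wings": "can_fly", "can_fly": "can_travel"}
EXCEPTIONS = {("penguin", "can_fly"), ("ostrich", "can_fly")}

def infer(entity, facts):
    """Closes facts under the default rules, skipping known exceptions."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in DEFAULTS.items():
            if (premise in facts and conclusion not in facts
                    and (entity, conclusion) not in EXCEPTIONS):
                facts.add(conclusion)
                changed = True
    return facts

print(sorted(infer("bat", {"has_wings"})))      # wings -> fly -> travel
print(sorted(infer("penguin", {"has_wings"})))  # exception blocks the chain
```

Adding the penguin exception retracts a conclusion that was previously derivable—exactly the non-monotone behavior that classical logic forbids.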
Ken has been dealing this week with an example at the juncture of the logic of time and human language. He had to evaluate twenty pages of testimony about a recurring behavior $B$. In one place it states that $B$ occurred at time $t_1$ and occurred “once again” at time $t_2$. The question is whether one can infer that $t_2$ was the next occurrence of $B$ after $t_1$ and apply the corresponding rule.
This was complicated by the document having been translated from a foreign language. Whether time $t_2$ was the next occurrence of $B$ after time $t_1$ makes a difference to results Ken might give. Of course this may be clarified in a further round of testimony—but we could say the same about Admiral Rogers, and he has left the stand.
How soon will we have apps that can take statements of the form $\forall x\, P(x)$ and deduce $P(a)$ for a particular $a$ that we want to know about? Will inferences from “material implication” be considered material in testimony?
Update, 11:15pm 6/8/17: CNN has just told of a woman they interviewed about the contradictions between Donald Trump and James Comey. Asked if she believes Comey lied, she replied, “No.” Asked if she believes Trump lied, she replied, “No.” Asked how that could be, she said: “The media has distorted it.” Thus the logical law of excluded middle is replaced by a “law of occluded media,” which blocks constructive inference…
Wikimedia Commons source |
Robert Southey was the Poet Laureate of Britain from 1813 until his death in 1843. He published, anonymously, “The Story of the Three Bears” in 1837.
Today Ken and I want to talk about the state of $\mathsf{P}$ versus $\mathsf{NP}$ and the relationship to this story.
The story, as I’m sure you know, is about Goldilocks. She has—no surprise—curly blond hair. She enters the home of three bears while they are away. She tries their chairs, eats some of their porridge, and falls asleep on one of their beds. When the bears return she runs away.
What you may not know is that Southey’s original story had not a young girl but an old woman. She is not innocent but furtive, self-serving, and meddlesome: she breaks the little bear’s chair and eats his breakfast. An 1849 retelling by Joseph Cundall changed her into a girl named “Silver-hair” and changed her motives to restlessness and curiosity. Her hair changed to gold around 1868 but she did not acquire the name “Goldilocks” until 1904.
Of course there is no change to our classic problem: Claims continue that there are proofs that $\mathsf{P} = \mathsf{NP}$, claims continue that there are proofs that $\mathsf{P} \neq \mathsf{NP}$, and claims continue that the question cannot be resolved either way—but without offering any proof. What connection can there be to the Goldilocks story? It is in the telling—the literary rule of three augmented with a total order.
The Goldilocks tale is really one of $3 \times 3$. It has her try three items at each stage: chairs, food, and beds. At each stage of the story, one item is too big-or-hot-or-hard, one is too small-or-cold-or-soft, and one is just right. Then the bears follow the same sequence in discovering her traipsing.
The “just right” aspect has been named the Goldilocks Principle. Christopher Booker’s oft–quoted description of the “dialectical three” goes as follows:
“[T]he first is wrong in one way, the second in another or opposite way, and only the third, in the middle, is just right. … This idea that the way forward lies in finding an exact middle path between opposites is of extraordinary importance in storytelling.”
The Goldilocks Principle however leads, according to this neat history on the LetterPile website, to what it calls the “Goldilocks Syndrome”:
“We are living in consumerism, where big companies non-stop create billions of realities, where everybody … can feel ‘just right.’ … The problem starts when we can’t stop looking for perfect solutions in [this] pretty imperfect world.”
It is not clear whether they have a solution, but they go on to describe and recommend the following “Goldilocks Rule”:
“Balance between known and unknown, risky and risk-free, predictable and unpredictable.”
Our take on all this is: are we trying to be “too perfect” in our approach to $\mathsf{P}$ versus $\mathsf{NP}$? Can we profitably strike a new balance?
I have argued recently and before that $\mathsf{P} = \mathsf{NP}$ is possible, but with an algorithm for, say, $\mathsf{SAT}$ that is galactic—see here for our introduction of this term—meaning an algorithm that is completely useless. Here are three perspectives on the power of the two classes.
Lemma 1 There is a constant $c$ so that if $L$ is in $\mathsf{NP}$, then $L$ has a Boolean circuit of size at most $n^c$.
Note that the consequence is easy to show: Assume $\mathsf{P} = \mathsf{NP}$ and that there is such a constant $c$. Then this contradicts Ravi Kannan’s famous theorem that the polynomial hierarchy has sets that require Boolean circuits of size $n^k$ for any fixed $k$. (See this 2009 paper for more.) In terms of our theme: $\mathsf{P}$ is too weak to be equal to $\mathsf{NP}$—the bowl is too small.
We can start by regarding the “three barriers”—relativization, natural proofs, and algebrization—as effects of such masquerading. Then one can focus further on the extent to which $\mathsf{NP}$-objects can be approximated by polynomial-time ones. Many $\mathsf{NP}$-complete problems are easy in the average case under certain natural distributions. We wonder whether the theory can be structured to say that logspace objects, or ones from even weaker classes, cannot approximate so well. An example of a technical issue to overcome is that some $\mathsf{NP}$-hard languages are approximated ultra-simply by the language of all strings.
Of course, it would be a huge breakthrough already to separate $\mathsf{NP}$ from small uniform circuit classes, let alone from logspace or $\mathsf{P}$. We’re suggesting instead to think along these lines:
So, which bowl? Which bed to lie in? Most seem to believe the second is how $\mathsf{P} \neq \mathsf{NP}$ will eventually be proved, but who knows. At least we’re trying not to run away.
The famous front-and-back cover art of the venerable 1979 textbook by John Hopcroft and Jeffrey Ullman is said to depict Cinderella:
We believe Goldilocks fits better: curly hair, wearing boots not slippers, and breaking things. Both of us recall general optimism about solving $\mathsf{P}$ versus $\mathsf{NP}$ at the time the text was published. Now the artwork seems prophetic on what happens when we tug at the question. Can we get a “middle-way” approach to it up and functioning?
While on the subject of textbooks, we are happy to note that our textbook Quantum Algorithms Via Linear Algebra received a second printing from MIT Press, in which all of our previous errata have been corrected.
UK Independent source—and “a gentle irony” |
Roger Bannister is a British neurologist. He received the first Lifetime Achievement Award from the American Academy of Neurology in 2005. Besides his extensive research and many papers in neurology, his 25 years of revising and expanding the bellwether text Clinical Neurology culminated in being added as co-author. Oh by the way, he is that Bannister who was the first person timed under 4:00 in a mile race.
Today I cover another case of “Big Data Blues” that has surfaced in my chess work, using a race-timing analogy to make it general.
Sir Roger also served as Head of Pembroke College, one of the constituents of Oxford University. He was one of three august Rogers with whom I interacted about IBM PC-XT computers when the machines were installed at Oxford in 1984–1985. Sir Roger Penrose was among trustees of the Mathematical Institute’s Prizes Fund who granted support for my installation of an XT-based mathematical typesetting system there, a story I’ve told here. Roger Highfield and his secretary used an XT in my college’s office, and I was frequently called in to troubleshoot. While drafting this post last month, I received a mailing from Sir Martin Taylor saying that Dr. Highfield had just passed away—from his obit one can see that he, too, received admission to a royal order.
Dr. Bannister was interested in purchasing several XTs for scientific as well as general purposes at Pembroke. At the time, numerical performance required purchasing a co-processor chip, adding almost $1,000 to what was already a large outlay per machine by today’s standards. I wish I’d thought to say in a quick deadpan voice, “let it run four minutes and it will give you a mile of data.” (Instead, I think the 1954 race never came up in our conversation.) Today, however, data outruns us all. How to keep control of the pace is our topic.
Roger Bannister 50-year commemorative coin. Royal Mint source. |
As shown above in the commemorative coin’s design, the historic 3:59.4 time was recorded on stopwatches. We’ll stay with this older timing technology for our example.
Suppose you have a field of 200 milers. Suppose you also have a box of 50 stopwatches. For each runner you pick a stopwatch at random and measure his/her time. You get results that closely match the histogram of times that were recorded for the same runners in trials the previous day.
How good is this? You can be satisfied that the box of watches does not have a systematic tendency to be slow or to be fast for runners at that mix of levels. Projections based on such fields are valid.
The rub, however, is that you could have gotten your nice fit even if each individual watch is broken and always returns the same time. Suppose your field included Bannister, John Landy, Jim Ryun, and Sebastian Coe, with each in his prime. They would probably average close to 3:55. Hence if one of the 50 watches is stuck on 3:55, it will fit them well. It doesn’t matter if you actually draw the watch when measuring the last-place finisher. The point is that you expect to draw the watch 4 times overall and are fitting an aggregate.
Indeed, you only need the distribution of the (stopped) watches to match the distribution of the runners under random draws. You may measure a close fit not only in the quantiles but also the higher moments, which is as good as it gets. Your model may still work fine on tomorrow’s batch of runners. But at the non-aggregate level, what it did in projecting an individual runner was vapid.
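A minimal simulation of this point—synthetic times and hypothetical parameters, nothing from real races—shows aggregates matching while individual predictions carry no information:

```python
import random
import statistics

# 50 "stuck" stopwatches whose frozen readings are drawn from the same
# distribution as the runners' true times. Aggregate statistics match
# well, yet each reading says nothing about the runner it was handed to.
random.seed(17)
true_times = [random.gauss(260, 15) for _ in range(200)]    # seconds per mile
stuck_watches = [random.gauss(260, 15) for _ in range(50)]  # frozen readings
measured = [random.choice(stuck_watches) for _ in true_times]

print(statistics.mean(true_times), statistics.mean(measured))    # close
print(statistics.stdev(true_times), statistics.stdev(measured))  # close

# Correlation between true and measured times, computed by hand:
mt, mm = statistics.mean(true_times), statistics.mean(measured)
cov = sum((t - mt) * (m - mm)
          for t, m in zip(true_times, measured)) / (len(true_times) - 1)
corr = cov / (statistics.stdev(true_times) * statistics.stdev(measured))
print(corr)  # near 0: no individual predictive power
```

The means and spreads agree because the watches were drawn from the right distribution; the near-zero correlation is the “vapid at the individual level” part.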
Here is a hypothetical example in the predictive analytic domain of my chess model. Consider a model used by a home insurance company to judge the probabilities of damage by earth movement, wind, fire, or flood and price policies accordingly. I’ve seen policies with grainy risk-scale levels that apply to several hundred homes in a given area at one time. The company only needs good performance on such aggregates to earn its profit.
But suppose the model were fine-grained enough to project probabilities on individual homes. And suppose it did the following:
This is weird but might not be bad. If the risks average out over several hundred homes, a model like this might perform well—despite the consternation homeowners would feel if they ever saw such individual projections.
Of course, “real” models don’t do this—or do they? The expansion of my chess model which I described last Election Day has started doing this. It fixates on some moves but gives near-zero probability to others—even ones that were played—while giving fits 5–50x sharper than before. If you’ve already had experience with behavior like the above, please feel welcome to jump to the end and let us know in comments. But to see what lessons to learn from how this happens in my new model, here are details…
My chess model assigns a probability $p_i$ to every possible move $m_i$ at every game turn, based only on the values $v_i$ given to those moves by strong computer chess programs and parameters denoting the skill profile of a formal player. The programs list move options in order of value for the player to move, so that $\delta_i = v_1 - v_i$ is the raw inferiority of $m_i$ in chess-specific centipawn units.
The model asserts that the parameters can be used to compute dimensionless inferiority values $x_i$, from which projected probabilities $p_i$ are obtained without further reference to either parameters or data. The old model starts with a function that scales down the raw difference $\delta_i$ to a value $\delta'_i$ according to the overall position value. Then it defines $x_i = (\delta'_i/s)^c$.
Lower $s$ and higher $c$ both decrease the probability of playing a sub-optimal move by dint of driving $x_i$ higher. The effect of $s$ is greatest when $\delta'_i$ is low, so $s$ is interpreted as the player’s “sensitivity” to small differences in value, whereas $c$ governs the frequency of large mistakes and hence is called “consistency.” My conversion represents each $p_i$ as a power of the best-move probability $p_1$, namely solving the equations

$p_i = p_1^{\,e^{x_i}}, \qquad \sum_{i=1}^{\ell} p_i = 1,$

where $\ell$ is the number of legal moves in the position. The double exponential looks surprising but can be broken down by regarding $u_i = e^{-x_i}$ as a “utility share” expressed in proportion to the best move’s utility $u_1 = 1$; then the exponent on $p_1$ is $1/u_i$. Alternate formulations can define $p_i$ directly from $x_i$ and the parameters, and/or simply normalize the shares by $\sum_j u_j$ rather than use powers, but they seem not to work as well.
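A minimal sketch of this inner loop, under the assumption that the shares take the form $u_i = e^{-(\delta_i/s)^c}$ with $p_i = p_1^{1/u_i}$ and the probabilities summing to one; the parameter values below are illustrative, not fitted ones.

```python
import math

# Inner loop sketch: given raw inferiorities delta_i (in pawns), compute
# projected move probabilities p_i = p1 ** (1/u_i), with p1 found by
# bisection so the probabilities sum to 1. Parameters s, c are toy values.
def move_probabilities(deltas, s=0.10, c=0.50):
    alphas = [math.exp((d / s) ** c) for d in deltas]  # exponents 1/u_i
    lo, hi = 0.0, 1.0
    for _ in range(100):                # bisection on p1: sum is monotone
        p1 = (lo + hi) / 2
        if sum(p1 ** a for a in alphas) > 1.0:
            hi = p1
        else:
            lo = p1
    return [p1 ** a for a in alphas]

probs = move_probabilities([0.00, 0.10, 0.30, 0.90])
print([round(p, 3) for p in probs])   # best move most likely; sums to ~1
```

Note how quickly the probabilities fall off: the double exponential makes even modest value differences very costly for a “sensitive” player.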
This “inner loop” defines $\{p_i\}$ as a probability ensemble given any point in the parameter space. The “outer loop” of regression needs to determine the parameter values that best conform to the given data sample. The $p_i$ determine projections for the frequency of “matching” the computer’s first move and for the “average scaled difference” of the played moves, by averaging $p_1$ and $\sum_i p_i\,\delta'_i$ respectively over the turns of the sample.
The regression makes these into unbiased estimators by matching them to the actual values observed in the sample. We can view this as minimizing a least-squares “fitness function” summing the weighted squared deviations of the projected from the actual values, where the weights on the individual tests are fixed ad-lib. In fact, my old model virtually always gets a perfect fit of zero, thus solving two equations in two unknowns. Myriad alternative fitness functions using other statistical tests and weights help to judge the larger quality of the fit and cross-validate the results.
In my original model, all is good. My training sets for a wide spectrum of Elo ratings yield best-fit values that not only give a fine linear fit with residuals small across the spectrum, but the individual sequences of $s$ and $c$ values also give good linear fits to Elo rating. Moreover, across parameter settings and positions the projected probabilities have magnitudes that spread out over the reasonable moves.
My old model is however completely monotone in this sense: The best move(s) always have the highest $p_i$, regardless of the parameter settings. Moreover, an uptick in the value of any move increases its $p_i$ for every setting. This runs counter to the natural idea that weaker players prefer weaker moves.
The new model postulates a mechanism by which weaker moves may be preferred by dint of looking better at earlier stages of the search. A new measure called “swing” is positive for moves whose high worth emerges only late in the search, and negative for moves that look attractive early on but end with subpar values. The latter moves might be “traps” set by a canny opponent, such as the pivotal example from the 2008 world championship match discussed here.
A player’s susceptibility to “swing” is modeled by a new parameter called $h$ for “heave,” as I described last November. The basic idea is that the swing-adjusted value represents the “subjective value” of the move $m_i$, so that its gap from the best move’s value represents the subjective difference in value. The idea I actually use applies swing to adjust the inferiority measure:
where is a fourth parameter and for negative is defined to be . Dropping from the second term and raising it to just the not power would be mathematically equivalent, but coupling the parameters makes it easier to try constraining and/or . (In fact, I’ve tried various other combinations and tweaks to the formulas for and , plus four other parameters kept frozen to default values in examples here. None so far has changed the picture described here.)
Note that the formulas preserve top billing for the first-listed move $m_1$. When $m_i$ has equal-optimal value, that is $\delta_i = 0$, its swing cannot be negative and is usually positive. That makes the adjusted inferiority positive and hence reduces the share $u_i$ compared to $u_1$. The first big win for the new model is that it naturally handles a puzzling phenomenon I identified years ago, for which my old model makes an ad-hoc adjustment.
The second big win is that the adjusted inferiority can be negative even when $\delta_i > 0$—the swing term overpowers the other. This means the model projects the inferior move as more likely than the engine’s optimal move. This is nervy, but in many cases my model correctly “foresees” the player taking the bait.
The third big win—but tantalizing—is that the extended model not only allows solving 2 more equations but often makes other fitness tests align like magic. The first of the following choices of extra equations makes an unbiased estimator for the frequency of playing a move of equal value to the best move, which became my third cheating test after its advocacy in this paper (see also the reply in this):
A typical fit that looks great by all these measures is here. It has 26,450 positions from all 497 games at standard time controls with both players rated between Elo 2040 and Elo 2060 since 2014 that are collected in the encyclopedic ChessBase Big 2017 data disc. It shows for to , then and tests related to it, then is repeated between and , and finally come four cases of for 0.01–0.10, 0.11–0.30, 0.31–0.70, and 0.71–1.50, plus four with .
Only , , , and were fitted on purpose. All the other tests follow closely like baby ducks in a row, except for some like captures and advancing versus retreating moves where human peculiarities may be identified. The value of is 5–10x as sharp as what my old model typically achieves. The new model seems to be confirming itself across the board and fulfilling the goal of giving accurate projected probabilities for all moves, not just the best move(s). What could possibly be amiss?
The first hint of trouble comes from the fitted value of being . In my old model, players rated 2050 give between and , while even the best players give . Players rated 2050 are in amateur ranks and leaves no headroom for masters and grandmasters. The value of compounds the sharpness; together with , a slight value difference (say) gets ballooned up to , giving and , which shrinks near 1-in-5,000 when and below 1-in-650,000 when . This is weirdly small—and we have not even yet involved the effects of the swing term with .
Those effects show up immediately in the file. I skip turns 1–8, so White’s 9th move is the first item. In the following position at left, Black has just captured a pawn and White has three ways to re-take, all of them reasonable moves according to the Stockfish 7 program.
Positions in game Franke-Doennebrink, 1974 at White’s 9th move (left) and Black’s 11th (right).
Here is how my new model projects them:
NRW Class1 1314;Germany;2014.02.02;6.4;Franke, Thomas;Doennebrink, Elmar;1-0
r1b1k2r/pp2bppp/2n1pn2/3q4/3p4/2PBBN2/PP3PPP/RN1Q1RK1 w kq - 0 9; c3xd4, engine c3xd4
Eval 0.24 at depth 21; swap index 1 and spec AA2050SF7w4sw10-19: (InvExp:1), Unit weights with
s = 0.0083, c = 0.3846, d = 12.5000, v = 0.0500, a = 0.9863, hm = 1.8024, hp = 1.0000, b = 1.0000:

M#  Rk  Move      RwDelta  ScDelta   Swing   SwDDep   SwRel  Util.Share  ProjProb'y
 1   1  c3xd4:       0.00    0.000   0.000    0.000   0.000  1           0.79527569
 2   2  Nf3xd4:      0.42    0.321  -0.035   -0.034  -0.034  0.144422    0.20472428
 3   3  Be3xd4:      0.55    0.395   0.008    0.005   0.005  0.00445313  0.00000001
That’s right—it gives zero chance of a 2050-player taking with the Bishop, even though Stockfish rates that only a little worse than taking with the Knight. True, human players would say 9.Bxd4 is a stupid move because it lets Black gain the “Bishop pair” by exchanging his Knight for that Bishop. Of 155 games that ChessBase records as reaching this position, 151 saw White recapture by 9.cxd4, 4 by 9.Nxd4, and none by 9.Bxd4. So maybe the extremely low projection—for 9.Bxd4 and all other moves—has a point. But to give zero? The is the utility share, so the is actually about ; the is an imposed minimum. My original model—setting and fitting only and —spreads out the probability nicely, maybe even too much here:
M#  Rk  Move      RwDelta  ScDelta   Swing   SwDDep   SwRel  Util.Share  ProjProb'y
 1   1  c3xd4:       0.00    0.000   0.000    0.000   0.000  1           0.57620032
 2   2  Nf3xd4:      0.42    0.321  -0.035   -0.034  -0.034  0.280586    0.14018157
 3   3  Be3xd4:      0.55    0.395   0.008    0.005   0.005  0.241178    0.10168579
At Black’s 11th turn, however, the new model gives three clearly wrong “zero” projections:
NRW Class1 1314;Germany;2014.02.02;6.4;Franke, Thomas;Doennebrink, Elmar;1-0
r1b2rk1/pp2bppp/2nqpn2/8/3P4/P1NBBN2/1P3PPP/R2Q1RK1 b - - 0 11; Rf8-d8, engine b7-b6
Eval 0.11 at depth 20; swap index 1 and spec AA2050SF7w4sw10-19: (InvExp:1), Unit weights with
s = 0.0083, c = 0.3846, d = 12.5000, v = 0.0500, a = 0.9863, hm = 1.8024, hp = 1.0000, b = 1.0000:

M#  Rk  Move      RwDelta  ScDelta   Swing   SwDDep   SwRel  Util.Share  ProjProb'y
 1   1  b7-b6:      -0.00   -0.000   0.000    0.000   0.000  1           0.56792559
 2   2  Nf6-g4:      0.18    0.163  -0.001   -0.002  -0.002  0.0907468   0.00196053
 3   3  Rf8-d8:      0.18    0.163   0.042    0.046   0.046  0.00391154  0.00000001
 4   4  Bc8-d7:      0.21    0.187  -0.029   -0.030  -0.030  0.278053    0.13071447
 5   5  Nf6-d5:      0.28    0.241   0.047    0.050   0.050  0.00218845  0.00000001
 6   6  a7-a6:       0.30    0.256  -0.049   -0.051  -0.051  0.28777     0.14001152
 7   7  Qd6-c7:      0.31    0.264  -0.012   -0.012  -0.012  0.097661    0.00304836
 8   8  g7-g6:       0.37    0.306   0.015    0.017   0.017  0.00355675  0.00000001
 9   9  Qd6-d8:      0.39    0.320  -0.054   -0.051  -0.051  0.206264    0.06438231
10  10  Qd6-b8:      0.39    0.320  -0.037   -0.038  -0.038  0.158031    0.02787298
Owing to many other games having “transposed” here by a different initial sequence of moves, Big 2017 shows 911 games reaching this point. In 683 of them, Black played the computer’s recommended 11…b6. None played the second-listed move 11…Ng4, which reflects well on the model’s giving it a tiny . But the third-listed move 11…Rd8 gets a zero despite having been chosen by 94 players. Then 91 played the sixth-listed 11…a6, which actually gets the second-highest nod from the model, and 22 played 11…Bd7, which the new model considers third most likely. But 12 players chose 11…Nd5, four of them rated over 2300 including the former world championship candidate Alexey Dreev in a game he won at the 2009 Aeroflot Open. My old model’s fit of the same data gives 34.8% to 11…b6, 10.4% to 11…Ng4 and 7.5% to 11…Rd8 with the ad-hoc change for tied moves (would be 8.7% to both without it), and 5.1% to 11…Nd5, with eighteen moves getting at least 1%.
To be sure, this is a well-known “book” position. The 75% preference for 11…b6 doubtless reflects players’ knowledge of past games and even the fact that Stockfish and other programs consider it best. It is hard to do a true distributional benchmark of my model in selected positions because the ones with enough games are exactly the ones in “book.” Studies of common endgame positions have been tried then and now, but with the issue that the programs’ immediate complete resolution of these endgames seems to wash out much of the progression in thinking and differentiation of player skill that one would like to capture. (My cheating tests exclude all “book-by-2300+” positions and all with one side ahead more than 3.00.) Most to the point, the fitting done by my model on training data is supposed to be already the distributional test of how players of that rating class have played over many thousands of instances.
The following position is far from book and typifies the most egregious kind of mis-projection:
SVK-chT1E 1314;Slovakia;2014.03.23;11.6;Debnar, Jan;Milcova, Zuzana;1-0
2r4k/pp5p/2n5/2P1p2q/2R1Qp1r/P2P1P2/1P3KP1/4RB2 b - - 1 32; Qh5-g5, engine Qh5-g5
Eval 0.01 at depth 21; swap index 2 and spec AA2050SF7w4sw10-19: (InvExp:1), Unit weights with
s = 0.0083, c = 0.3846, d = 12.5000, v = 0.0500, a = 0.9863, hm = 1.8024, hp = 1.0000, b = 1.0000:

M#  Rk  Move      RwDelta  ScDelta   Swing   SwDDep   SwRel  Util.Share  ProjProb'y
 1   1  Qh5-g5:      0.00    0.000   0.129    0.137   0.000  0.0347142   0.00018527
 2   2  Rc8-d8:      0.00    0.000   0.026    0.025  -0.112  1           0.74206054
 3   3  a7-a6:       0.09    0.085  -0.008   -0.005  -0.142  0.117659    0.07922164
 4   4  Rc8-g8:      0.21    0.187  -0.092   -0.087  -0.225  0.0996097   0.05003989
 5   5  Rh4-h1:      0.25    0.219  -0.166   -0.165  -0.302  0.136659    0.11270482
This has two tied-optimal moves for Black in a position judged +0.01 to White, not a flat 0.00 draw value, yet the one that was played gets under a 1-in-5,000 projection. Here are the by-depth values that produced the high positive value:
--------------------------------------------------------------------------------------------
        5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21
--------------------------------------------------------------------------------------------
Qg5  -117 -002 -008 +000 -015 +008 +017 +032 +011 +007 +000 +004 +001 +000 +000 +006 +001
Rd8  -089 -058 -036 -032 -013 -025 +006 +000 -010 +000 -012 -013 +001 +014 +001 +001 +001
--------------------------------------------------------------------------------------------
The numbers are from White’s view, so what happened is that 32…Rd8 looked like giving Black the advantage at depths 10, 13, and 15–16, whereas 32…Qg5 looked significantly inferior (to Stockfish 7) at depth 12 and nosed in front only at depth 20 just before falling into the tie. The swing computation begins at depth 10 to evade the Stockfish-specific strangeness I noted here last year, so in particular the “rogue” values at depth 5 (and below) are immaterial. The values and differences from depth 10 onward are all relatively gentle. Hence their amounting to a tiny versus and microscopic is a sudden whiplash.
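For illustration, here is a naive depth-averaged statistic applied to the by-depth values above. This is my own unweighted guess at such a computation, not the model's actual (weighted) swing formula; it only shows how averaging deviations from the final value, starting at depth 10, already separates the two moves:

```python
# Depth-by-depth centipawn values for depths 5..21, from the table above
# (White's view throughout, as in the table).
qg5 = [-117, -2, -8, 0, -15, 8, 17, 32, 11, 7, 0, 4, 1, 0, 0, 6, 1]
rd8 = [-89, -58, -36, -32, -13, -25, 6, 0, -10, 0, -12, -13, 1, 14, 1, 1, 1]

def naive_swing(vals, start_depth=10, first_depth=5):
    """Unweighted mean deviation from the final (deepest) value, in pawns,
    using only depths >= start_depth as the text describes."""
    final = vals[-1]
    window = vals[start_depth - first_depth:]
    return sum(v - final for v in window) / len(window) / 100.0

print(naive_swing(qg5))  # positive: Qg5 looked better at middle depths
print(naive_swing(rd8))  # negative: Rd8 mostly looked worse
```

Even this crude version puts Qg5 well above Rd8, though the magnitudes differ from the model's weighted values in the table.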
What I believe is happening to the fit is hinted by this last example giving the highest probability to the 2nd-listed move. Our first game above has two positions where the 9th-listed move gets the love. (The second, shown in full here, is notable in that the second-best move gets a zero though it is inferior by only 0.03 and was played by all three 2200+ players in the book.) This conforms to the goal of projecting when weaker players will prefer weaker moves.
This table shows that the new model quite often prefers moves other than , compared to how often they are played:
To be sure, the model is not putting 100% probability on these preferred moves, but when preferred they get a lot more probability than under my old model, which never prefers a move other than . Recall however that my old model’s fit was not too far off on these indices—and both models are fitted to give the same total probability to over all positions . Hence the probability on inferior moves is conserved but more concentrated.
Yes, greater concentration was the goal—so as to distinguish the most plausible inferior moves. But the above examples show a runaway process. The new model seems to be seizing onto properties of the distribution alone. For each we can define to be the move with the most negative value of . The also form a histogram over . The fitting process can grab it by putting all weight on plus at most a few other moves at each turn .
These few moves are the “stopped-watch reading” in my analogy. The moves given zero are the readings that cannot happen for a given runner/position. The fitting doesn’t care whether moves getting zero were played, so long as other turns fill in the histogram. If a high for —as with 32…Rd8 above—fills a gap, the fit will gravitate toward values of and that beat down all the moves with at such turns . In trials on other data, I’ve seen crash under while zooms aloft in a crazy race.
What can fix this? The maximum likelihood estimator (MLE) in this case involves maximizing the sum of the logarithms of the projected probabilities of the moves played (equivalently, minimizing the negative log-likelihood). Adding it as a weighted component of the fitness measure helps a little by inflating the probability of the moves that were played, but so far not a lot. Even more on-point may be maximum entropy (ME) estimation, which in this case means minimizing the negative of the entropy of the projected distributions.
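In toy form, the two objectives look like this. The numbers are made up and the parameterized model itself is absent; this is only a sketch of the quantities being combined, not the actual fitting code:

```python
import math

# Toy stand-in: at each turn t the model assigns probabilities over the
# legal moves, and played[t] is the index of the move actually chosen.
projections = [
    [0.6, 0.3, 0.1],
    [0.5, 0.25, 0.25],
    [0.8, 0.1, 0.1],
]
played = [0, 2, 1]

def neg_log_likelihood(projs, played):
    """MLE component: minimize -sum_t log p_t(move played at turn t)."""
    return -sum(math.log(p[m]) for p, m in zip(projs, played))

def neg_entropy(projs):
    """ME component: minimize sum_t sum_i p_i log p_i (i.e., maximize
    entropy), which penalizes over-concentrated projections."""
    return sum(p_i * math.log(p_i) for p in projs for p_i in p if p_i > 0)

def meta_fitness(projs, played, w_mle=1.0, w_me=0.1):
    """One hypothetical way to combine the components with weights."""
    return w_mle * neg_log_likelihood(projs, played) + w_me * neg_entropy(projs)

print(neg_log_likelihood(projections, played))
print(neg_entropy(projections))
```

A fit that zeroes out moves that were actually played sends the log-likelihood term to infinity, which is exactly the runaway behavior these components are meant to punish.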
There are various other ways to fit the model, including a quantiling idea I devised in my AAAI 2011 paper with Guy Haworth. In principle, and because the training data is copious, it is good to have these ways agree more than they do at present. Absent a lightning bolt that fuses them, I am finding myself locally tweaking the model in directions that optimize some “meta-fitness” function composed from all these tests.
Is this a known issue? Does it have a name? Is there a standard recipe for fixing it?
Do any deployed models have similar tendencies that aren’t noticed because there isn’t the facility for probing deeper into the grain that my chess model enjoys?
[added “at standard time controls”, a few other word changes, added game diagrams]
Cropped from source
Bill Clinton was the 42nd President of the United States. He came close to becoming the first First Gentleman—or whatever we will call the husband of a female president. He is also a fan of crossword puzzles, and co-authored with Victor Fleming a puzzle for this past Friday’s New York Times.
Today we discuss an apparently unintended find in his puzzle. It has a Mother’s Day theme.
The puzzle was widely publicized as having a “secret message” or “Easter egg.” Many crossword puzzles have a theme constituted by the longer answers, but the Friday and Saturday NYT puzzles are usually themeless. They are also designed to be the hardest in a progression that begins with a relatively easy Monday puzzle each week. The online renditions are subscriber-only, but the Times opened this puzzle freely to the public, so you are welcome to try to solve it and find the “hidden” content before we give it away.
In a previous post we featured Margaret Farrar, the famous first crossword editor for the Times, and described how the puzzles look and work. Proper nouns such as CHILE the country, standard abbreviations, and whole phrases are fair game as answers, and they are rammed together without spaces or punctuation. For instance, the clue “Assistance for returning W.W. II vets” in Clinton’s puzzle produces the answer GIBILL. (My own father, returning from the occupation of Japan, completed his college degree under the G.I. Bill.) Some clues are fill-in-the-blank, such as “Asia’s ____ Sea” in the puzzle.
The intended hidden message is formed from three long answers symmetrically placed around the puzzle’s center. It is the signature line from a 1977 Fleetwood Mac song that Clinton has used since his 1992 presidential campaign. If you expected the puzzle to have a theme, these three lines would obviously be it.
An “Easter egg” is a side feature, usually small and local and often, as Wikipedia says, an inside joke. When I printed and did the puzzle over lunch on Friday, I missed the intended content because it wasn’t the kind I was looking for. But I did find something one can call an “Eester gee” involving the three shorter clues and answers mentioned above:
My eye had been drawn by finding Bill in his own puzzle. Winding through him is HILLAREE, indeed in three different ways but with EE in place of Y. Straining harder, one can extract CHEL- from CHILE and get -Sea from the clue for ARAL just underneath to find Chelsea, the Clintons’ only daughter.
Admittedly this is both stilted and cryptic, but it is singularly tied to the former First Family and appropriate just before Mother’s Day. Was this hidden by intent, or was it hiding by accident? Presuming the latter, what does this say about the frequency with which we can find unintended patterns? This matters not only to some historical controversies but also to cases of alleged plagiarism of writing and software code, even this investigation over song lyrics being planted in testimony.
Can we possibly judge the accidental frequency of such subjective patterns? Clinton’s puzzle allows us to experiment a little further. His only grandchild, Chelsea’s daughter, is named Charlotte. Can we find her in the same place?
Right away, CHILE and ARAL give us CHAR in a square, a promising start. There are Ls nearby, but no O. Nothing like “Lenya” or a ‘Phantom‘ reference is there to clue LOTTE. The THREE in our grid is followed by TON to answer the clue, “Like some heavy-duty trucks,” but getting the last four needed letters from there lacks even the veneer of defense of my using the I in CHILE as a connector. Is three tons a “lot”? No doting grandpa would foist that on a child. So we must reject the hypothesis that she is present.
We can attack the CHILE weakness in a similar manner. The puzzle design could have used CHELL, the player character of the classic video game Portal. HILLAREE would still have survived by using the I in Bill. However, the final L would have come below the N in the main-theme word THINKING, and it is hard to find natural answer words ending in NL. So our configuration has enough local optimality to preserve the contention that Chelsea is naturally present. Whether it is truly natural remains dubious, but it dodges this shot at refutation.
Going back, how should we regard the false-start on Charlotte? We should not be surprised that it got started. That she shares the first two letters with Chelsea may have been “correlated” if not expressly purposeful. Such correlations are a major hard-to-handle factor in cases of suspected plagiarism or illicit signaling, as both Dick and I can attest generally from experience.
Of course, this is more the stuff of potboilers and conspiracy theories than serious research. That hasn’t stopped it from commanding the input of some of our peers, however. The best-selling 1997 book The Bible Code, following a 1994 paper, alleges that sequences of Hebrew letters at fixed-jump intervals in the Torah—the first five books of the Hebrew Bible—form sensible prophetic messages to a degree far beyond statistical expectation.
The fact that Hebrew skips many vowels helps in forming patterns. For instance, arranging the start of Genesis into a 50-column crossword yields TORaH in column 6, and as Wikipedia notes here, exactly the same happens in column 8 at the start of Exodus. Even just among the consonants, some alleged messages have glitches and skips like ours with HILLAREE and CHILE. Where is the line between patching-and-fudging and true statistical surprise? Our friend Gil Kalai was one of four authors of a 1999 paper delving deep into the murk. They didn’t just critique the 1994 paper, they conducted various experiments. Some were akin to ours above with CHARLOTTE, some could be like trying to find unsavory Clinton associations in the same puzzle, and the largest was replicating many of the same kind of finds in a Hebrew text of War and Peace.
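The "fixed-jump" search is easy to mechanize. Here is a toy sketch of my own (not the methodology of any of the cited papers) that scans a text for a word appearing as an equidistant letter sequence:

```python
def find_els(text, word, max_skip=60):
    """Return (start, skip) pairs where `word` occurs as an equidistant
    letter sequence: its letters appear at positions start, start+skip,
    start+2*skip, ... Only skips >= 2 count, so ordinary substrings are
    excluded. Non-letters are stripped first, as in Bible-code searches."""
    text = "".join(ch for ch in text.upper() if ch.isalpha())
    hits = []
    for skip in range(2, max_skip + 1):
        for start in range(len(text) - skip * (len(word) - 1)):
            if all(text[start + i * skip] == ch
                   for i, ch in enumerate(word.upper())):
                hits.append((start, skip))
    return hits

# Planting "TORH" at skip 3 amid filler letters:
print(find_els("xTxxOxxRxxHxx", "TORH"))  # → [(1, 3)]
```

Running such a search over a long text and many candidate words makes clear how quickly "finds" accumulate by chance alone.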
The controversy over the genesis of William Shakespeare’s plays has notoriously involved allegedly hidden messages, most famously stemming from the 1888 book The Great Cryptogram supporting Francis Bacon as their true author. Two other major claimants, Edward de Vere (the seventeenth Earl of Oxford) and Christopher Marlowe, are hardly left out. Indeed, they both get crossword finds in the most prominent place of all, the inscription on Shakespeare’s funerary monument in Stratford, England:
The inscription is singular in challenging the “passenger” (passer-by) to “read” who is embodied within the Shakespeare monument. His tomb proper is nearby in the ground. Supporters of de Vere arrange the six parts of the Latin preface into a crossword and find their man in column 2:
The leftover OL is a blemish but it might not be wasted—it could refer to “Lord Oxford” in like manner to how “Mr. W.H.” in the dedication to Shake-speares Sonnets plausibly refers to Henry Wriothesley, the Earl of Southampton, who was entreated to marry one of Oxford’s daughters throughout 1590–1593.
Supporters of Marlowe volley back in the style of a British not American crossword. Their answer constructs this part of the inscription as a cryptic-crossword clue:
Whose name doth deck this tomb, far more, then cost.
The only name on Shakespeare’s tomb is Jesus, and the Oxford English Dictionary registers ley as an old word for a bill or tax, generically a cost. The answer to the monument’s riddle thus becomes CHRISTO-FAR MORE-LEY, which is within the convex hull of how Marlowe’s name was spelled in his lifetime. The subsequent SIEH, which is most simply explained as a typo for SITH meaning “Truly,” is constructed by modern cryptic-crossword convention as “HE IS returned,” in line with theories that Marlowe’s 1593 murder was actually staged to put him under deep cover in the Queen’s secret service.
What to make of these two readings? The only solid answer Dick and I have is the same as when we are sent a claimed proof of one week and one of the next:
They can’t both be right.
Or—considering that Marlowe has recently been credited as a co-author of Shakespeare’s Henry VI cycle, and that William Stanley, who completes Wikipedia’s featured quartet of claimants, wound up marrying the above-mentioned daughter of Oxford—perhaps they can.
Where do you draw the lines among commission, coincidence, and contrivance? Where does my Clinton crossword finding fall?
Happy Mother’s Day to you and yours as well.
[fixed description of Chell character, “seventh”->”seventeenth”, added ref. to song-lyrics case, some wording tweaks]
Alternate photo by Quanta
Thomas Royen is a retired professor of statistics in Schwalbach am Taunus near Frankfurt, Germany. In July 2014 he had a one-minute insight about how to prove the famous Gaussian correlation inequality (GCI) conjecture. It took one day for him to draft a full proof of the conjecture. It has taken several years for the proof to be accepted and brought to full light.
Today Ken and I hail his achievement and discuss some of its history and context.
Royen posted his paper in August 2014 with the title, “A simple proof of the Gaussian correlation conjecture extended to multivariate gamma distributions.” He not only proved the conjecture, he recognized and proved a generalization. The “simple” means that the tools needed to solve it had been available for decades. So why did it elude some of the best mathematicians for those decades? One reason may have been that the conjecture spans geometry, probability theory, and statistics, so there were diverse ways to approach it. A conjecture that can be viewed in so many ways is perhaps all the more difficult to solve.
Even more fun is that Royen proved the conjecture after he was retired and had the key insight while brushing his teeth—as told here. Ken recalls one great bathroom insight not in his research but in chess: In the endgame stage of the famous 1999 Kasparov Versus the World match, which became a collaborative research activity later described by Michael Nielsen in his book, Reinventing Discovery, Ken had a key idea while in the shower. His idea, branching out from the game at 58…Qf5 59. Kh6 Qe6, was the Zugzwang maneuver 60. Qg1+ Kb2 61. Qf2+ Kb1 62. Qd4!, which remains the only way for White to win.
Although solutions often come in a flash, the ideas they resolve often germinate from partial statements whose history takes effort to trace. One thing we can say is that the GCI does not originate with Carl Gauss, nor should it be considered named for him. A Gaussian measure $\mu$ on $\mathbb{R}^n$ (centered on the origin) is defined by having the probability density
$\displaystyle f(x) = \frac{1}{\sqrt{(2\pi)^n \det R}}\,\exp\left(-\frac{1}{2}\,x^T R^{-1} x\right),$
where $R$ is a positive definite covariance matrix and $x^T$ just means the transpose of $x$. Its projection onto any component is a usual one-variable normal distribution.
Suppose $C_1$ is a 90% confidence interval for a variable $x_1$ and $C_2$ a 90% confidence interval for another variable $x_2$. What is the probability that both variables fall into their intervals? If they are independent, then it is $0.9 \times 0.9 = 0.81$.
What if they are not independent? If they are positively correlated, then we may expect it to be higher. If they are inversely related, well…let’s also suppose the variables have mean $0$ and the intervals are symmetric around $0$: $C_1 = [-a,a]$, $C_2 = [-b,b]$. Do we still get at least $0.81$? This—extended to any subset of the variables with any smattering of correlations and to other shapes besides the products of intervals—is the essence of the conjecture.
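The two-variable question can be checked numerically. Here is a small sketch of my own (the correlation values are made up for illustration) that integrates the bivariate normal density over the square by a simple midpoint rule:

```python
import math

def bvn_rect_prob(a, rho, steps=400):
    """P(|x| <= a and |y| <= a) for a mean-zero bivariate normal with
    unit variances and correlation rho, via 2-D midpoint integration."""
    det = 1 - rho * rho
    norm = 1 / (2 * math.pi * math.sqrt(det))
    h = 2 * a / steps
    total = 0.0
    for i in range(steps):
        x = -a + (i + 0.5) * h
        for j in range(steps):
            y = -a + (j + 0.5) * h
            q = (x * x - 2 * rho * x * y + y * y) / det
            total += norm * math.exp(-q / 2) * h * h
    return total

a = 1.6449  # each marginal interval [-a, a] carries ~90% probability
for rho in (0.0, 0.5, -0.5):
    print(rho, bvn_rect_prob(a, rho))
```

With correlation $0$ one gets about $0.81$; with correlation $\pm 0.5$ the probability comes out higher. Note that for symmetric intervals the answer for $\rho$ and $-\rho$ is the same, which is why "inversely related" does not rescue a counterexample here.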
Charles Dunnett and Milton Sobel considered some special cases, such as when is an outer product for some vector , which makes it positive definite. Their 1955 paper is considered by some to be the source of GCI.
But it was Olive Dunn who first posed the above general terms in a series of papers that have had other enduring influence. The first paper in 1958 and the second in 1959 bore the like-as-lentils titles:
These seem to have generated confusion. The former is longer and frames the confidence-interval problem and is the only one to cite Dunnett-Sobel, but it does not mention a “conjecture.” The latter does discuss at the end exactly the conjecture of extending a case she had proved to arbitrary , but relates a reader’s counterexample. Natalie Wolchover ascribed the conjecture to the 1959 paper in her article linked above, but Wikipedia and other sources reference the 1958 paper, while subsequent literature we’ve seen has instances of citing either—and never both.
Dunn became a fellow of the American Statistical Association, a fellow of the American Association for the Advancement of Science (AAAS), and a fellow of the American Public Health Association. In 1974, she was honored as the annual UCLA Woman of Science, awarded to “an outstanding woman who has made significant contributions in the field of science.” Her third paper in this series, also 1959, was titled “Confidence intervals for the means of dependent normally distributed variables.” Her fourth, in 1961, is known for the still-definitive form of the Bonferroni correction for joint variables. But in our episode of “CSI: GCI” it seems we must look later to find who framed the conjecture as we know it.
Not an ad. Amazon source. So is it an ad?
Sobel came back to the scene as part of a 1972 six-author paper, “Inequalities on the probability Content of Convex Regions for Elliptically Contoured Distributions.” They considered integrals of the form
for general functions besides and for general positive definite . GCI in this case then has the form where is the identity matrix. They call elliptically contoured provided is finite. Writing about the history, they say (we have changed a few symbols and the citation style):
Inequalities for perhaps originate with special results of Dunnett and Sobel (1955) and of Dunn (1958), in which it is shown that for special forms of (with ) or for special values of .
They mention also an inequality by David Slepian and what they termed “the most general result for the normal distribution” by Zbynek Šidák, still with special conditions on . Their main result is “an extension of Šidák’s result to general elliptically contoured densities [plus] a stronger version dealing with a convex symmetric set.” This is where the relaxation from products of confidence intervals took hold. At last, after their main proof in section 2 and discussion in section 3, we find the magic word “conjecture”:
This suggests the conjecture: if is a random vector (with of dimension and of dimension ) having density and if and are convex symmetric sets, then
where
Clearly by iteration this implies the inequality with regard to . Here symmetric means just that belongs whenever belongs. Any symmetric convex set can be decomposed into strips of the form for fixed and , which their generality set them up to handle, and proving the inequality for strips suffices. This is considered the modern statement of GCI. The rest of their paper—over half of it—treats attempts to prove it and counterexamples to some further extensions.
Finally in 1977, Loren Pitt proved the case $n = 2$, referencing the 1972 paper and Šidák but not Dunnett-Sobel or Dunn. Wolchover interviewed Pitt for her article, and this extract is revealing:
Pitt had been trying since 1973, when he first heard about [it]. “Being an arrogant young mathematician … I was shocked that grown men who were putting themselves off as respectable math and science people didn’t know the answer to this,” he said. He locked himself in his motel room and was sure he would prove or disprove the conjecture before coming out. “Fifty years or so later I still didn’t know the answer,” he said.
So as for framing GCI, whodunit? Royen ascribes it to the 1972 paper which is probably what popularized it to Pitt, but Dunn’s orthogonal-intervals formulation spurred the intervening work, accommodates extensions noted as equivalent to GCI by Royen citing this 1998 paper, and still didn’t get solved until Royen. So we find these two sources equally “guilty.”
The 1972 form of GCI has a neatly compact statement and visualization:
For any symmetric convex sets $K$ and $L$ in $\mathbb{R}^n$ and any Gaussian measure $\mu$ on $\mathbb{R}^n$ centered at the origin,
$\displaystyle \mu(K \cap L) \;\ge\; \mu(K)\,\mu(L).$
That is, imagine overlapping shapes symmetric about the origin in some Euclidean space. Throw darts that land with a Gaussian distribution around the origin. The claim is that the probability that a dart lands in both shapes is at least the probability that it lands in one shape times the probability that it lands in the other.
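To make the dart picture concrete, here is a small simulation of my own devising (a disk and a diagonal strip as the two symmetric convex sets, with an assumed correlation of 0.3) that checks the inequality numerically:

```python
import math, random

random.seed(7)

def gaussian_dart(rho=0.3):
    """Throw a dart with a centered bivariate normal distribution
    (unit variances, correlation rho), via a Cholesky-style construction."""
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    return z1, rho * z1 + math.sqrt(1 - rho * rho) * z2

def in_disk(x, y, r=1.2):       # symmetric convex set K
    return x * x + y * y <= r * r

def in_strip(x, y, w=0.9):      # symmetric convex set L: |x + y| <= w
    return abs(x + y) <= w

n = 200_000
hits = [gaussian_dart() for _ in range(n)]
pK = sum(in_disk(x, y) for x, y in hits) / n
pL = sum(in_strip(x, y) for x, y in hits) / n
pKL = sum(in_disk(x, y) and in_strip(x, y) for x, y in hits) / n
print(pKL, pK * pL)  # GCI predicts pKL >= pK * pL
```

Of course a simulation can only check instances, never prove the conjecture; but it does convey why the statement feels "obviously" true while resisting proof.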
UK Daily Mail source
George Lowther, in his blog “Almost Sure,” has an interesting post about early attempts to solve GCI. He notes the following partial results from the above-mentioned 1998 paper:
The first statement proves GCI in a “shrunken” sense, while the second makes that seem tantamount to solving the whole thing. Lowther explained, however:
Unfortunately, the constant in the first statement is , which is strictly less than one, so the second statement cannot be applied. Furthermore, it does not appear that the proof can be improved to increase to one. Alternatively, we could try improving the second statement to only require the sets to be contained in the ball of radius for some but, again, it does not seem that the proof can be extended in this way.
Royen did not use this idea—indeed, Wolchover quotes Pitt as saying, “what Royen did was kind of diametrically opposed to what I had in mind.” Instead she explains how Royen used a kind of smoothing between the original matrix and (with off-diagonal entries zeroed out as above) as a quantity $t$ varies from $0$ to $1$, taking derivatives with respect to $t$. For this he had tools involving transforms and other tricks at hand:
“He had formulas that enabled him to pull off his magic,” Pitt said. “And I didn’t have the formulas.”
Royen’s short paper does need the background of these tricks to follow, and the fact that the same tricks enabled a further generalization of GCI makes it harder. The proof was made more self-contained in this 2015 paper by Rafał Latała and Dariusz Matlak (final version) and in a 2016 project by Tianyu Zhou and Shuyang Shen at the University of Toronto, both focusing just on GCI and cases closest to Dunn’s papers. Rather than go into proof details here, we’ll say more about the wider context.
Independent events are usually the best type of events to work with. Recall if $A$ and $B$ are independent events then,
$\displaystyle \Pr[A \cap B] \ge \Pr[A] \cdot \Pr[B].$
Of course actually more is true: $\Pr[A \cap B] = \Pr[A] \cdot \Pr[B]$. But we focus on the inequality, since it can hold when $A$ and $B$ are not independent. In general, without some assumption on the events $A$ and $B$, the above inequality is not true: Consider the event $A$ that a fair coin is heads and the event $B$ that it is tails. Then the inequality becomes $0 \ge \frac{1}{4}$, which is false.
Since independence is not always true for two events, it is of great value to know when $\Pr[A \cap B] \ge \Pr[A] \cdot \Pr[B]$ is still true. Even an approximation is of great value. Note, a simple case where it still is true is when $A \subseteq B$; then the inequality is trivial: $\Pr[A \cap B] = \Pr[A] \ge \Pr[A] \cdot \Pr[B]$.
GCI reminds us of another inequality that intuitively cuts very fine and was difficult to prove: the FKG inequality. Ron Graham wrote a survey of FKG that begins with a discussion of Chebyshev’s sum inequality, named after the famous Pafnuty Chebyshev.
Chebyshev’s sum inequality states that if
$\displaystyle a_1 \ge a_2 \ge \cdots \ge a_n$
and
$\displaystyle b_1 \ge b_2 \ge \cdots \ge b_n,$
then
$\displaystyle \frac{1}{n}\sum_{k=1}^{n} a_k b_k \;\ge\; \left(\frac{1}{n}\sum_{k=1}^{n} a_k\right)\left(\frac{1}{n}\sum_{k=1}^{n} b_k\right).$
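Chebyshev's sum inequality is easy to check numerically. A small sketch, using random decreasing sequences of my own choosing:

```python
import random

def chebyshev_gap(a, b):
    """Return (1/n)·Σ a_k·b_k − (1/n·Σ a_k)(1/n·Σ b_k) for sequences
    sorted in the same (here decreasing) order; Chebyshev's sum
    inequality says this gap is nonnegative."""
    n = len(a)
    lhs = sum(x * y for x, y in zip(a, b)) / n
    rhs = (sum(a) / n) * (sum(b) / n)
    return lhs - rhs

random.seed(3)
for _ in range(100):
    n = random.randint(1, 10)
    a = sorted((random.uniform(-5, 5) for _ in range(n)), reverse=True)
    b = sorted((random.uniform(-5, 5) for _ in range(n)), reverse=True)
    assert chebyshev_gap(a, b) >= -1e-12
print("Chebyshev sum inequality held in all trials")
```

The gap is zero exactly when one of the sequences is constant, which matches the equality condition of the inequality.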
Wikipedia’s FKG article says how the relevance expands to other inequalities:
Informally, [FKG] says that in many random systems, increasing events are positively correlated, while an increasing and a decreasing event are negatively correlated.
An earlier version, for the special case of i.i.d. variables, … is due to Theodore Edward Harris (1960) … One generalization of the FKG inequality is the Holley inequality (1974) below, and an even further generalization is the Ahlswede-Daykin “four functions” theorem (1978). Furthermore, it has the same conclusion as the Griffiths inequalities, but the hypotheses are different.
We wonder whether the new results on GCI will spur an over-arching appreciation of all these inequalities involving correlated variables. We also wonder if in the complex case there is any connection between Royen’s smoothing technique and the process of purifying a mixed quantum state.
The amazing personal fact is that a retired mathematician solved the problem and did it with a relatively simple proof. What does this say about our core conjectures in theory? I am near retirement from Georgia Tech—does that mean I will solve some major open problem? Hmmmmmmm.
Also, which of you have had key insights come in the bathroom?
[nonsingular R–>positive definite R, other tweaks]
Boaz Barak and Michael Mitzenmacher are well known for many great results. They are currently working not on a theory paper, but on a joint “experiment” called Theory Fest.
Today Ken and I want to discuss their upcoming experiment and spur you to consider attending it.
There are many pros and some cons in attending the new Theory Fest this June 19-23. One pro is where it is being held—Montreal—and another is the great collection of papers that will appear at the STOC 2017 part of the Fest. But the main ‘pro’ is that Boaz and Mike plan on doing some special events to make the Fest more than just a usual conference on theory.
The main ‘con’ is that you need to register soon here, so do not forget to do that.
We humbly offer some suggestions to spice up the week:
A Bug-a-thon: Many conferences have hack-a-thons these days. A theory version could be a P=NP debugging contest. Prior to the Fest, anyone claiming to have solved P vs. NP must submit a paper along with a $100 fee—Canadian. At the Fest, teams of “debuggers” would get the papers and have a fixed time—say three hours—to find a bug in as many papers as they can. The team that debugs the most claims wins the entrance fees.
Note that submissions can be “stealth”—you know your paper is wrong, but the bugs are very hard to find.
Present a Paper: People submit a deck for a ten-minute talk. Then each person is randomly assigned a deck and must give a talk based only on that deck. There will be an audience vote, and the best presenter will win a trophy.
Note there are two theory issues. The random assignment must be random but fixed-point free—no one can get their own deck. Also, since going last seems to give an unfair advantage, we suggest that each person gets the deck only ten minutes before their talk. Thus all presenters would have the same time to prepare.
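A fixed-point-free permutation is called a derangement, and it is easy to sample one by rejection, since about a $1/e$ fraction of all permutations qualify. A minimal sketch (ours, not the organizers'):

```python
import random

def random_derangement(n, rng=random):
    """Sample a uniform permutation of range(n) with no fixed point.

    Rejection sampling: roughly a 1/e fraction of permutations are
    derangements, so this takes about e (~2.7) tries on average.
    """
    while True:
        perm = list(range(n))
        rng.shuffle(perm)
        if all(perm[i] != i for i in range(n)):
            return perm

random.seed(1)
assignment = random_derangement(8)
print(assignment)  # speaker i presents deck assignment[i]; never their own
```

Rejection keeps the distribution exactly uniform over derangements, which ad-hoc swapping schemes do not.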
Silent Auction For Co-authorship: We will set up a series of tables. On each table is a one page abstract of a paper. You get to bid as in a standard silent auction. The winner at each table becomes a co-author and pays their bid to STOC. The money could go to a student travel fund.
The A vs. B Debate: Theory is divided into A and B, at least in many conferences. We will put together a blue-ribbon panel and have them discuss: Is A more important than B? We will ask that the panel be as snippy as possible—a great evening idea while all drink some free beer.
Betting: We will have a variety of topics from P=NP to quantum computation where various bets can be made.
Cantal Complexity: The Fest will mark the 40th anniversary of Donald Knuth’s famous paper, “The Complexity of Songs.” Evening sessions at a pub will provide unprecedented opportunity for applied research in this core area. Ken’s research, which he began with Dexter Kozen and others at the ICALP 1982 musicfest, eventually led to this.
Lemmas For Sale: In an Ebay-like manner a lemma can be sold. We all have small insights that we will never publish, but they might be useful for others.
Zoo Excursion: This is not to the Montreal zoo—which is rather far—but to the Complexity Zoo which is housed elsewhere in Canada. Participants will take a virtual tour of all 535 classes. The prize for “collapsing” any two of them will be an instant STOC 2017 publication. In case of collapsing more than two, or actually finding a new separation of any pair of them, see under “Bug-a-thon” above.
Write It Up: This is a service-oriented activity. Many results have never been written up formally and submitted to journals. Often the reason is that the author(s) are busy with new research. This would be a list of such papers and an attempt to get students or others to write them up. This has actually happened many times already in an informal manner, so organizing it might be fun. We could use money to get people to sign up—or give a free registration to next year’s conference, for example.
GLL plans on gavel-to-gavel coverage of the Fest: we hope to have helpers that will allow us to make at least one post per day about the Fest. Anyone interested in being a helper should contact us here.
This will be especially appreciated because Ken will be traveling to a different conference in a voivodeship that abuts an oblast and two voblasts.
It takes a …
Sir Tim Berners-Lee is the latest winner of the ACM Turing Award. He was cited for “inventing the World Wide Web (WWW), the first web browser, and the fundamental protocols and algorithms allowing the web to scale.”
Today we congratulate Sir Tim on his award and review the work by which the Web flew out and floated wide.
Ken is the lead writer on this, and I (Dick) am just making a few small additions and changes: the phrase “flew out and floated wide” is due to Ken. He was until a while ago trapped in the real world where physical travel is still required. More exactly he was trapped at JFK airport in New York, which many consider not the best airport to be stuck at. The WWW may be wonderful for work at a distance, but we sometimes have to get from here to there: in Ken’s case it’s from Buffalo USA to Madrid Spain for meetings on chess cheating.
While he was stuck at JFK he had the pleasure of using the free airport WiFi. Free is generally good but it yields messages on his browser window that say, “Waiting for response from…” Ken adds:
OK, that window has my Yahoo! fantasy baseball team—I’ll use Verizon 4G access on my cellphone if it doesn’t load. Happily I can write these words offline until Dick and I can resume collaborating on this post when I’m airborne.
Let’s wish Ken good travels out of JFK, and get back to Ken’s thoughts on this Turing Award.
Some have already noted that others besides Berners-Lee were integral to the early days of the Web: his partners at CERN including Robert Cailliau and also Marc Andreessen who wrote the Mosaic browser with Eric Bina and founded Netscape with Jim Clark. Usually we lean toward the “It Takes a Village” view of multiple credits. Last year we addressed whether Ralph Merkle should have been included in the Turing Award with Whitfield Diffie and Martin Hellman. There we signaled our feeling by including Merkle in the first line and photo.
Here, however, we note first that Berners-Lee not only conceived a flexible architecture for the Web, he originated a trifecta: the HTTP protocol, the HTML language, and the first browser design. The protocol included the specification for URLs—Uniform Resource Locators. Of course he had partners on these designs and tools, including counterparts involved in negotiating their adoption, but that brings us to our second point.
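The three pieces fit together in miniature as follows; this sketch (our illustration, with a hypothetical URL) shows how a URL decomposes into the request a browser sends under HTTP/1.0, with HTML being what typically comes back:

```python
from urllib.parse import urlsplit

# A hypothetical URL, split into the parts a browser needs.
url = "http://example.org/path/page.html?q=turing"
parts = urlsplit(url)

# The raw HTTP/1.0 request a browser would send for this resource.
request = (
    f"GET {parts.path}?{parts.query} HTTP/1.0\r\n"
    f"Host: {parts.hostname}\r\n"
    "\r\n"
)
print(request)
```

The server's reply would normally carry an HTML document, closing the loop among URL, protocol, and markup language.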
This is that Berners-Lee projected his will that the Web be open and free. Its core layers should be free of patent and copyright attachments. Service and access should, as a first principle, be equal everywhere. He convinced many others to share and implement these aspects of his will.
Often an idea is invented multiple times. Often ideas remain just ideas because the technology is not ready to implement them. It’s curious to note that Berners-Lee, Steve Jobs, and Bill Gates share the same birth year: 1955. One wonders: did this help make things happen? Was there some confluence that made the ideas for the WWW all come together?
Ken notes that “As We May Think” is a 1945 essay by Vannevar Bush. Many of the ideas expressed by Bush are basic to the current WWW. Of course it was written ten years before Berners-Lee was even born, so it is not surprising that Bush did not invent the web.
Quoting Wikipedia:
“As We May Think” predicted (to some extent) many kinds of technology invented after its publication, including hypertext, personal computers, the Internet, the World Wide Web, speech recognition, and online encyclopedias such as Wikipedia: “Wholly new forms of encyclopedias will appear, ready-made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified.” Bush envisioned the ability to retrieve several articles or pictures on one screen, with the possibility of writing comments that could be stored and recalled together. He believed people would create links between related articles, thus mapping the thought process and path of each user and saving it for others to experience. Wikipedia is one example of how this vision has been realized, allowing users to link words to other related topics, while browser user history maps the trails of the various possible paths of interaction.
We applaud ACM for selecting this year’s winner—Sir Tim Berners-Lee. There are perhaps too many great researchers around for all to get the recognition they deserve. Oh well. In any event Dick and I thank Berners-Lee and everyone who made the WWW possible. Even as I (Ken) wait at JFK I can still help get this writing done, and can interact with Dick. Thanks to all who continue to make this work so well.
Could we go the way of telegraph operators?
Pixabay source
Lofa Polir has sent us some new information that will have widespread ramifications for math and theory and science in general.
Today Ken and I wish to comment on this information.
Polir is sure that this information is correct. If he is right, the consequences for all will be immense.
His information is based on recent work of Sebastian Thrun—one of the world’s experts in machine learning. This week’s New Yorker has a featured article in which Thrun’s work on replacing doctors who diagnose skin diseases is presented. The article describes him thus:
Thrun, who grew up in Germany, is lean, with a shaved head and an air of comic exuberance; he looks like some fantastical fusion of Michel Foucault and Mr. Bean. Formerly a professor at Stanford, where he directed the Artificial Intelligence Lab, Thrun had gone off to start Google X, directing work on self-learning robots and driverless cars.
Thrun’s work is really interesting, and he has stated that medical schools should stop teaching doctors to read X-rays and other images, since robotic systems will soon be better at this. His system for skin images already beats expert doctors at detecting abnormal growths.
But this project along with his others is a smokescreen for his most important project, claims Polir. Thrun has put together a double-secret project that has been running for over five years. The project’s goal is: the automation of math and other sciences. Thrun predicts—well, let’s take a look at what he is doing first.
Thrun’s project is to use machine-learning methods to build a system that can outperform us in doing science of all kinds. It requires huge amounts of data, and he has access to that via the web. The strategy exactly parallels how Google DeepMind’s AlphaGo program was trained. Quoting our friends on Wikipedia regarding the latter:
The system’s neural networks were initially bootstrapped from human gameplay expertise. AlphaGo was initially trained to mimic human play by attempting to match the moves of expert players from recorded historical games, using a database of around 30 million moves. Once it had reached a certain degree of proficiency, it was trained further by being set to play large numbers of games against other instances of itself, using reinforcement learning to improve its play.
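That recipe—improving from scratch by playing against yourself—can be shown in miniature. The sketch below is entirely our toy illustration (nothing from Thrun or DeepMind): Monte Carlo self-play value learning on one-pile Nim, where each move takes 1–3 sticks and taking the last stick wins. With enough episodes it rediscovers the classical strategy of always leaving the opponent a multiple of four:

```python
import random

random.seed(0)
N = 12  # starting pile; legal moves take 1, 2, or 3 sticks
Q = {(s, a): 0.0 for s in range(1, N + 1) for a in (1, 2, 3) if a <= s}

def moves(s):
    return [a for a in (1, 2, 3) if a <= s]

def greedy(s):
    return max(moves(s), key=lambda a: Q[(s, a)])

alpha, eps = 0.1, 0.2  # learning rate and exploration probability
for _ in range(100_000):
    s, history = N, []
    while s > 0:  # the two "players" alternate moves and share one Q-table
        a = random.choice(moves(s)) if random.random() < eps else greedy(s)
        history.append((s, a))
        s -= a
    reward = 1.0  # whoever took the last stick wins; signs alternate backward
    for st, ac in reversed(history):
        Q[(st, ac)] += alpha * (reward - Q[(st, ac)])
        reward = -reward

# Self-play rediscovers the theory: always leave a multiple of 4.
print(greedy(5), greedy(6), greedy(7))
```

No human games are ever consulted: the only signal is who wins each self-played game, which is the essence of the bootstrapping step in the quoted passage.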
In place of reading and digesting master games of Go, Thrun’s system reads and digests scientific papers. The ability to have his algorithm “read” all papers in science is the secret:
Thrun points out that mathematicians in their lifetime may read and understand thousands of papers, but his system is capable of understanding millions of papers.
This ability is one of the reasons his algorithm will outperform us. Another is it can use immense computational power 24/7. It never needs to sleep or rest. Polir claims that Google has made an entire secret data center of over a billion CPU cores available to this project. In a closed agreement with the University of Wisconsin, the center is housed in the new IceCube neutrino observatory. Polir justifies revealing this on grounds it should be obvious—would they really hew out a cubic kilometer of ice in Antarctica just to observe neutrinos and ignore the cooling cost benefits of placing a huge processing center in the cavity?
Old-time theorem provers used lots of axiom and proof rules. This kind of approach can only go yea-far. Homotopy type theory, which tries out a more topological approach, provided part of the inspiration to Thrun that he could find a better way. Another part was Roger Penrose’s argument that humans are less blocked by Kurt Gödel’s Incompleteness Theorems than logic-based systems are. So Thrun was spurred to start by making his machine learn from humans, much like AlphaGo.
In the New Yorker article—with extra information gleaned by Polir—Thrun describes the situation this way:
“Imagine an old-fashioned program to identify a dog,” he said. “A software engineer would write a thousand if-then-else statements: if it has ears, and a snout, and has hair, and is not a rat . . . and so forth, ad infinitum. But that’s not how a child learns to identify a dog, of course.” Logic-based proof systems work the same way, but that’s not really how we go about identifying a proof. Who checks modus ponens on every line? “The machine-learning algorithm, like the child, pulls information from a training set that has been classified. Here’s a dog, and here’s not a dog. It then extracts features from one set versus another.” Or like a grad student it learns: here’s a proof, and here’s not a proof. And, by testing itself against hundreds and thousands of theorems and proofs, it begins to create its own way to recognize a proof—again, the way a grad student does. It just knows how to do it.
Polir confirmed that Thrun’s machine first runs the papers through the kind of “Lint”-like module we posted about. This is not only a data-cleaning step but also primes the reinforcement-learning module on the mathematical and scientific content.
Then comes a Monte Carlo phase in which the system randomly generates alternative proofs of lemmas in the papers and scores the proofs for economy and clarity. This completes the automated paper-rewriting level of their service, which is under negotiations with Springer-Verlag and Elsevier and other academic publishers for deals that may assure steady funding of the larger project. Finally, the results of these runs are input into the deep-learning stack, which infers the kinds of moves that are most likely to lead to correct proofs and profitable discoveries.
One of the predictions Thrun makes is that, as with doctors, we may soon need to rethink training students to get PhDs in math. He goes on to raise the idea that the machine will make such basic discoveries that it will win Nobel Prizes in the future.
The results of Thrun’s project are so far secret, and it is likely that he will deny that it is happening right now. But Polir found out one example of what has been accomplished already.
Particle physics of the Standard Model uses quite a few elementary particles. See this for a discussion.
These 31 elementary particles are the most fundamental constituents of the universe. They are not, as far as we know, made up of other particles. The proton, for example, is not an elementary particle, because it is made up of three quarks, whereas the electron is an elementary particle, because it seems to have no internal structure.
Although the Standard Model has worked impeccably in practice, it has higher complexity than physicists have expected from a bedrock theory of nature. The complexity comes from the large number of particles and the large number of constants that the model cannot predict.
A cluster of Thrun’s dedicated machines has already found a new model that reduces the number of elementary particles from 31 to 7. The code name for the cluster and its model, in homage to AlphaGo, is AlphaO. The AlphaO model is claimed to still make all the same predictions as the standard one, but the reduction in undetermined constants could be immensely important.
Is Polir fooling? He may be and not be at the same time. If you had told us a year-plus ago that AlphaGo would wipe out the world’s best Go players 60-0 in online semi-rapid games, we would have cried fool. The AlphaGo project is an example of a machine coming from nowhere to become the best in a game that people thought was beyond the ability of machines. Could it be soon the same with AlphaO? We will see.