Cropped and combined from src1, src2. |
Michaël Rao and Marjorie Rice are linked in this month’s news. Rao has just released a paper (see also slides and code here) completing the catalog of convex polygons that tile the plane. Rice, who passed away on July 2 (obit), had expanded the pentagon catalog from 9 to 13 while working in her kitchen in 1975. Rolf Stein found a fourteenth in 1985 and Casey Mann led a team of programmers to find a fifteenth in 2015. Rao has closed the book at 15.
Today Dick and I hail their accomplishments, which we noted from two articles by Natalie Wolchover in Quanta this past Tuesday. We also emphasize some related problems.
We especially are impressed by Rice, who was a true amateur. She had no advanced training in mathematics of any kind. After reading a 1975 Scientific American article on tessellations she started her search for new types. She succeeded and found ones that had been missed by everyone—that includes Johannes Kepler, who worked on tessellations in 1619. She maintained a website named "Intriguing Tessellations."
In recent decades, computers have become an essential tool. This raises the possibility of a new kind of amateur: one who can code. Computing power is more accessible than ever before. The fact that having advanced degrees doesn’t make your code run faster levels the playing field. As it happens, Mann led a team that included a student and Rao wrote his own code.
If you draw any triangle in the plane, then you can place a 180°-rotated copy against it on an edge to make a parallelogram. That can be replicated to make an infinite strip, and those strips complete a tiling of the plane. That tiling is periodic with only two different orientations of the triangle.
A little more thought will tell you that any quadrilateral—not just a parallelogram—can be made to tile the plane. The reason is that the four interior angles add up to 360°—even if the quadrilateral is not convex. Make three copies and orient them so that the four different angles come together at a corner of the original and its two adjoining edges are shared. Then the same orientations work at the opposite corner, and this suffices to see that the clump of four tiles the whole plane; indeed, two of them do.
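This angle-sum reasoning can be checked numerically. Here is a small Python sketch (my own illustration) that computes the interior-angle sum of a simple polygon from turning angles; note that it gives 360° even for a non-convex quadrilateral:

```python
import math

def interior_angle_sum(poly):
    """Sum of interior angles (degrees) of a simple polygon given
    counterclockwise as a list of (x, y) vertices."""
    n = len(poly)
    total_turn = 0.0
    for i in range(n):
        ax, ay = poly[i]
        bx, by = poly[(i + 1) % n]
        cx, cy = poly[(i + 2) % n]
        # edge vectors b-a and c-b; signed turning angle between them
        ux, uy = bx - ax, by - ay
        vx, vy = cx - bx, cy - by
        total_turn += math.atan2(ux * vy - uy * vx, ux * vx + uy * vy)
    # for a simple CCW polygon the turns total 2*pi, so the interior
    # angles sum to n*pi - 2*pi = (n-2)*pi
    return math.degrees(n * math.pi - total_turn)

# a non-convex ("dart"-like) quadrilateral still sums to 360 degrees
quad = [(0, 0), (4, 0), (1, 1), (0, 3)]
print(interior_angle_sum(quad))  # approximately 360.0
```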
Combined from Math and the Art of M. C. Escher wiki source. |
Convex polygons with 7 or more sides cannot tile the plane—whatever their shape—because their interior angles are so large that the average number of polygons meeting at a vertex would fall below 3. Regular hexagons can tile, of course. Karl Reinhardt in 1918 showed that convex hexagons can tile in three ways that use 2, 3, or 4 different orientations (the first not being a special case of the third).
That left the case of pentagons. Of course a regular pentagon cannot tile the plane, but ones shaped like a baseball home plate can mesh in a sawtooth pattern using two orientations. You can get this by cutting each strip of a regular hexagonal tiling in half:
Mathematical Tourist source. |
The idea of cutting tiles to make new ones animates the mathematics. The following figure taken from a 2015 story in Britain’s Guardian newspaper shows the now-complete list of distinct pentagonal tilings.
Note that the version of this diagram used in a February 2013 post on Rice had only the 14 tilings known then.
All known tilings by single connected pieces are periodic like wallpaper. There is an algebraic theory of wallpaper symmetries and corresponding groups. Note that pentagonal symmetry is excluded. The periodic clumps in tilings by pentagons, if they have any rotational symmetry other than the full circle, must have one of order 2, 3, 4, or 6.
Flooring, however, can choose to have a radial pattern. Here are two tilings found by Sir Roger Penrose of Oxford with five-fold symmetry that can be extended infinitely far:
Modified from Martin Gardner article source. |
Both use the same two quadrilateral tiles and a special restriction: the two centrally symmetric vertices of one cannot touch a centrally symmetric vertex of the other. Cutting each tile into two triangles facilitates defining a self-recursion that proves how the pattern can be extended infinitely. Our diagram also shows a mutual recursion between the “sun” and “star” patterns.
The limit ratio of the convex "kite" tiles to the concave "dart" tiles in the recursions is the golden ratio, $\phi = \frac{1+\sqrt{5}}{2} \approx 1.618$. To see why, note that each larger orange-bordered kite at right is made of two kites plus two halves of a dart, while the larger darts have just one kite plus two halves of a dart. The recursion thus involves powers of the matrix $\begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}$, whose entries yield consecutive Fibonacci numbers, whose ratio approaches $\phi$. Because $\phi$ is irrational, the tilings are not periodic.
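The kite-dart recursion can be iterated numerically to watch the ratio converge to the golden ratio. Here is a small Python sketch (my own illustration) of the counting:

```python
from fractions import Fraction

def step(kites, darts):
    """One deflation step: each kite yields 2 kites plus 1 dart's worth
    (two half-darts), each dart yields 1 kite plus 1 dart's worth."""
    return 2 * kites + darts, kites + darts

k, d = 1, 1
for _ in range(20):
    k, d = step(k, d)

print(k, d, Fraction(k, d))   # consecutive odd-index Fibonacci numbers

phi = (1 + 5 ** 0.5) / 2
print(abs(k / d - phi))       # very small: the ratio converges to phi
```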
Penrose found a related tiling using two convex shapes—a thin lozenge and a fatter one—with a similar restriction that enforces aperiodicity. John Conway suggested enforcing these restrictions by matching up colored lines, as illustrated by the entrance to Oxford’s new Mathematical Institute building. This is my own photo from two years ago:
The restriction can be enforced without markings by notching the sides allowed to match up in the manner of jigsaw puzzle pieces, but this creates non-convex polygons. Robert Ammann found a way to cut Penrose’s lozenges and assemble them into three convex polygons that can only tile aperiodically. This figure from a paper last year by Teruhisa Sugimoto shows how:
Here is a color version of Ammann’s tiling posted by John Lindner. It too could be an attractive floor, but how about as a kitchen counter?
Two of Ammann’s tiles are pentagonal. Can a single pentagon carry out an aperiodic tiling? That question may have boarded the train of Rice’s thinking as she worked at her kitchen counter. It took until Sugimoto’s paper to prove this impossible when the pentagons must share entire edges. Rao’s result completes a definite no answer.
Understanding tilings of the plane is an intriguing mathematical problem. Finding new ones, as Rice did, requires cleverness and insight. Showing that certain types of tilings are impossible, as Rao did, requires another type of cleverness: the ability to prove that something cannot exist. This is interesting because it reminds us of lower bound problems that are well documented to be difficult.
Tilings are special compared to other classification problems—that is, problems of showing that a given list consists of all ways to create some mathematical object. They are different because tilings can be used to build real objects. One measure of this is that there are a number of patents for various types of tilings. Penrose thought to patent his tiles before publicizing them. We quote the introduction to his US patent 4133152, titled "Set of tiles for covering a surface":
[The field of this invention] has found practical application not only to the design of paving and wall-coverings but also in the production of toys and games. In both instances, not only is the purely geometric aspect of complete covering of the surface of importance, but the esthetic appeal of the completed tessellation has equal significance in the eye of the beholder. … [T]he pattern which they form is necessarily non-repetitive, giving a considerable esthetic appeal to the eye.
We especially like his equal regard for esthetics. What emerged in greater force than even he may have imagined—while expressly thinking of crystals—is how strongly Nature shares this regard. Dan Shechtman received the 2011 Nobel Prize in Chemistry for discovering quasicrystals. We covered some of this history going back to Hao Wang’s first proof of the existence of finite sets of non-convex tiles that can only tessellate aperiodically in a post four years ago. It is also neat that this was initially a consequence of Wang’s proof that whether a given finite set of tiles can tessellate is undecidable, because by compactness, if the only tessellations are periodic then this fact is detectable in finite time.
If you don’t insist on convex tiles for your kitchen counter, then the question of aperiodic tilings by one piece remains open. This is called the einstein problem. The name is not for Albert Einstein but derives from ein Stein being German for one stone. German uses “Stein” in many game contexts (besides Go) where we in English say “piece.”
Joan Taylor of Tasmania, another amateur mathematician, discovered in early 2010 that a single hexagonal tile could be forced to tile hierarchically—and only aperiodically—if a more complicated set of marking rules is stipulated. These rules cannot be enforced by jigsaw notching. However, followup work with Joshua Socolar discovered how to realize the hierarchical scheme by a single non-connected tile:
The three lines are just for show; the shape alone enforces the structure which the lines make clear. This is not quite an einstein—not one stone—but an allusion to Albert is warranted by the combination of cleverness, esthetics, and amazement that this fact brings. Taylor maintains a website with other striking designs.
There is also the problem of proving the widely-voiced belief that for square tiles with notches, the set of six discovered by Raphael Robinson has the minimum size to force aperiodicity.
If you prefer to stay with convex tilings, what emerges from Sugimoto’s paper vis-à-vis Ammann’s tiles discussed above is that the following question remains open:
Are there two convex tiles that tessellate but only aperiodically?
We have not even taken time to consider tilings in three (or higher) dimensions, in which a single bi-prism is known to give only tight packings of space that are aperiodic in one of their three dimensions. Whether a single 3D tile can squeeze periodicity out of all three dimensions seems to be open. We have also glossed over whether to allow tiles to be reflected or flipped over as well as rotated, and whether the number of different rotation angles in an aperiodic tiling is infinite, as happens for the irrational twist angles (in degree units) for the bi-prism packings. We invite you, our readers, to contribute your own favorite open tiling problems.
How would you “bet” on the open tiling problems? How would you have bet before 2010? We’ve discussed estimates of how people would bet on open problems in complexity but we have no idea here.
[sourced first two diagrams in section 2]
Combined from source |
Eric Allender and Michael Saks have been leading lights in computing theory for four decades. They have both turned 60 this year. I greatly enjoyed the commemorative workshop held in their honor last January 28–29 at DIMACS on the Rutgers campus.
Today Dick and I salute Eric and Mike on this occasion.
Eric and Mike have been together on the Rutgers faculty since the middle 1980s. I have known Eric since we were both graduate students. We both had papers at the first Structure in Complexity Theory conference in 1986—when it was co-located with STOC at Berkeley—and again at the 1986 ICALP in Rennes, France. I don’t know if I first met Mike at the Berkeley conferences or a couple years later at FOCS 1988 in White Plains, New York. The “Structures” conference was renamed CCC for “Computational Complexity Conference” in 1996. The 2017 conference starts tomorrow in Riga, Latvia.
Mike also has recently been named an ACM Fellow, joining Eric on that illustrious roster. Mike’s primary appointment is in Mathematics and he is currently serving as Chair. Eric was Chair of Computer Science at Rutgers not long ago. I have somehow managed not to do a paper with Eric—nor has Lance Fortnow, as Lance noted during his talk—but we collaborated with Michael Loui on three book chapters covering complexity theory in the CRC Algorithms and Computation Theory Handbook. I did write a paper with Eric’s student Martin Strauss, who along with Michal Koucký—another of Eric’s students—organized the workshop.
One hears all the time about movements and generations in art and music but they happen also in science. We’ve discussed the “AI Winter” among other things. Complexity theory is no exception.
Eric and I were among the avant-garde of “Structural Complexity.” The idea was to understand common features of complexity classes and reductions between problems, apart from specific features of the problems in isolation. In part this was a reaction to how direct analysis of problems had not only failed to resolve $\mathsf{P}$ versus $\mathsf{NP}$ in the 1970s, but had met barriers in the form of oracle results that applied in similar ways across many levels of classes.
Eric’s paper at STOC 1987, titled “Some Consequences of the Existence of Pseudorandom Generators,” kicked off his study and use of multiple forms of Kolmogorov Complexity (KC). We have covered this in Eric’s research before. The “structural” flavor of KC comes from how it avoids referencing specific combinatorial structures like graphs or formulas or set systems and how it can be applied at many levels of complexity.
It seems fair to say the structural approach clarified and systematized many questions of complexity theory but did not resolve many on its own. But it was great for fashioning molds into which combinatorial arguments can be injected. Eric’s signature lower bound with his student Vivek Gore, separating uniform-$\mathsf{ACC}^0$ from $\mathsf{PP}$, employs these ‘structural’ ingredients:
A similar list can be made for Ryan Williams’s separation of $\mathsf{NEXP}$ from nonuniform $\mathsf{ACC}^0$: oracles again employed constructively; succinctness; quasi-linear time complexity as a stepping-stone; and probabilistic simulation of AND/OR using modular counting—for which Allender-Gore is cited. The use of symmetric-function gates in both papers might be deemed more “combinatorial” but this still goes toward my point here.
That said, Mike has always represented more the “combinatorial” side—indeed, his first three journal papers after joining Rutgers appeared in the journal Combinatorica. Further, the workshop emblem adds up to “Eric + Mike = Complexity and Combinatorics.” The two flavors were evident in the series of talks chosen to honor each and both.
All but one talk has video online. DIMACS Director Rebecca Wright gave a welcoming introduction.
Avi Wigderson spoke on “Brascamp-Lieb Inequalities, Operator Scaling, and Optimization.” After relating why his and Mike’s families are close, he set the tone by telling what led Mike into combinatorics as a graduate student in mathematics:
“When he was near the algebraists, the typical conversation he would hear is, ‘Remember this really extremely general result I proved last year? Well, guess what—I can now generalize it even more and I can prove an even more general one.’ [But] then when he hung around with the combinatorialists, he would hear, ‘You remember this extremely trivial problem that I could not solve last year? Well, I found a special case that I can still not solve.’ So he decided, ‘that’s for me.’
Well … Mike still has some affinity to the algebraic side, … and has this tendency whenever he is facing a problem the first thing to do is to generalize it just below the point where it becomes false, and then scale it a bit.”
Avi then introduced the technical part by saying that he was led into it by the Polynomial Identity Testing (PIT) problem, “a problem that Eric cares about, Mike cares about, lots of people care about”:
“I just want to mention that Mike and I spent five years on [PIT], meeting every week in a café for the day. We had lots of great ideas that ended up with nothing. I think that’s the story with a lot of other people.”
Avi could have appended to that last sentence, “…and in complexity theory in general.” Thus C & C are married to a hard bed. The main body of his talk was about testing inequalities, where things can be done.
Harry Buhrman spoke on “Computing With Nearly Full Space.” We covered this work with a different slant here. Harry’s first six minutes featured many stories and photos of conferences and meetings with Eric and Mike.
Meena Mahajan spoke on “Enumerator Polynomials: Completeness and Intermediate Complexity.” Although she began by saying she mainly knew Eric, having invited him to India and vice-versa, her talk was highly combinatorial involving polynomial enumerators for cliques, Hamiltonian circuits, and much else including projections in real space.
Clifford Smyth spoke on “Restricted Stirling and Lah numbers and their inverses.” This involved the problem of computing (the signs of) entries of certain inverse matrices without having to do the whole inversion.
Yi Li spoke on “Estimating the Schatten Norm of Matrices in Streaming Models.” He started with a problem about $n$-dimensional real vectors $x$: Starting with $x = 0$, you get sequential updates $x_i \mathrel{+}= \delta$ to the components of $x$. You want to maintain an estimate of a function $f(x)$ to within a $1 \pm \epsilon$ factor without using space proportional to $n$—ideally, using space polylog in $n$. He then took this to the case of matrices and described solvable and hard cases.
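The streaming setup can be made concrete with the classic Alon-Matias-Szegedy second-moment sketch (my own illustrative example, not something from the talk). A streaming algorithm would keep just one random-sign counter $z$, updated in constant space per update; the toy code below instead averages $z^2$ over every sign vector to confirm exactly that its expectation equals $\|x\|_2^2$:

```python
from itertools import product

def f2_sketch_unbiased(x):
    """AMS-style sketch for F2 = sum of x_i^2: keep z = sum of s_i * x_i
    for random signs s_i; then E[z^2] = F2.  Here we average over *all*
    sign vectors to verify unbiasedness exactly.  A real stream would
    keep only one z, updated as z += s_i * delta per update."""
    n = len(x)
    total = 0
    for signs in product((-1, 1), repeat=n):
        z = sum(s * xi for s, xi in zip(signs, x))
        total += z * z
    return total / 2 ** n

x = [3, -1, 4, 1]
print(f2_sketch_unbiased(x), sum(xi * xi for xi in x))  # -> 27.0 27
```

The cross terms $s_i s_j x_i x_j$ cancel in the average, which is why one counter (or a few, for concentration) estimates the norm without storing $x$.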
Mary Wootters—who along with Yi Li did her PhD under Martin—closed the first day by speaking on “Repairing Reed-Solomon Codes.” This, from a joint paper with Venkat Guruswami, was my favorite talk. The basic problem is deliciously simple: Given an unknown polynomial $f$ of degree less than $k$ over a field $\mathbb{F}_q$ with $q > k$, and an argument $\alpha^*$, we want to compute $f(\alpha^*)$. A random set of $k$ values $f(\alpha_1),\dots,f(\alpha_k)$ suffices to compute any $f(\alpha^*)$ by interpolation. Having only $k-1$ values is never enough. Each value has $\log q$ bits. Do we really need all $k \log q$ bits of the $k$ values? Mary gave cases where, amazingly, getting samples of far fewer total bits from the values is enough. The bits sent by each node may be computed locally from its own value but not with communication from any other node.
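The baseline fact, that $k$ whole values determine $f(\alpha^*)$, is just Lagrange interpolation. Here is a small Python sketch over a prime field (my own illustration; the point of the repair problem is to get away with fewer than the $k \log q$ bits this baseline reads):

```python
def interpolate_at(points, alpha_star, p):
    """Lagrange interpolation over GF(p): given k points (alpha, f(alpha))
    of a polynomial of degree < k, return f(alpha_star) mod p."""
    result = 0
    for i, (ai, yi) in enumerate(points):
        num, den = 1, 1
        for j, (aj, _) in enumerate(points):
            if i != j:
                num = num * (alpha_star - aj) % p
                den = den * (ai - aj) % p
        # pow(den, -1, p) is the modular inverse (Python 3.8+)
        result = (result + yi * num * pow(den, -1, p)) % p
    return result

p = 101                                      # small prime field GF(101)
f = lambda t: (7 * t * t + 3 * t + 9) % p    # "unknown" degree-2 polynomial
points = [(a, f(a)) for a in (1, 2, 3)]      # k = 3 whole values
print(interpolate_at(points, 10, p), f(10))  # -> 32 32
```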
The conference dinner had several speeches and toasts and a joint birthday cake.
Neil Immerman spoke on “Algebra, Logic and Complexity.” He began by noting that he met Eric at the same joint STOC and Structures meeting in 1986 which I mentioned above. He started with how the descriptive complexity program refined notions of reductions to make them very sharp, culminating with uniform-$\mathsf{AC}^0$ reductions formalized via first-order logic. This covered his 2009 paper with Eric and three others, showing that under standard complexity assumptions there are exactly six equivalence classes of Boolean-based constraint satisfaction problems under isomorphisms.
Nati Linial spoke on “Hypertrees.” He has been Mike’s most frequent collaborator—21 joint papers and counting—and vice-versa. He related how they have visited each other often since they were post-docs together at UCLA. The talk involved large matrices whose nonzero entries are $+1$ or $-1$.
Toniann Pitassi was supposed to speak on “Strongly Exponential Lower Bounds for Monotone Computation” but she was unable to travel at the last minute.
Ramamohan Paturi spoke about “Satisfiability Algorithms and Circuit Lower Bounds.” This covered Mike’s algorithmic ideas in the famous “PPSZ” paper, “An Improved Exponential-Time Algorithm for $k$-SAT.”
Lance Fortnow closed the technical part of the meeting by talking about “Connecting Randomness and Complexity.” He started by noting that he and Eric have gone a combined 61-for-62 in attending the Structures/Complexity conferences, Lance having missed only 2012 in Porto, and told more personal stories. His talk covered Eric’s work involving degrees of Kolmogorov complexity-based randomness, which I’ve noted above.
I had to drive back to Buffalo early the next day, so I missed the festivities on the second evening and a third-day brunch at Eric’s house. Overall it was a really nice and convivial time. It was great seeing friends again, and one conversation in particular has proved valuable to me since then: I heard Eric and Harry and Michal and Jack Lutz and perhaps Mario Szegedy or Mohan Paturi talking about how Kolmogorov complexity is “not so concrete.” Without giving the actual details I outlined a practical case where one would want a definite, concrete measure. I thank DIMACS for sponsorship and the organizers for putting together a great event.
I’ve just now returned from Poland, whose classic toast “Sto lat!” means, “May you live one hundred years.” Accordingly we wish that $C$ may come to denote their ages in Roman numerals.
[fixed two names and added note about Li being Martin’s student too]
Oded Goldreich is one of the top researchers in cryptography, randomness, and complexity theory.
Today Ken and I wish to thank the Knuth Prize Committee for selecting Oded as the winner of the 2017 Knuth Prize.
It is no doubt a wonderful choice, a choice that rewards many great results, and a choice that is terrific. Congrats to Oded. This year the choice was only announced to the general public at the last minute. Ken and I at GLL got an encrypted message that allowed us to figure it out ahead of time. The message was: YXWX APRN LKW CRTLK DHPFW. The encryption method is based on a code with a vast number of keys, and so was almost unbreakable. But we did it.
Oded gave his talk last night to a filled ballroom: one of the perks of winning the Knuth Prize. I had sent him congrats as soon as I heard he had won and added I looked forward to his talk. He answered essentially “thanks for increasing the pressure on me.” I know he was kidding since he always gives great talks.
I just heard the talk; and he delivered, with his usual mixture of fun and seriousness. The talk had two parts. The first started with some apologies.
He added some wonderful comments like: “I had some jokes but I forgot them.” This brought the house down—we theory people just love diagonal arguments.
This part continued with some interesting comments on the nature of Theory. Some of it was advice to junior members and some advice to senior members. My favorites were:
I like these suggestions very much. I have more than once been on the receiving end of “but it is so simple.” I would like to think that I rarely have said that to someone else.
Oded then moved on to the technical part of his talk. I personally liked the first part very much and would have loved to hear more of his comments of this nature.
But Oded wanted to use this talk to also highlight some very interesting new results on proof systems. Here he spoke about On Doubly-Efficient Interactive Proof Systems. He introduced the idea by using the movie When Night is Falling, a Canadian film from 1995 involving Petra and Camille. My wife, Kathryn Farley, who was sitting next to me during the talk, immediately whispered to me: “what a wonderful movie” as soon as Oded put a picture of Petra and Camille on the screen. We all have our own expertise.
See here for a paper on the subject, joint with Guy Rothblum. I think we will report on this material in more detail in the future, but here is part of their abstract:
A proof system is called doubly-efficient if the prescribed prover strategy can be implemented in polynomial-time and the verifier’s strategy can be implemented in almost-linear-time. We present direct constructions of doubly-efficient interactive proof systems for problems in P that are believed to have relatively high complexity. Specifically, such constructions are presented for t-CLIQUE and t-SUM. In addition, we present a generic construction of such proof systems for a natural class that contains both problems and is in NC (and also in SC).
Again congrats to Oded. Any thoughts of how the message to Ken and me was encoded?
[some word fixes]
Géraud Sénizergues proved in 1997 that equivalence of deterministic pushdown automata (DPDAs) is decidable. Solving this decades-open problem won him the 2002 Gödel Prize.
Today Ken and I want to ponder how theory of computing (TOC) has changed over the years and where it is headed.
Of course we have some idea of how it has changed over the years, since we both have worked in TOC for decades, but the future is a bit more difficult to tell. Actually the future is also safer: people may feel left out and disagree about the past, but the future is yet to happen so who could be left out?
For example, we might represent the past by the following table of basic decision problems involving automata such as one might teach in an intro theory course. The result by Sénizergues filled in what had been the last unknown box:
Problem/machine | DFA | NFA | DPDA | NPDA | DLBA | DTM
Does $M$ accept $x$? | In P | In P | In P | In P | PSPC | Undec.
Is $L(M) = \emptyset$? | In P | In P | In P | In P | Undec. | Undec.
Is $L(M_1) \cap L(M_2) = \emptyset$? | In P | In P | Undec. | Undec. | Undec. | Undec.
Is $L(M) = \Sigma^*$? | In P | PSPC | In P | Undec. | Undec. | Undec.
Is $L(M_1) = L(M_2)$? | In P | PSPC | Decidable | Undec. | Undec. | Undec.
Here ‘PSPC’ means $\mathsf{PSPACE}$-complete. This table is central but leaves out whole fields of important theory.
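Several of the “In P” entries for DFAs come from product constructions. As a small illustration of my own, here is Python code deciding DFA equivalence by breadth-first search over the product automaton:

```python
from collections import deque

def dfa_equivalent(d1, d2):
    """Decide L(M1) = L(M2) for DFAs in polynomial time by BFS over the
    product automaton.  Each DFA is (start, accepting_set, delta) with
    delta[(state, symbol)] -> state; both use the same alphabet."""
    s1, acc1, t1 = d1
    s2, acc2, t2 = d2
    alphabet = {sym for (_, sym) in t1}
    seen = {(s1, s2)}
    queue = deque([(s1, s2)])
    while queue:
        q1, q2 = queue.popleft()
        if (q1 in acc1) != (q2 in acc2):
            return False            # a distinguishing string reaches here
        for sym in alphabet:
            nxt = (t1[(q1, sym)], t2[(q2, sym)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

# m1 and m2 both accept strings over {a,b} with an even number of a's
m1 = (0, {0}, {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 0, (1, 'b'): 1})
m2 = (0, {0}, {(0, 'a'): 2, (0, 'b'): 0, (2, 'a'): 0, (2, 'b'): 2})
m3 = (0, {1}, {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 0, (1, 'b'): 1})  # odd
print(dfa_equivalent(m1, m2), dfa_equivalent(m1, m3))  # -> True False
```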
At the Theory Fest this June—which we mentioned here—there will be a panel on the future of TOC. We will try to guess what they will say.
Of course we don’t know what the panel will say. They don’t necessarily give statements ahead of time like in some Senate hearings. But we can get a hint from the subjects and titles of some of the invited plenary talks, which are the last afternoon session each day:
We salute Ken’s colleague Atri among the speakers. There is also a keynote by Orna Kupferman titled, “Examining classical graph-theory problems from the viewpoint of formal-verification methods.” And there is one by Avi Wigderson titled “On the Nature and Future of ToC”—which is the subject of this post.
We can get a fix on the present by looking at the regular papers in the conference program. But like Avi we want to try to gauge the future. One clear message of the above range of talks is that it will be diverse. But to say more about how theory is changing we take another look at the past.
We can divide the changes in TOC into two parts. One is a change in the kinds of questions we ask, and the other is a change in what we accept as solutions.
Years ago, most of the questions we considered were basic questions about strings and other fundamental objects of computing. A classic example was one of the favorite problems of Zeke Zalcstein, my mentor: the star height problem. You probably do not know it—I knew it once and still had to look it up. Here is a definition:
Lawrence Eggan seems to have been the first to raise the following questions formally, in 1963:
Regarding the first question, at first it wasn’t even known whether star height ever needed to be greater than $1$. There are contexts in which one level of nesting suffices, most notably the theorem that one while-loop suffices for any program. Eggan proved, however, that star height is unbounded, and in 1966, François Dejean and Marcel Schützenberger showed this for languages over a binary alphabet.
The second question became a noted open problem until Kosaburo Hashiguchi proved it decidable in 1988. His algorithm was not even elementary—that is, its running time was not bounded by any fixed stack of exponentials in the size of the expression—but Daniel Kirsten in 2005 improved it to double exponential space, hence at worst triple exponential time. The problem is known to be $\mathsf{PSPACE}$-hard, so we might hope only faintly for a runnable algorithm, but special cases (including ones involving groups that interested Zeke) may be tractable. Narrowing the gap is open and interesting but likely to be difficult.
Do you wish you could travel back to the early 1960s to work on the original problems? Well, basically you can: Just add a complementation operator and define it to leave star-height unchanged. Then the resulting generalized star-height problem is wide open, even regarding whether height $1$ always suffices. To see why it is trickier, note that over the alphabet $\{a,b\}$,

$(ab)^* = \overline{\,b\Sigma^* \cup \Sigma^* a \cup \Sigma^* aa \Sigma^* \cup \Sigma^* bb \Sigma^*\,},$

so that language has generalized star-height $0$. Whereas, $(aa)^*$ does not—it needs the one star. See this 1992 paper and these recent slides for more.
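One can confirm by brute force that $(ab)^*$ is captured without stars by forbidding a leading $b$, a trailing $a$, and the factors $aa$ and $bb$. Here is a small Python check (my own illustration):

```python
from itertools import product

def in_ab_star(s):
    """Membership in (ab)* by direct definition."""
    return len(s) % 2 == 0 and all(
        s[i] == ('a' if i % 2 == 0 else 'b') for i in range(len(s)))

def in_star_free_form(s):
    """Membership via the star-free (complement-based) expression:
    NOT (starts with b, OR ends with a, OR contains aa, OR contains bb)."""
    return not (s.startswith('b') or s.endswith('a')
                or 'aa' in s or 'bb' in s)

# exhaustively confirm the two definitions agree on all short strings
for n in range(9):
    for tup in product('ab', repeat=n):
        s = ''.join(tup)
        assert in_ab_star(s) == in_star_free_form(s)
print("agree on all strings up to length 8")
```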
Diversifying areas are certainly giving us new domains of questions to attack. Often the new problem is an old problem with a new application. For instance, Google’s PageRank algorithm derives from the theory of random walks on graphs, as we noted here.
The novelty we find it most fruitful to realize, however, comes from changes in what we regard as solutions—the second point at the head of the last section. We used to demand exact solutions and measure worst-case complexity. Now we allow various grades of approximation. Answers may be contingent on conjectures. For example, edit distance requires quadratic time unless the Strong Exponential Time Hypothesis is false—but some approximations to it run in nearly linear time. We have talked at length about such contingencies in crypto.
A nice survey in Nature by Ashley Montanaro shows the progression within the limited field of quantum algorithms. In the classical worst-case sense, it is said that there aren’t many quantum algorithms. For a long time the “big three” were the algorithms by Peter Shor and Lov Grover and the ability of quantum computers to simulate quantum many-body problems and quantum physics in general. Quantum walks became a fourth and linear algebra a fifth, but as Montanaro notes, the latter needs changing what we consider a solution to a linear system $Ax = b$ where $A$ is $N \times N$. You don’t get the vector $x$; rather, you get a quantum state $|x\rangle$ that approximates $x$ over a space of $\log N$ qubits. The approximation is good enough to answer some predicates with high probability, such as whether the same $x$ solves another system $A'x = b'$. You lose exactness, but what you gain is running time that is polynomial in $\log N$ rather than in $N$. A big gain is that $N$ is now allowed to be huge.
The survey goes on to problems with restricted numbers of true qubits, even zero. These problems seem important today because it has been so hard to build real quantum computers with more than a handful of qubits. Beyond the survey there are quantum versions of online algorithms and approximations of those.
If we are willing to change what we consider to be an answer, it follows that we are primed to handle fuzzier questions and goals. Online auctions are a major recent topic, and we have talked about them a little. There are many design goals: fairness, truthfulness, minimizing regret, profitability for one side or the other. Again we note that old classic problems often adapt best to the new contexts, such as stable-marriage graph problems with various new types of constraints.
The old classic problems never go away. What may determine how much they are worked on, however, is how well we can modify what counts as a solution or at least some progress. It seems hard to imagine partial or approximate answers to questions such as, “is logspace equal to polynomial time?”
The problem we began with about equivalence of DPDAs may be a good test case. Sénizergues gave a simple yes-answer to a definite question, but as with star-height, his algorithm is completely hopeless. Now (D)PDAs and grammars have become integral to compression schemes and their analysis—see this or this, for instance. Will that lead to important new cases and forms of the classic problems we started with? See also this 2014 paper for PDA problem refinements and algorithms.
What are your senses of the future of ToC?
The problem of mining text for implications
2016 RSA Conference bio, speech |
Michael Rogers, the head of the National Security Agency, testified before the Senate Intelligence Committee the other day about President Donald Trump. He was joined by the heads of other intelligence agencies, who also testified. Their comments were, as one would expect, widely reported.
In real time, I heard Admiral Rogers’s comments. Then I heard and read the reports about them. I am at best puzzled about what happened.
The various reports all were similar to this:
Adm. Michael S. Rogers, the head of the National Security Agency, also declined to comment on earlier reports that Mr. Trump had asked him to deny possible evidence of collusion between Mr. Trump’s associates and Russian officials. He said he would not discuss specific interactions with the president.
The above quote is accurate—Adm. Rogers did not discuss specific interactions with the president. But I have trouble with this statement. The problem I have is this:
Are statements made in a Senate hearing subject to the basic rules of logic?
For example, if a person says "A" and later says "A implies B," can we conclude that he or she has effectively said "B"?
Let’s look at the testimony of Adm. Rogers. He insisted that he could not recall being pressured to act inappropriately in his almost three years in the post. “I have never been directed to do anything I believe to be illegal, immoral, unethical or inappropriate,” he said.
During his three years as head of the NSA he worked under President Obama and now President Trump. So I see the following logical argument: since he was never asked to do anything wrong during that period, it follows that Trump never asked him to do anything wrong.
This follows from the rule called universal specification or universal elimination: if ∀x P(x) is true, then for any a in the domain it must follow that P(a) is true.
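Universal elimination can be phrased operationally. Here is a minimal Python sketch over a finite domain; the domain and predicate names are hypothetical illustrations, not quotes from the testimony:

```python
# Universal elimination over a finite domain: if "for all x in S, P(x)" is
# true, then P(a) follows for any particular a in S.

presidents = ["Obama", "Trump"]                      # the domain S
asked_anything_improper = {"Obama": False, "Trump": False}

def P(x):
    """P(x): x never asked anything improper."""
    return not asked_anything_improper[x]

universal = all(P(x) for x in presidents)            # "for all x, P(x)"

if universal:
    specific = P("Trump")                            # universal elimination
```

The point of the sketch is only that the specific claim is mechanically entailed by the universal one.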
What is going on here? The reports that he refused to answer whether P(a) is true are correct. But to my mind he asserted the stronger statement that ∀x P(x) is true. Is it misleading reporting? Or do the rules of logic not apply to testimony before Senate committees? Which is a stronger statement:
P(a)   or   ∀x P(x),
where a is an element of the domain?
In mathematics the latter statement is stronger, but it appears not to be so in the real world. The statement P(a) is more direct. What does this say about logic and its role in human discourse?
Ken recalls a course he took in 1979 from the late Manfred Halpern, a professor of politics at Princeton. Titled “Personal and Political Transformation,” the course used a set of notes that became Halpern’s posthumous magnum opus.
The notes asserted that components of human relationships can be classed into eight basic modalities, the first three being paradigms for life: emanation, incoherence, transformation, isolation, subjection, direct bargaining, boundary management, and buffering. The first three form a progression exemplified by Dorothy and the wizard vis-à-vis Glinda and the ruby slippers in The Wizard of Oz; later he added deformation as a ninth mode, a fourth paradigm, and a second progression endpoint. It particularly struck Ken that presenting mathematical proofs is classed as a form of subjection: you can’t argue or bargain with a proof or counterexample.
Buffering completed the list and remained in it. He showed how each member is archetypal in human history and depth psychology. So Ken’s answer is that the one-step-remove of saying “∀x P(x)” rather than “P(a)” is a deeply rooted difference. It leaves wiggle-room that a jury of peers might credit in a pinch.
Psychology aside, the mining of logical inferences is a major application area. Sometimes the inference is outside the text being analyzed, such as when “chatter” is evaluated to tell how far it may imply terrorist threats. We are interested in cases where the deduction is more inside. For instance, consider this example in a 2016 article on the work of Douglas Lenat:
A bat has wings. Because it has wings, a bat can fly. Because a bat can fly, it can travel from place to place.
One might say that underlying this is the logical rule ∀x: HasWings(x) → CanFly(x).
One of the problems, however, is that even if we limit the domain to animals, the rule is false—there are many flightless birds. This leads into the whole area of non-monotonic logic which is a topic for another day—but good to bear in mind when revelations from hearings revise previously-held beliefs.
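A minimal sketch of the non-monotonic flavor, using a default rule plus an explicit exception list (the animal names are illustrative):

```python
# Default rule "winged things fly" with an explicit exception list -- a crude
# stand-in for non-monotonic reasoning.

has_wings = {"bat": True, "sparrow": True, "penguin": True, "kiwi": True,
             "dog": False}
flight_exceptions = {"penguin", "ostrich"}   # flightless despite wings

def can_fly(animal):
    return has_wings.get(animal, False) and animal not in flight_exceptions

assert can_fly("bat")
assert not can_fly("penguin")      # the bare monotone rule would say True

# Learning a new fact retracts an earlier conclusion -- non-monotonicity:
before = can_fly("kiwi")           # True under the default rule
flight_exceptions.add("kiwi")
after = can_fly("kiwi")            # now False
```

Adding a fact shrank the set of conclusions, which is exactly what a monotone logic forbids.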
Ken has been dealing this week with an example at the juncture of the logic of time and human language. He had to evaluate twenty pages of testimony about a recurring behavior B. In one place it states that B occurred at time t and occurred “once again” at time t′. The question is whether one can infer that no occurrence of B fell between t and t′, and apply a rule to that effect.
This was complicated by the document having been translated from a foreign language. Whether time t′ was the next occurrence of B after time t makes a difference to results Ken might give. Of course this may be clarified in a further round of testimony—but we could say the same about Admiral Rogers, and he has left the stand.
How soon will we have apps that can take statements of the form ∀x P(x) and deduce P(a) for a particular a that we want to know about? Will inferences from “material implication” be considered material in testimony?
Update, 11:15pm 6/8/17: CNN has just told of a woman they interviewed about the contradictions between Donald Trump and James Comey. Asked if she believes Comey lied, she replied, “No.” Asked if she believes Trump lied, she replied, “No.” Asked how that could be, she said: “The media has distorted it.” Thus the logical law of excluded middle is replaced by a “law of occluded media,” which blocks constructive inference…
Wikimedia Commons source |
Robert Southey was the Poet Laureate of Britain from 1813 until his death in 1843. He published, anonymously, “The Story of the Three Bears” in 1837.
Today Ken and I want to talk about the state of P versus NP and the relationship to this story.
The story, as I’m sure you know, is about Goldilocks. She has—no surprise—curly blond hair. She enters the home of three bears while they are away. She tries their chairs, eats some of their porridge, and falls asleep on one of their beds. When the bears return she runs away.
What you may not know is that Southey’s original story had not a young girl but an old woman. She is not innocent but furtive, self-serving, and meddlesome: she breaks the little bear’s chair and eats his breakfast. An 1849 retelling by Joseph Cundall changed her into a girl named “Silver-hair” and changed her motives to restlessness and curiosity. Her hair changed to gold around 1868 but she did not acquire the name “Goldilocks” until 1904.
Of course there is no change to our classic problem: Claims continue that there are proofs that P = NP, claims continue that there are proofs that P ≠ NP, and claims continue that P ≠ NP—but without offering any proof. What connection can there be to the Goldilocks story? It is in the telling—the literary rule of three augmented with a total order.
The Goldilocks tale is really one of three-times-three. It has her try three items at each stage: chairs, food, and beds. At each stage of the story, one item is too big-or-hot-or-hard, one is too small-or-cold-or-soft, and one is just right. Then the bears follow the same sequence in discovering her traipsing.
The “just right” aspect has been named the Goldilocks Principle. Christopher Booker’s oft-quoted description of the “dialectical three” goes as follows:
“[T]he first is wrong in one way, the second in another or opposite way, and only the third, in the middle, is just right. … This idea that the way forward lies in finding an exact middle path between opposites is of extraordinary importance in storytelling.”
The Goldilocks Principle however leads, according to this neat history on the LetterPile website, to what it calls the “Goldilocks Syndrome”:
“We are living in consumerism, where big companies non-stop create billions of realities, where everybody … can feel ‘just right.’ … The problem starts when we can’t stop looking for perfect solutions in [this] pretty imperfect world.”
It is not clear whether they have a solution, but they go on to describe and recommend the following “Goldilocks Rule”:
“Balance between known and unknown, risky and risk-free, predictable and unpredictable.”
Our take on all this is: are we trying to be “too perfect” in our approach to P versus NP? Can we profitably strike a new balance?
I have argued recently and before that P = NP is possible, but with an algorithm for SAT, say, that is galactic—see here for our introduction of this term—meaning an algorithm that is completely useless. Here are three perspectives on the power of the two classes.
Lemma 1 There is a constant c so that if A is in NP, then A has a Boolean circuit of size at most n^c.
Note that the consequence P ≠ NP is easy to show: Assume P = NP and that there is such a constant c. Then the polynomial hierarchy collapses into NP, and this contradicts Ravi Kannan’s famous theorem that the polynomial hierarchy has sets that require Boolean circuits of size n^k for any fixed k. (See this 2009 paper for more.) In terms of our theme: P is too weak to be equal to NP—the bowl is too small.
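The contradiction can be laid out step by step. This sketch assumes Kannan's theorem in the form that for every fixed k there is a set in the second level of the hierarchy outside SIZE(n^k):

```latex
\begin{align*}
&\text{Assume } \mathsf{P} = \mathsf{NP} \text{ and } \mathsf{NP} \subseteq \mathrm{SIZE}(n^c).\\
&\mathsf{P} = \mathsf{NP} \implies \Sigma_2^p = \mathsf{NP}
  && \text{(the hierarchy collapses)}\\
&\phantom{\mathsf{P} = \mathsf{NP}} \implies \Sigma_2^p \subseteq \mathrm{SIZE}(n^c)
  && \text{(by the assumed circuit bound)}\\
&\text{Kannan: } (\forall k)\ \Sigma_2^p \not\subseteq \mathrm{SIZE}(n^k)
  && \text{(take } k = c\text{)}\\
&\text{Contradiction. Hence } \mathsf{NP} \subseteq \mathrm{SIZE}(n^c) \implies \mathsf{P} \neq \mathsf{NP}.
\end{align*}
```

So a "small" NP, in the fixed-exponent circuit sense, would already separate it from P.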
We can start by regarding the “three barriers”—relativization, natural proofs, and algebrization—as effects of such masquerading. Then one can focus further on the extent to which NP-objects can be approximated by polynomial-time ones. Many NP-complete problems are easy in the average case under certain natural distributions. We wonder whether the theory can be structured to say that logspace objects, or ones from even smaller classes, cannot approximate so well. An example of a technical issue to overcome is that some NP-hard languages are approximated ultra-simply by the language of all strings.
Of course, it would be a huge breakthrough already to separate NP from uniform TC^0, let alone from logspace or P. We’re suggesting instead to think along these lines:
So, which bowl? Which bed to lie in? Most seem to believe the second is how to prove P ≠ NP but who knows. At least we’re trying not to run away.
The famous front-and-back cover art of the venerable 1979 textbook by John Hopcroft and Jeffrey Ullman is said to depict Cinderella:
We believe Goldilocks fits better: curly hair, wearing boots not slippers, and breaking things. Both of us recall general optimism about solving P versus NP at the time the text was published. Now the artwork seems prophetic on what happens when we tug at the question. Can we get a “middle-way” approach to it up and functioning?
While on the subject of textbooks, we are happy to note that our textbook Quantum Algorithms Via Linear Algebra received a second printing from MIT Press, in which all of our previous errata have been corrected.
UK Independent source—and “a gentle irony” |
Roger Bannister is a British neurologist. He received the first Lifetime Achievement Award from the American Academy of Neurology in 2005. Besides his extensive research and many papers in neurology, his 25 years of revising and expanding the bellwether text Clinical Neurology culminated in his being added as co-author. Oh by the way, he is that Bannister who was the first person timed under 4:00 in a mile race.
Today I cover another case of “Big Data Blues” that has surfaced in my chess work, using a race-timing analogy to make it general.
Sir Roger also served as Master of Pembroke College, one of the constituent colleges of Oxford University. He was one of three august Rogers with whom I interacted about IBM PC-XT computers when the machines were installed at Oxford in 1984–1985. Sir Roger Penrose was among trustees of the Mathematical Institute’s Prizes Fund who granted support for my installation of an XT-based mathematical typesetting system there, a story I’ve told here. Roger Highfield and his secretary used an XT in my college’s office, and I was frequently called in to troubleshoot. While drafting this post last month, I received a mailing from Sir Martin Taylor saying that Dr. Highfield had just passed away—from his obit one can see that he, too, received admission to a royal order.
Dr. Bannister was interested in purchasing several XTs for scientific as well as general purposes at Pembroke. At the time, numerical performance required purchasing a co-processor chip, adding almost $1,000 to what was already a large outlay per machine by today’s standards. I wish I’d thought to say in a quick deadpan voice, “let it run four minutes and it will give you a mile of data.” (Instead, I think the 1954 race never came up in our conversation.) Today, however, data outruns us all. How to keep control of the pace is our topic.
Roger Bannister 50-year commemorative coin. Royal Mint source. |
As shown above in the commemorative coin’s design, the historic 3:59.4 time was recorded on stopwatches. We’ll stay with this older timing technology for our example.
Suppose you have a field of 200 milers. Suppose you also have a box of 50 stopwatches. For each runner you pick a stopwatch at random and measure his/her time. You get results that closely match the histogram of times that were recorded for the same runners in trials the previous day.
How good is this? You can be satisfied that the box of watches does not have a systematic tendency to be slow or to be fast for runners at that mix of levels. Projections based on such fields are valid.
The rub, however, is that you could have gotten your nice fit even if each individual watch is broken and always returns the same time. Suppose your field included Bannister, John Landy, Jim Ryun, and Sebastian Coe, each in his prime. They would probably average close to 3:55. Hence if one of the 50 watches is stuck on 3:55, it will fit them well. It doesn’t matter if you actually draw that watch when measuring the last-place finisher. The point is that you expect to draw each watch 4 times overall and are fitting an aggregate.
Indeed, you only need the distribution of the (stopped) watches to match the distribution of the runners under random draws. You may measure a close fit not only in the quantiles but also the higher moments, which is as good as it gets. Your model may still work fine on tomorrow’s batch of runners. But at the non-aggregate level, what it did in projecting an individual runner was vapid.
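A quick simulation, with made-up numbers, shows how stopped watches can match the aggregate while failing every individual:

```python
import random
random.seed(1)

# 200 runners with "true" mile times in seconds (fabricated distribution).
true_times = [random.gauss(300, 25) for _ in range(200)]

# 50 stopped watches, each frozen at a reading drawn from the same distribution.
watches = [random.gauss(300, 25) for _ in range(50)]

# Each runner is "timed" by a randomly drawn stopped watch.
measured = [random.choice(watches) for _ in true_times]

def mean(xs):
    return sum(xs) / len(xs)

# Aggregate statistics agree closely...
agg_gap = abs(mean(true_times) - mean(measured))

# ...yet each individual "measurement" says nothing about that runner.
per_runner_err = mean([abs(m - t) for m, t in zip(measured, true_times)])
```

On typical runs the aggregate gap is a few seconds while the average per-runner error is on the order of the field's whole spread.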
Here is a hypothetical example from predictive analytics, the domain of my chess model. Consider a model used by a home insurance company to judge the probabilities of damage by earth movement, wind, fire, or flood and price policies accordingly. I’ve seen policies with grainy risk-scale levels that apply to several hundred homes in a given area at one time. The company only needs good performance on such aggregates to earn its profit.
But suppose the model were fine-grained enough to project probabilities on individual homes. And suppose it did the following:
This is weird but might not be bad. If the risks average out over several hundred homes, a model like this might perform well—despite the consternation homeowners would feel if they ever saw such individual projections.
Of course, “real” models don’t do this—or do they? The expansion of my chess model which I described last Election Day has started doing this. It fixates on some moves but gives near-zero probability to others—even ones that were played—while giving fits 5–50x sharper than before. If you’ve already had experience with behavior like the above, please feel welcome to jump to the end and let us know in comments. But to see what lessons to learn from how this happens in my new model, here are details…
My chess model assigns a probability p_i to every possible move m_i at every game turn t, based only on the values v_i given to those moves by strong computer chess programs and parameters denoting the skill profile of a formal player P. The programs list the move options m_1, m_2, m_3, … in order of value for the player to move, so that δ_i = v_1 − v_i is the raw inferiority of m_i in chess-specific centipawn units.
The model asserts that the parameters can be used to compute dimensionless inferiority values x_i, from which projected probabilities p_i are obtained without further reference to either parameters or data. The old model starts with a function that scales down the raw difference δ_i according to the overall position value, yielding a scaled difference δ′_i. Then it defines

x_i = (δ′_i / s)^c.
Lower s and higher c both decrease the probability of playing a sub-optimal move by dint of driving x_i higher. The effect of s is greatest when δ′_i is low, so s is interpreted as the player’s “sensitivity” to small differences in value, whereas c governs the frequency of large mistakes and hence is called “consistency.” My conversion represents each p_i as a power of the best-move probability p_1, namely solving the equations

p_i = p_1^{e^{x_i}},   Σ_{i=1}^{ℓ} p_i = 1,
where ℓ is the number of legal moves in the position. The double exponential looks surprising but can be broken down by regarding u_i = e^{−x_i} as a “utility share” expressed in proportion to the best move’s utility u_1 = 1, then p_i = p_1^{1/u_i}. Alternate formulations can define p_i directly from x_i and the parameters, e.g. by p_i ∝ e^{−x_i}, and/or simply normalize the shares by their sum rather than use powers, but they seem not to work as well.
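The inner loop can be sketched numerically. This is my reading of the scheme, assuming the share form u_i = exp(−(δ_i/s)^c) and solving for p_1 by bisection; the delta values and parameter settings are made up:

```python
import math

def project(deltas, s, c, tol=1e-12):
    """Solve sum_i p1 ** (1/u_i) = 1 for p1 by bisection, where the utility
    share is u_i = exp(-(delta_i / s) ** c); then p_i = p1 ** (1/u_i)."""
    shares = [math.exp(-((d / s) ** c)) for d in deltas]
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        p1 = (lo + hi) / 2.0
        # The total is monotone increasing in p1, so bisection converges.
        if sum(p1 ** (1.0 / u) for u in shares) > 1.0:
            hi = p1
        else:
            lo = p1
    return [hi ** (1.0 / u) for u in shares]

# Hypothetical scaled differences for four legal moves, with made-up (s, c):
probs = project([0.0, 0.10, 0.30, 0.90], s=0.10, c=0.50)
```

The probabilities sum to 1 and fall off steeply with the raw inferiority, as the double exponential dictates.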
This “inner loop” defines the p_i as a probability ensemble given any point in the parameter space. The “outer loop” of regression needs to determine the point that best conforms to the given data sample. The p_i determine projections for the frequency MM of “matching” the computer’s first move and the “average scaled difference” AD of the played moves by:

MM = (1/T)·Σ_t p_{1,t}   and   AD = (1/T)·Σ_t Σ_i p_{i,t}·δ′_{i,t}.
The regression makes these into unbiased estimators by matching them to the actual move-match frequency and average difference in the sample. We can view this as minimizing the least-squares “fitness function”

F = w_1·(MM − MM_act)^2 + w_2·(AD − AD_act)^2,
where the weights w_1, w_2 on the individual tests are fixed ad-lib. In fact, my old model virtually always gets F = 0, thus solving two equations in two unknowns. Myriad alternative fitness functions using other statistical tests and weights help to judge the larger quality of the fit and cross-validate the results.
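The outer loop can be sketched with a toy stand-in for the inner loop, a plain one-parameter softmax rather than the model's double exponential, fitted by grid search on the least-squares gap; the sample data is fabricated:

```python
import math

# Toy stand-in for the inner loop (NOT the model's formulas; illustration only).
def toy_probs(deltas, s):
    w = [math.exp(-d / s) for d in deltas]
    z = sum(w)
    return [wi / z for wi in w]

# Fabricated sample: per position, the move-value deltas and the index played.
sample = [([0.0, 0.2, 0.5], 0), ([0.0, 0.1], 1), ([0.0, 0.3, 0.6, 0.9], 0)]

def fitness(s, w1=1.0, w2=1.0):
    """Least-squares gap between projected and actual values of the
    move-match frequency and the average difference of played moves."""
    n = len(sample)
    mm_proj = sum(toy_probs(d, s)[0] for d, _ in sample) / n
    mm_act = sum(1 for _, i in sample if i == 0) / n
    ad_proj = sum(sum(p * x for p, x in zip(toy_probs(d, s), d))
                  for d, _ in sample) / n
    ad_act = sum(d[i] for d, i in sample) / n
    return w1 * (mm_proj - mm_act) ** 2 + w2 * (ad_proj - ad_act) ** 2

# Grid search over the single parameter:
best_s = min((k / 100.0 for k in range(5, 100)), key=fitness)
```

With two fitted tests and two real parameters, the real model can usually drive the gap all the way to zero; the toy version only minimizes it.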
In my original model, all is good. My training sets for a wide spectrum of Elo ratings yield best-fit values of (s, c) that not only give a fine linear fit to rating with residuals small across the spectrum, but the individual sequences of s-values and c-values also give good linear fits to rating on their own. Moreover, for all fitted (s, c) and positions, the projected probabilities derived from the x_i have magnitudes that spread out over the reasonable moves.
My old model is however completely monotone in this sense: the best move(s) always have the highest p_i, regardless of (s, c). Moreover, an uptick in the value of any move increases its p_i for every (s, c). This runs counter to the natural idea that weaker players prefer weaker moves.
The new model postulates a mechanism by which weaker moves may be preferred by dint of looking better at earlier stages of the search. A new measure called “swing” is positive for moves whose high worth emerges only late in the search, and negative for moves that look attractive early on but end with subpar values. The latter moves might be “traps” set by a canny opponent, such as the pivotal example from the 2008 world championship match discussed here.
A player’s susceptibility to “swing” is modeled by a new parameter h, called “heave,” as I described last November. The basic idea is that v_i − h·sw(m_i) represents the “subjective value” of the move m_i, so that δ_i + h·sw(m_i) represents the subjective difference in value. The idea I actually use applies swing to adjust the inferiority measure:

x′_i = x_i + (h·sw(m_i)/s)^{cb},
where b is a fourth parameter and z^{cb} for negative z is defined to be −|z|^{cb}. Dropping c from the second term’s exponent and raising it to just the b power would be mathematically equivalent, but coupling the parameters makes it easier to try constraining b = 1 and/or b = c. (In fact, I’ve tried various other combinations and tweaks to the formulas for x_i and x′_i, plus four other parameters kept frozen to default values in examples here. None so far has changed the picture described here.)
Note that the formulas for x′_i preserve the property x′_1 = 0 for the first-listed move m_1. When a move m_i has equal-optimal value, that is δ_i = 0, its swing sw(m_i) cannot be negative and is usually positive. That makes x′_i > 0 and hence reduces the share u_i compared to u_1 = 1. The first big win for the new model is that it naturally handles a puzzling phenomenon I identified years ago, for which my old model makes an ad-hoc adjustment.
The second big win is that x′_i can be negative even when δ_i > 0—the swing term overpowers the other. This means the model projects the inferior move as more likely than the engine’s optimal move. This is nervy but in many cases my model correctly “foresees” the player taking the bait.
The third big win—but tantalizing—is that the extended model not only allows solving 2 more equations but often makes other fitness tests align like magic. The first of the extra tests becomes an unbiased estimator for the frequency of playing a move of equal value to the best move, which became my third cheating test after its advocacy in this paper (see also the reply in this).
A typical fit that looks great by all these measures is here. It has 26,450 positions from all 497 games at standard time controls with both players rated between Elo 2040 and Elo 2060 since 2014 that are collected in the encyclopedic ChessBase Big 2017 data disc. It shows the move-match tests, then the average-difference test and tests related to it, and finally cases of moves bucketed by raw difference: 0.01–0.10, 0.11–0.30, 0.31–0.70, and 0.71–1.50, plus four more buckets beyond those.
Only four of the tests were fitted on purpose. All the other tests follow closely like baby ducks in a row, except for some like captures and advancing versus retreating moves where human peculiarities may be identified. The value of the fitness function is 5–10x as sharp as what my old model typically achieves. The new model seems to be confirming itself across the board and fulfilling the goal of giving accurate projected probabilities for all moves, not just the best move(s). What could possibly be amiss?
The first hint of trouble comes from the fitted value of s being 0.0083. In my old model, players rated 2050 give much larger values of s, and even the best players in the world give values several times higher. Players rated 2050 are in amateur ranks, and s = 0.0083 leaves no headroom for masters and grandmasters. The value of c compounds the sharpness; together with s, a slight value difference of (say) 0.10 gets ballooned up in the exponent, giving projected probabilities that shrink near 1-in-5,000 and fall below 1-in-650,000 for differences not much larger. This is weirdly small—and we have not even yet involved the effects of the swing term with h ≈ 1.80.
Those effects show up immediately in the file. I skip turns 1–8, so White’s 9th move is the first item. In the following position at left, Black has just captured a pawn and White has three ways to re-take, all of them reasonable moves according to the Stockfish 7 program.
Positions in game Franke-Doennebrink, 1974 at White’s 9th move (left) and Black’s 11th (right). |
Here is how my new model projects them:
NRW Class1 1314;Germany;2014.02.02;6.4;Franke, Thomas;Doennebrink, Elmar;1-0
r1b1k2r/pp2bppp/2n1pn2/3q4/3p4/2PBBN2/PP3PPP/RN1Q1RK1 w kq - 0 9;
c3xd4, engine c3xd4 Eval 0.24 at depth 21; swap index 1 and spec AA2050SF7w4sw10-19: (InvExp:1),
Unit weights with s = 0.0083, c = 0.3846, d = 12.5000, v = 0.0500, a = 0.9863, hm = 1.8024, hp = 1.0000, b = 1.0000:

M# Rk Move     RwDelta ScDelta   Swing  SwDDep   SwRel  Util.Share  ProjProb'y
 1  1 c3xd4:      0.00   0.000   0.000   0.000   0.000  1           0.79527569
 2  2 Nf3xd4:     0.42   0.321  -0.035  -0.034  -0.034  0.144422    0.20472428
 3  3 Be3xd4:     0.55   0.395   0.008   0.005   0.005  0.00445313  0.00000001
That’s right—it gives zero chance of a 2050-player taking with the Bishop, even though Stockfish rates that only a little worse than taking with the Knight. True, human players would say 9.Bxd4 is a stupid move because it lets Black gain the “Bishop pair” by exchanging his Knight for that Bishop. Of 155 games that ChessBase records as reaching this position, 151 saw White recapture by 9.cxd4, 4 by 9.Nxd4, and none by 9.Bxd4. So maybe the extremely low projection—for 9.Bxd4 and all other moves—has a point. But to give zero? The 0.00445313 is the utility share, so the true projected probability is actually far smaller than shown; the 0.00000001 is an imposed minimum. My original model—setting the swing parameters aside and fitting only s and c—spreads out the probability nicely, maybe even too much here:
M# Rk Move     RwDelta ScDelta   Swing  SwDDep   SwRel  Util.Share  ProjProb'y
 1  1 c3xd4:      0.00   0.000   0.000   0.000   0.000  1           0.57620032
 2  2 Nf3xd4:     0.42   0.321  -0.035  -0.034  -0.034  0.280586    0.14018157
 3  3 Be3xd4:     0.55   0.395   0.008   0.005   0.005  0.241178    0.10168579
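Incidentally, the Util.Share and ProjProb'y columns in the two tables appear to satisfy a power relation, projected probability = (best-move probability)^(1/share). That is my reading of the printed numbers, not a documented formula, but it checks out numerically:

```python
# Check: projected probability = (best-move probability) ** (1 / utility share).
# Numbers are copied from the printed tables; tolerance allows for rounding.

def proj(p1, share):
    return p1 ** (1.0 / share)

# New-model fit: 9.Nxd4 from the first table.
check_new = abs(proj(0.79527569, 0.144422) - 0.20472428)

# Old-model fit: 9.Nxd4 and 9.Bxd4 from the second table.
check_old_1 = abs(proj(0.57620032, 0.280586) - 0.14018157)
check_old_2 = abs(proj(0.57620032, 0.241178) - 0.10168579)
```

All three reproduce the printed probabilities to about four decimal places.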
At Black’s 11th turn, however, the new model gives three clearly wrong “zero” projections:
NRW Class1 1314;Germany;2014.02.02;6.4;Franke, Thomas;Doennebrink, Elmar;1-0
r1b2rk1/pp2bppp/2nqpn2/8/3P4/P1NBBN2/1P3PPP/R2Q1RK1 b - - 0 11;
Rf8-d8, engine b7-b6 Eval 0.11 at depth 20; swap index 1 and spec AA2050SF7w4sw10-19: (InvExp:1),
Unit weights with s = 0.0083, c = 0.3846, d = 12.5000, v = 0.0500, a = 0.9863, hm = 1.8024, hp = 1.0000, b = 1.0000:

M# Rk Move     RwDelta ScDelta   Swing  SwDDep   SwRel  Util.Share  ProjProb'y
 1  1 b7-b6:     -0.00  -0.000   0.000   0.000   0.000  1           0.56792559
 2  2 Nf6-g4:     0.18   0.163  -0.001  -0.002  -0.002  0.0907468   0.00196053
 3  3 Rf8-d8:     0.18   0.163   0.042   0.046   0.046  0.00391154  0.00000001
 4  4 Bc8-d7:     0.21   0.187  -0.029  -0.030  -0.030  0.278053    0.13071447
 5  5 Nf6-d5:     0.28   0.241   0.047   0.050   0.050  0.00218845  0.00000001
 6  6 a7-a6:      0.30   0.256  -0.049  -0.051  -0.051  0.28777     0.14001152
 7  7 Qd6-c7:     0.31   0.264  -0.012  -0.012  -0.012  0.097661    0.00304836
 8  8 g7-g6:      0.37   0.306   0.015   0.017   0.017  0.00355675  0.00000001
 9  9 Qd6-d8:     0.39   0.320  -0.054  -0.051  -0.051  0.206264    0.06438231
10 10 Qd6-b8:     0.39   0.320  -0.037  -0.038  -0.038  0.158031    0.02787298
Owing to many other games having “transposed” here by a different initial sequence of moves, Big 2017 shows 911 games reaching this point. In 683 of them, Black played the computer’s recommended 11…b6. None played the second-listed move 11…Ng4, which reflects well on the model’s giving it a tiny probability. But the third-listed move 11…Rd8 gets a zero despite having been chosen by 94 players. Then 91 played the sixth-listed 11…a6, which actually gets the second-highest nod from the model, and 22 played 11…Bd7, which the new model considers third most likely. But 12 players chose 11…Nd5, four of them rated over 2300 including the former world championship candidate Alexey Dreev in a game he won at the 2009 Aeroflot Open. My old model’s fit of the same data gives 34.8% to 11…b6, 10.4% to 11…Ng4 and 7.5% to 11…Rd8 with the ad-hoc change for tied moves (would be 8.7% to both without it), and 5.1% to 11…Nd5, with eighteen moves getting at least 1%.
To be sure, this is a well-known “book” position. The 75% preference for 11…b6 doubtless reflects players’ knowledge of past games and even the fact that Stockfish and other programs consider it best. It is hard to do a true distributional benchmark of my model in selected positions because the ones with enough games are exactly the ones in “book.” Studies of common endgame positions have been tried then and now, but with the issue that the programs’ immediate complete resolution of these endgames seems to wash out much of the progression in thinking and differentiation of player skill that one would like to capture. (My cheating tests exclude all “book-by-2300+” positions and all with one side ahead more than 3.00.) Most to the point, the fitting done by my model on training data is supposed to be already the distributional test of how players of that rating class have played over many thousands of instances.
The following position is far from book and typifies the most egregious kind of mis-projection:
SVK-chT1E 1314;Slovakia;2014.03.23;11.6;Debnar, Jan;Milcova, Zuzana;1-0
2r4k/pp5p/2n5/2P1p2q/2R1Qp1r/P2P1P2/1P3KP1/4RB2 b - - 1 32;
Qh5-g5, engine Qh5-g5 Eval 0.01 at depth 21; swap index 2 and spec AA2050SF7w4sw10-19: (InvExp:1),
Unit weights with s = 0.0083, c = 0.3846, d = 12.5000, v = 0.0500, a = 0.9863, hm = 1.8024, hp = 1.0000, b = 1.0000:

M# Rk Move     RwDelta ScDelta   Swing  SwDDep   SwRel  Util.Share  ProjProb'y
 1  1 Qh5-g5:     0.00   0.000   0.129   0.137   0.000  0.0347142   0.00018527
 2  2 Rc8-d8:     0.00   0.000   0.026   0.025  -0.112  1           0.74206054
 3  3 a7-a6:      0.09   0.085  -0.008  -0.005  -0.142  0.117659    0.07922164
 4  4 Rc8-g8:     0.21   0.187  -0.092  -0.087  -0.225  0.0996097   0.05003989
 5  5 Rh4-h1:     0.25   0.219  -0.166  -0.165  -0.302  0.136659    0.11270482
This has two tied-optimal moves for Black in a position judged +0.01 to White, not a flat 0.00 draw value, yet the one that was played gets under a 1-in-5,000 projection. Here are the by-depth values that produced the high positive value:
--------------------------------------------------------------------------------------------
       5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21
--------------------------------------------------------------------------------------------
Qg5 -117 -002 -008 +000 -015 +008 +017 +032 +011 +007 +000 +004 +001 +000 +000 +006 +001
Rd8 -089 -058 -036 -032 -013 -025 +006 +000 -010 +000 -012 -013 +001 +014 +001 +001 +001
--------------------------------------------------------------------------------------------
The numbers are from White’s view, so what happened is that 32…Rd8 looked like giving Black the advantage at depths 10, 13, and 15–16, whereas 32…Qg5 looked significantly inferior (to Stockfish 7) at depth 12 and nosed in front only at depth 20 just before falling into the tie. The swing computation begins at depth 10 to evade the Stockfish-specific strangeness I noted here last year, so in particular the “rogue” values at depth 5 (and below) are immaterial. The values and differences from depth 10 onward are all relatively gentle. Hence their amounting to a tiny utility share of 0.0347 versus 1 and microscopic 0.00018527 probability is a sudden whiplash.
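One can compute a crude swing proxy from the depth table above. This is NOT the model's actual formula, just an early-versus-late average from depth 10 onward from the mover's side; it lands near the +0.129 shown for 32…Qg5, though not the +0.026 for 32…Rd8, so the true formula clearly weights depths differently:

```python
# Crude swing proxy: average value over late depths minus average over early
# depths, starting at depth 10, in pawns from the mover's (Black's) view.

def swing_proxy(white_view, start=10, split=15):
    """white_view: {depth: centipawns from White's side}; Black is to move."""
    mover = {d: -v for d, v in white_view.items() if d >= start}
    early = [v for d, v in mover.items() if d < split]
    late = [v for d, v in mover.items() if d >= split]
    return (sum(late) / len(late) - sum(early) / len(early)) / 100.0

# Values transcribed from the table above (White's view).
qg5 = {5: -117, 6: -2, 7: -8, 8: 0, 9: -15, 10: 8, 11: 17, 12: 32, 13: 11,
       14: 7, 15: 0, 16: 4, 17: 1, 18: 0, 19: 0, 20: 6, 21: 1}
rd8 = {5: -89, 6: -58, 7: -36, 8: -32, 9: -13, 10: -25, 11: 6, 12: 0, 13: -10,
       14: 0, 15: -12, 16: -13, 17: 1, 18: 14, 19: 1, 20: 1, 21: 1}
```

Under this proxy Qg5's worth emerges late (positive swing) while Rd8 peaked earlier (negative), capturing the qualitative contrast even where the magnitudes differ.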
What I believe is happening to the fit is hinted by this last example giving the highest probability to the 2nd-listed move. Our first game above has two positions where the 9th-listed move gets the love. (The second, shown in full here, is notable in that the second-best move gets a zero though it is inferior by only 0.03 and was played by all three 2200+ players in the book.) This conforms to the goal of projecting when weaker players will prefer weaker moves.
This table shows that the new model quite often prefers moves other than the first-listed move m_1, compared to how often they are played:
To be sure, the model is not putting 100% probability on these preferred moves, but when preferred they get a lot more probability than under my old model, which never prefers a move other than m_1. Recall however that my old model’s fit was not too far off on these indices—and both models are fitted to give the same total probability to m_1 over all positions. Hence the probability on inferior moves is conserved but more concentrated.
Yes, greater concentration was the goal—so as to distinguish the most plausible inferior moves. But the above examples show a runaway process. The new model seems to be seizing onto properties of the distribution alone. For each turn t we can define m*_t to be the move with the most negative value of x′_i. The projected probabilities of these moves also form a histogram over the sample. The fitting process can grab it by putting all weight on m*_t plus at most a few other moves at each turn t.
These few moves are the “stopped-watch readings” in my analogy. The moves given zero are the readings that cannot happen for a given runner/position. The fitting doesn’t care whether moves getting zero were played, so long as other turns fill in the histogram. If a high probability for the most-negative-swing move—as with 32…Rd8 above—fills a gap, the fit will gravitate toward values of h and b that beat down all the moves with positive x′_i at such turns. In trials on other data, I’ve seen s crash toward zero while h zooms aloft in a crazy race.
What can fix this? The maximum likelihood estimator (MLE) in this case involves minimizing the negated log-sum of the projected probabilities of the moves actually played at each turn t. Adding it as a weighted component of the fitness function helps a little by inflating the probability of the moves that were played, but so far not a lot. Even more on-point may be maximum entropy (ME) estimation, which in this case means minimizing

Σ_t Σ_i p_{i,t} log p_{i,t}.
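A toy comparison of the two estimator components, again with a softmax stand-in for the inner loop and fabricated sample data:

```python
import math

# Toy softmax stand-in (hypothetical; not the model's actual formulas).
def probs(deltas, s):
    w = [math.exp(-d / s) for d in deltas]
    z = sum(w)
    return [wi / z for wi in w]

# Fabricated sample: deltas per position and the index of the move played.
sample = [([0.0, 0.2, 0.5], 2), ([0.0, 0.1], 0), ([0.0, 0.3, 0.6], 0)]

def neg_log_likelihood(s):
    """MLE component: minus the log-sum of probabilities of played moves."""
    return -sum(math.log(probs(d, s)[i]) for d, i in sample)

def neg_entropy(s):
    """ME component: sum of p log p, smaller when probability spreads out."""
    return sum(p * math.log(p) for d, _ in sample for p in probs(d, s))

def penalized(s, w_mle=0.1):
    """Weighted combination: punishes near-zero projections of played moves."""
    return neg_entropy(s) + w_mle * neg_log_likelihood(s)
```

Both components blow up exactly when the fit concentrates all weight on a few moves while a played move gets squeezed toward zero, which is the runaway behavior above.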
There are various other ways to fit the model, including a quantiling idea I devised in my AAAI 2011 paper with Guy Haworth. In principle, and because the training data is copious, it is good to have these ways agree more than they do at present. Absent a lightning bolt that fuses them, I am finding myself locally tweaking the model in directions that optimize some “meta-fitness” function composed from all these tests.
Is this a known issue? Does it have a name? Is there a standard recipe for fixing it?
Do any deployed models have similar tendencies that aren’t noticed because there isn’t the facility for probing deeper into the grain that my chess model enjoys?
[added “at standard time controls”, a few other word changes, added game diagrams]
Cropped from source |
Bill Clinton was the 42nd President of the United States. He came close to becoming the first First Gentleman—or whatever we will call the husband of a female president. He is also a fan of crossword puzzles, and co-authored with Victor Fleming a puzzle for this past Friday’s New York Times.
Today we discuss an apparently unintended find in his puzzle. It has a Mother’s Day theme.
The puzzle was widely publicized as having a “secret message” or “Easter egg.” Many crossword puzzles have a theme constituted by the longer answers, but the Friday and Saturday NYT puzzles are usually themeless. They are also designed to be the hardest in a progression that begins with a relatively easy Monday puzzle each week. The online renditions are subscriber-only, but the Times opened this puzzle freely to the public, so you are welcome to try to solve it and find the “hidden” content before we give it away.
In a previous post we featured Margaret Farrar, the famous first crossword editor for the Times, and described how the puzzles look and work. Proper nouns such as CHILE the country, standard abbreviations, and whole phrases are fair game as answers, and they are rammed together without spaces or punctuation. For instance, the clue “Assistance for returning W.W. II vets” in Clinton’s puzzle produces the answer GIBILL. (My own father, returning from the occupation of Japan, completed his college degree under the G.I. Bill.) Some clues are fill-in-the-blank, such as “Asia’s ____ Sea” in the puzzle.
The intended hidden message is formed from three long answers symmetrically placed around the puzzle’s center. It is the signature line from a 1977 Fleetwood Mac song that Clinton has used since his 1992 presidential campaign. If you expected the puzzle to have a theme, these three lines would obviously be it.
An “Easter egg” is a side feature, usually small and local and often, as Wikipedia says, an inside joke. When I printed and did the puzzle over lunch on Friday, I missed the intended content because it wasn’t the kind I was looking for. But I did find something one can call an “Eester gee” involving the three shorter clues and answers mentioned above:
My eye had been drawn by finding Bill in his own puzzle. Winding through him is HILLAREE, indeed in three different ways but with EE in place of Y. Straining harder, one can extract CHEL- from CHILE and get -Sea from the clue for ARAL just underneath to find Chelsea, the Clintons’ only daughter.
Admittedly this is both stilted and cryptic, but it is singularly tied to the former First Family and appropriate just before Mother’s Day. Was this hidden by intent, or was it hiding by accident? Presuming the latter, what does this say about the frequency with which we can find unintended patterns? This matters not only to some historical controversies but also to cases of alleged plagiarism of writing and software code, even this investigation over song lyrics being planted in testimony.
Can we possibly judge the accidental frequency of such subjective patterns? Clinton’s puzzle allows us to experiment a little further. His only grandchild, Chelsea’s daughter, is named Charlotte. Can we find her in the same place?
Right away, CHILE and ARAL give us CHAR in a square, a promising start. There are Ls nearby, but no O. Nothing like “Lenya” or a ‘Phantom’ reference is there to clue LOTTE. The THREE in our grid is followed by TON to answer the clue, “Like some heavy-duty trucks,” but getting the last four needed letters from there lacks even the veneer of defense of my using the I in CHILE as a connector. Is three tons a “lot”? No doting grandpa would foist that on a child. So we must reject the hypothesis that she is present.
We can attack the CHILE weakness in a similar manner. The puzzle design could have used CHELL, the player character of the classic video game Portal. HILLAREE would still have survived by using the I in Bill. However, the final L would have come below the N in the main-theme word THINKING, and it is hard to find natural answer words ending in NL. So our configuration has enough local optimality to preserve the contention that Chelsea is naturally present. Whether it is truly natural remains dubious, but it dodges this shot at refutation.
Going back, how should we regard the false-start on Charlotte? We should not be surprised that it got started. That she shares the first two letters with Chelsea may have been “correlated” if not expressly purposeful. Such correlations are a major hard-to-handle factor in cases of suspected plagiarism or illicit signaling, as both Dick and I can attest generally from experience.
Of course, this is more the stuff of potboilers and conspiracy theories than serious research. That hasn’t stopped it from commanding the input of some of our peers, however. The best-selling 1997 book The Bible Code, following a 1994 paper, alleges that sequences of Hebrew letters at fixed-jump intervals in the Torah—the first five books of the Hebrew Bible—form sensible prophetic messages to a degree far beyond statistical expectation.
The fact that Hebrew skips many vowels helps in forming patterns. For instance, arranging the start of Genesis into a 50-column crossword yields TORaH in column 6, and as Wikipedia notes here, exactly the same happens in column 8 at the start of Exodus. Even just among the consonants, some alleged messages have glitches and skips like ours with HILLAREE and CHILE. Where is the line between patching-and-fudging and true statistical surprise? Our friend Gil Kalai was one of four authors of a 1999 paper delving deep into the murk. They didn’t just critique the 1994 paper, they conducted various experiments. Some were akin to ours above with CHARLOTTE, some could be like trying to find unsavory Clinton associations in the same puzzle, and the largest was replicating many of the same kind of finds in a Hebrew text of War and Peace.
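The fixed-jump searches behind such claims are easy to replicate. Here is a small sketch (the sample text is made up) that looks for a target word as an equidistant letter sequence:

```python
def find_els(text, target, max_skip=50):
    """Return (start, skip) pairs at which target occurs as an
    equidistant letter sequence (letters only, case-folded)."""
    letters = [c for c in text.lower() if c.isalpha()]
    target = target.lower()
    span = len(target) - 1
    hits = []
    for skip in range(1, max_skip + 1):
        for start in range(len(letters) - span * skip):
            if all(letters[start + i * skip] == target[i]
                   for i in range(len(target))):
                hits.append((start, skip))
    return hits

# Short targets turn up in almost any text once modest skips are allowed.
sample = "in the beginning was a long and rambling sentence about tiles"
hits = find_els(sample, "gin")
```

Running this on any sizable text shows how quickly short "messages" accumulate as the allowed skip grows—the statistical heart of the dispute.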
The controversy over the genesis of William Shakespeare’s plays has notoriously involved allegedly hidden messages, most famously stemming from the 1888 book The Great Cryptogram supporting Francis Bacon as their true author. Two other major claimants, Edward de Vere (the seventeenth Earl of Oxford) and Christopher Marlowe, are hardly left out. Indeed, they both get crossword finds in the most prominent place of all, the inscription on Shakespeare’s funerary monument in Stratford, England:
The inscription is singular in challenging the “passenger” (passer-by) to “read” who is embodied within the Shakespeare monument. His tomb proper is nearby in the ground. Supporters of de Vere arrange the six parts of the Latin preface into a crossword and find their man in column 2:
The leftover OL is a blemish but it might not be wasted—it could refer to “Lord Oxford” in like manner to how “Mr. W.H.” in the dedication to Shake-speares Sonnets plausibly refers to Henry Wriothesley, the Earl of Southampton, who was entreated to marry one of Oxford’s daughters throughout 1590–1593.
Supporters of Marlowe volley back in the style of a British not American crossword. Their answer constructs this part of the inscription as a cryptic-crossword clue:
Whose name doth deck this tomb, far more, then cost.
The only name on Shakespeare’s tomb is Jesus, and the Oxford English Dictionary registers ley as an old word for a bill or tax, generically a cost. The answer to the monument’s riddle thus becomes CHRISTO-FAR MORE-LEY, which is within the convex hull of how Marlowe’s name was spelled in his lifetime. The subsequent SIEH, which is most simply explained as a typo for SITH meaning “Truly,” is constructed by modern cryptic-crossword convention as “HE IS returned,” in line with theories that Marlowe’s 1593 murder was actually staged to put him under deep cover in the Queen’s secret service.
What to make of these two readings? The only solid answer Dick and I have is the same as when we are sent a claimed proof of one week and one of the next:
They can’t both be right.
Or—considering that Marlowe has recently been credited as a co-author of Shakespeare’s Henry VI cycle, and that William Stanley, who completes Wikipedia’s featured quartet of claimants, wound up marrying the above-mentioned daughter of Oxford—perhaps they can.
Where do you draw the lines among commission, coincidence, and contrivance? Where does my Clinton crossword finding fall?
Happy Mother’s Day to you and yours as well.
[fixed description of Chell character, “seventh”->”seventeenth”, added ref. to song-lyrics case, some wording tweaks]
Alternate photo by Quanta |
Thomas Royen is a retired professor of statistics in Schwalbach am Taunus near Frankfurt, Germany. In July 2014 he had a one-minute insight about how to prove the famous Gaussian correlation inequality (GCI) conjecture. It took one day for him to draft a full proof of the conjecture. It has taken several years for the proof to be accepted and brought to full light.
Today Ken and I hail his achievement and discuss some of its history and context.
Royen posted his paper in August 2014 with the title, “A simple proof of the Gaussian correlation conjecture extended to multivariate gamma distributions.” He not only proved the conjecture, he recognized and proved a generalization. The “simple” means that the tools needed to solve it had been available for decades. So why did it elude some of the best mathematicians for those decades? One reason may have been that the conjecture spans geometry, probability theory, and statistics, so there were diverse ways to approach it. A conjecture that can be viewed in so many ways is perhaps all the more difficult to solve.
Even more fun is that Royen proved the conjecture after he was retired and had the key insight while brushing his teeth—as told here. Ken recalls one great bathroom insight not in his research but in chess: In the endgame stage of the famous 1999 Kasparov Versus the World match, which became a collaborative research activity later described by Michael Nielsen in his book, Reinventing Discovery, Ken had a key idea while in the shower. His idea, branching out from the game at 58…Qf5 59. Kh6 Qe6, was the Zugzwang maneuver 60. Qg1+ Kb2 61. Qf2+ Kb1 62. Qd4!, which remains the only way for White to win.
Although solutions often come in a flash, the ideas they resolve often germinate from partial statements whose history takes effort to trace. One thing we can say is that the GCI does not originate with Carl Gauss, nor should it be considered named for him. A Gaussian measure on $\mathbb{R}^n$ (centered on the origin) is defined by having the probability density

$\displaystyle f(x) \;=\; \frac{1}{\sqrt{(2\pi)^n \det \Sigma}}\, e^{-\frac{1}{2} x^\top \Sigma^{-1} x},$

where $\Sigma$ is a non-singular covariance matrix and $x^\top$ just means the transpose of $x$. Its projection onto any component is a usual one-variable normal distribution.
Suppose $I_1$ is a 90% confidence interval for a variable $x_1$ and $I_2$ is a 90% confidence interval for another variable $x_2$. What is the probability that both variables fall into their intervals? If they are independent, then it is $0.9 \times 0.9 = 0.81$.

What if they are not independent? If they are positively correlated, then we may expect it to be higher. If they are inversely related, well…let's also suppose the variables have mean $0$ and the intervals are symmetric around $0$: $I_1 = [-a,a]$, $I_2 = [-b,b]$. Do we still get at least $0.81$? This—extended to any subset of the $n$ variables with any smattering of correlations and to other shapes besides the products of intervals—is the essence of the conjecture.
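One can probe the interval question numerically. A minimal Monte Carlo sketch (using numpy; the correlation 0.5 and the 90% cutoff are illustrative choices) compares the joint probability with the product of the marginals:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.5                                   # illustrative correlation
cov = np.array([[1.0, rho], [rho, 1.0]])    # covariance matrix
x = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)

a = 1.6449  # |z| <= 1.6449 holds with probability ~0.90 for a standard normal
in_A = np.abs(x[:, 0]) <= a
in_B = np.abs(x[:, 1]) <= a

both = float(np.mean(in_A & in_B))           # P(both intervals)
prod = float(np.mean(in_A) * np.mean(in_B))  # ~0.9 * 0.9 = 0.81
# GCI asserts both >= prod; positive correlation makes the gap visible.
```

Of course a simulation with one correlation value proves nothing about the conjecture; it only illustrates the quantity being bounded.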
Charles Dunnett and Milton Sobel considered some special cases, such as when the correlations have the product form given by an outer product $bb^\top$ for some vector $b$, which keeps $\Sigma$ positive definite. Their 1955 paper is considered by some to be the source of GCI.
But it was Olive Dunn who first posed the problem in the above general terms, in a series of papers that have had other enduring influence. The first paper in 1958 and the second in 1959 bore the like-as-lentils titles “Estimation of the Means of Dependent Variables” and “Estimation of the Medians for Dependent Variables.”
These seem to have generated confusion. The former is longer, frames the confidence-interval problem, and is the only one to cite Dunnett–Sobel, but it does not mention a “conjecture.” The latter does discuss at the end exactly the conjecture of extending a case she had proved to arbitrary dimension, but relates a reader's counterexample. Natalie Wolchover ascribed the conjecture to the 1959 paper in her article linked above, but Wikipedia and other sources reference the 1958 paper, while subsequent literature we've seen has instances of citing either—and never both.
Dunn became a fellow of the American Statistical Association, a fellow of the American Association for the Advancement of Science (AAAS), and a fellow of the American Public Health Association. In 1974, she was honored as the annual UCLA Woman of Science, awarded to “an outstanding woman who has made significant contributions in the field of science.” Her third paper in this series, also 1959, was titled “Confidence intervals for the means of dependent normally distributed variables.” Her fourth, in 1961, is known for the still-definitive form of the Bonferroni correction for joint variables. But in our episode of “CSI: GCI” it seems we must look later to find who framed the conjecture as we know it.
Not an ad. Amazon source. So is it an ad? |
Sobel came back to the scene as part of a 1972 six-author paper, “Inequalities on the Probability Content of Convex Regions for Elliptically Contoured Distributions.” They considered integrals of the form

$\displaystyle \int_A f(x^\top \Sigma^{-1} x)\,dx$

for general functions $f$ besides $f(t) = e^{-t/2}$ and for general positive definite $\Sigma$. GCI in this case then has the form $P(A \cap B) \ge P(A)\,P(B)$ for the resulting probability measures, with the Gaussian base case being $\Sigma = I$, where $I$ is the identity matrix. They call the density elliptically contoured provided the above integral over all of $\mathbb{R}^n$ is finite. Writing about the history, they say (we have changed a few symbols and the citation style):
Inequalities for $P(A \cap B)$ perhaps originate with special results of Dunnett and Sobel (1955) and of Dunn (1958), in which it is shown that $P(A \cap B) \ge P(A)\,P(B)$ for special forms of $\Sigma$ (with $f(t) = e^{-t/2}$) or for special values of $n$.
They mention also an inequality by David Slepian and what they termed “the most general result for the normal distribution” by Zbyněk Šidák, still with special conditions on $\Sigma$. Their main result is “an extension of Šidák's result to general elliptically contoured densities [plus] a stronger version dealing with a convex symmetric set.” This is where the relaxation from products of confidence intervals took hold. At last, after their main proof in section 2 and discussion in section 3, we find the magic word “conjecture”:
This suggests the conjecture: if $X = (X^{(1)}, X^{(2)})$ is a random vector (with $X^{(1)}$ of dimension $k$ and $X^{(2)}$ of dimension $n-k$) having density $f(x^\top \Sigma^{-1} x)$ and if $A \subseteq \mathbb{R}^k$ and $B \subseteq \mathbb{R}^{n-k}$ are convex symmetric sets, then

$\Pr[X \in \bar{A} \cap \bar{B}] \;\ge\; \Pr[X \in \bar{A}]\cdot\Pr[X \in \bar{B}],$

where $\bar{A} = A \times \mathbb{R}^{n-k}$ and $\bar{B} = \mathbb{R}^k \times B$ are the corresponding cylinder sets.
Clearly by iteration this implies the inequality with regard to any partition into blocks of coordinates. Here symmetric means just that $-x$ belongs whenever $x$ belongs. Any symmetric convex set can be decomposed as an intersection of strips of the form $\{x : |c^\top x| \le d\}$ for fixed $c$ and $d$, which their generality set them up to handle, and proving the inequality for strips suffices. This is considered the modern statement of GCI. The rest of their paper—over half of it—treats attempts to prove it and counterexamples to some further extensions.
Finally in 1977, Loren Pitt proved the case $n = 2$, referencing the 1972 paper and Šidák but not Dunnett–Sobel or Dunn. Wolchover interviewed Pitt for her article, and this extract is revealing:
Pitt had been trying since 1973, when he first heard about [it]. “Being an arrogant young mathematician … I was shocked that grown men who were putting themselves off as respectable math and science people didn’t know the answer to this,” he said. He locked himself in his motel room and was sure he would prove or disprove the conjecture before coming out. “Fifty years or so later I still didn’t know the answer,” he said.
So as for framing GCI, whodunit? Royen ascribes it to the 1972 paper which is probably what popularized it to Pitt, but Dunn’s orthogonal-intervals formulation spurred the intervening work, accommodates extensions noted as equivalent to GCI by Royen citing this 1998 paper, and still didn’t get solved until Royen. So we find these two sources equally “guilty.”
The 1972 form of GCI has a neatly compact statement and visualization:
For any symmetric convex sets $A, B$ in $\mathbb{R}^n$ and any Gaussian measure $\mu$ on $\mathbb{R}^n$ centered at the origin,

$\mu(A \cap B) \;\ge\; \mu(A)\cdot\mu(B).$
That is, imagine overlapping shapes symmetric about the origin in some Euclidean space. Throw darts that land with a Gaussian distribution around the origin. The claim is that the probability that a dart lands in both shapes is at least the probability that it lands in one shape times the probability that it lands in the other shape.
UK Daily Mail source |
George Lowther, in his blog “Almost Sure,” has an interesting post about early attempts to solve GCI. He notes the following partial results from the above-mentioned 1998 paper:
The first statement proves GCI in a “shrunken” sense, while the second makes that seem tantamount to solving the whole thing. Lowther explained, however:
Unfortunately, the constant $c$ in the first statement is strictly less than one, so the second statement cannot be applied. Furthermore, it does not appear that the proof can be improved to increase $c$ to one. Alternatively, we could try improving the second statement to only require the sets to be contained in the ball of radius $c\sqrt{n}$ for some constant $c$ but, again, it does not seem that the proof can be extended in this way.
Royen did not use this idea—indeed, Wolchover quotes Pitt as saying, “what Royen did was kind of diametrically opposed to what I had in mind.” Instead she explains how Royen used a kind of smoothing between the original matrix $\Sigma$ and its diagonal part (with off-diagonal entries zeroed out as above) as a quantity $t$ varies from $0$ to $1$, taking derivatives with respect to $t$. For this he had tools involving Laplace transforms and other tricks at hand:
“He had formulas that enabled him to pull off his magic,” Pitt said. “And I didn’t have the formulas.”
Royen’s short paper does need the background of these tricks to follow, and the fact that the same tricks enabled a further generalization of GCI makes it harder. The proof was made more self-contained in this 2015 paper by Rafał Latała and Dariusz Matlak (final version) and in a 2016 project by Tianyu Zhou and Shuyang Shen at the University of Toronto, both focusing just on GCI and cases closest to Dunn’s papers. Rather than go into proof details here, we’ll say more about the wider context.
Independent events are usually the best type of events to work with. Recall that if $A$ and $B$ are independent events, then

$\Pr[A \cap B] \;\ge\; \Pr[A]\cdot\Pr[B].$

Of course actually more is true: $\Pr[A \cap B] = \Pr[A]\cdot\Pr[B]$. But we focus on the inequality, since it can hold when $A$ and $B$ are not independent. In general, without some assumption on the events $A$ and $B$, the above inequality is not true: Consider the event $A$ that a fair coin comes up heads and the event $B$ that it comes up tails. Then the inequality becomes $0 \ge \frac{1}{2}\cdot\frac{1}{2} = \frac{1}{4}$, which is false.

Since independence is not always true for two events, it is of great value to know when $\Pr[A \cap B] \ge \Pr[A]\cdot\Pr[B]$ is still true. Even an approximation is of great value. Note, a simple case where it still is true is when $A \subseteq B$: then the inequality holds trivially, since $\Pr[A \cap B] = \Pr[A] \ge \Pr[A]\cdot\Pr[B]$.
GCI reminds us of another inequality that intuitively cuts very fine and was difficult to prove: the FKG inequality. Ron Graham wrote a survey of FKG that begins with a discussion of Chebyshev’s sum inequality, named after the famous Pafnuty Chebyshev.
Chebyshev's sum inequality states that if

$a_1 \ge a_2 \ge \cdots \ge a_n$

and

$b_1 \ge b_2 \ge \cdots \ge b_n,$

then

$\displaystyle \frac{1}{n}\sum_{k=1}^{n} a_k b_k \;\ge\; \left(\frac{1}{n}\sum_{k=1}^{n} a_k\right)\left(\frac{1}{n}\sum_{k=1}^{n} b_k\right).$
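A quick numerical check of the sum inequality (the two decreasing sequences here are arbitrary choices):

```python
def chebyshev_gap(a, b):
    """LHS minus RHS of Chebyshev's sum inequality; nonnegative when
    a and b are sorted in the same order."""
    n = len(a)
    lhs = sum(x * y for x, y in zip(a, b)) / n
    rhs = (sum(a) / n) * (sum(b) / n)
    return lhs - rhs

a = [5, 4, 2, 2, 1]   # both sequences decreasing
b = [9, 7, 6, 3, 0]
gap = chebyshev_gap(a, b)   # nonnegative, as the inequality promises
```

Reversing one of the sequences flips the inequality, which is the same "similarly ordered versus oppositely ordered" dichotomy that FKG generalizes.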
Wikipedia’s FKG article says how the relevance expands to other inequalities:
Informally, [FKG] says that in many random systems, increasing events are positively correlated, while an increasing and a decreasing event are negatively correlated.
An earlier version, for the special case of i.i.d. variables, … is due to Theodore Edward Harris (1960) … One generalization of the FKG inequality is the Holley inequality (1974) below, and an even further generalization is the Ahlswede-Daykin “four functions” theorem (1978). Furthermore, it has the same conclusion as the Griffiths inequalities, but the hypotheses are different.
We wonder whether the new results on GCI will spur an over-arching appreciation of all these inequalities involving correlated variables. We also wonder if in the complex case there is any connection between Royen’s smoothing technique and the process of purifying a mixed quantum state.
The amazing personal fact is that a retired mathematician solved the problem and did it with a relatively simple proof. What does this say about our core conjectures in theory? I am near retirement from Georgia Tech—does that mean I will solve some major open problem? Hmmmmmmm.
Also, which of you have had key insights come in the bathroom?
[nonsingular R–>positive definite R, other tweaks]
Boaz Barak and Michael Mitzenmacher are well known for many great results. They are currently working not on a theory paper, but on a joint “experiment” called Theory Fest.
Today Ken and I want to discuss their upcoming experiment and spur you to consider attending it.
There are many pros and some cons in attending the new Theory Fest this June 19-23. One pro is where it is being held—Montreal—and another is the great collection of papers that will appear at the STOC 2017 part of the Fest. But the main ‘pro’ is that Boaz and Mike plan on doing some special events to make the Fest more than just a usual conference on theory.
The main ‘con’ is that you need to register soon here, so do not forget to do that.
We humbly offer some suggestions to spice up the week:
A Bug-a-thon: Many conferences have hack-a-thons these days. A theory version could be a P=NP debugging contest. Prior to the Fest, anyone claiming to have solved P vs. NP must submit a paper along with a $100 fee—Canadian. At the Fest, teams of “debuggers” would get the papers and have a fixed time—say three hours—to find a bug in as many papers as they can. The team that debugs the most claims wins the entrance fees.
Note that submissions can be “stealth”—you know your paper is wrong, but the bugs are very hard to find.
Present a Paper: People submit a deck for a ten-minute talk. Then each person is randomly assigned a deck and must give a talk based only on that deck. There will be an audience vote and the best presenter will win a trophy.
Note there are two theory issues. The random assignment must be random but fixed-point free—no one can get their own deck. Also, since going last seems to give an unfair advantage, we suggest that each person gets the deck only ten minutes before their talk. Thus all presenters would have the same time to prepare.
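A fixed-point-free random assignment is exactly a random derangement; here is a sketch of sampling one uniformly by rejection:

```python
import random

def random_derangement(n, rng=random):
    """Uniformly random permutation of range(n) with no fixed points,
    sampled by rejection (needs n >= 2; expected tries ~ e)."""
    while True:
        perm = list(range(n))
        rng.shuffle(perm)
        if all(perm[i] != i for i in range(n)):
            return perm

# assignment[i] = index of the deck handed to presenter i
assignment = random_derangement(20)
```

Rejection works because a random permutation is a derangement with probability approaching $1/e$, so only a constant expected number of shuffles is wasted.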
Silent Auction For Co-authorship: We will set up a series of tables. On each table is a one page abstract of a paper. You get to bid as in a standard silent auction. The winner at each table becomes a co-author and pays their bid to STOC. The money could go to a student travel fund.
The A vs B Debate: Theory is divided into A and B, at least in many conferences. We will put together a blue-ribbon panel and have them discuss: Is A more important than B? We will ask that the panel be as snippy as possible—a great evening idea while all drink some free beer.
Betting: We will have a variety of topics from P=NP to quantum computation where various bets can be made.
Cantal Complexity: The Fest will mark the 40th anniversary of Donald Knuth’s famous paper, “The Complexity of Songs.” Evening sessions at a pub will provide unprecedented opportunity for applied research in this core area. Ken’s research, which he began with Dexter Kozen and others at the ICALP 1982 musicfest, eventually led to this.
Lemmas For Sale: In an eBay-like manner a lemma can be sold. We all have small insights that we will never publish, but they might be useful for others.
Zoo Excursion: This is not to the Montreal zoo—which is rather far—but to the Complexity Zoo which is housed elsewhere in Canada. Participants will take a virtual tour of all 535 classes. The prize for “collapsing” any two of them will be an instant STOC 2017 publication. In case of collapsing more than two, or actually finding a new separation of any pair of them, see under “Bug-a-thon” above.
Write It Up: This is a service-oriented activity. Many results have never been written up formally and submitted to journals. Often the reason is that the author(s) are busy with new research. This would be a list of such papers and an attempt to get students or others to write up the paper. This has actually happened many times already in an informal manner, so organizing it might be fun. We could use money to get people to sign up—or give a free registration to next year's conference, for example.
GLL plans on gavel-to-gavel coverage of the Fest: we hope to have helpers that will allow us to make at least one post per day about the Fest. Anyone interested in being a helper should contact us here.
This will be especially appreciated because Ken will be traveling to a different conference in a voivodeship that abuts an oblast and two voblasts.