As moderator of RSA 2016 panel
Paul Kocher is the lead author on the second of two papers detailing a longstanding class of security vulnerability that was recognized only recently. He is an author on the first paper. Both papers credit his CRYPTO 1996 paper as originating the broad kind of attack that exploits the vulnerability. That paper was titled, “Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems.” Last week, the world learned that timing attacks can jeopardize entire computing systems, smartphones, the Cloud, everything.
Today we give a high-level view of the Meltdown and Spectre security flaws described in these papers.
Both flaws are at processor level. They are ingrained in the way modern computers operate. They are not the kind of software vulnerabilities that we have discussed several times before. Both allow attackers to read any memory location that can be mapped to the kernel—which on most computers allows targeting any desired memory contents. Meltdown can be prevented by software patches—at least as we know it—but apparently no magic bullet can take out Spectre.
Kocher was mentioned toward the end of a 2009 post by Dick titled, “Adaptive Sampling and Timed Adversaries.” This post covered Dick’s 1993 paper with Jeff Naughton on using timing to uncover hash functions. Trolling for hash collisions and measuring the slight delays needed to resolve them required randomized techniques and statistical analysis to extract the information. No such subtlety is needed for Meltdown and Spectre—only the ability to measure time in coarse units.
The attacks work because modern processors figuratively allow cars—or trolleys—to run red lights. An unauthorized memory access will raise an exception but subsequent commands will already have been executed “on spec.” Or if all the avenue lights are green but the car needs to turn at some point, they will still zoom it ahead at full speed—and weigh the saving of not pausing to check each cross-street versus the minimal backtrack needed to find a destination that is usually somewhere downtown.
Such speculative execution leverages extra processing capacity relative to other components to boost overall system speed. The gain in time from having jumped ahead outweighs the loss from computing discarded continuations. The idea of “spec” is easiest to picture when the code has an if-else branch. The two branches usually have unequal expected frequencies: the lesser one may close a long-running loop that the other continues, or may represent failure of an array-bounds test that generally succeeds. So the processor applies the scientific principle of induction to jump always onto the fatter branch, backtracking when (rarely) needed.
Meltdown applies to the red-light situation, Spectre to branches. Incidentally, this is why the ghost in the logo for Spectre is holding a branch:
The logos were designed by Natascha Eibl of Graz, Austria, whose artistic website is here. Four authors of both papers are on the faculty of the Graz University of Technology, which hosts the website for the attacks. The Graz team are mostly responsible for the fix to Meltdown called KPTI for “kernel page-table isolation,” but the Spectre attack is different in ways that make this fix inapplicable.
There have been articles like this decrying the spectre of a meltdown of the whole chip industry. We’ll hold off on speculating about impending executions and stay with describing how the attacks work.
The Meltdown paper gives details properly in machine code, but we always try to be different, so we’ve expressed its main example in higher-level C code to convey how an attacker can really pull this off.
To retrieve the byte value K[x] at a targeted location x in the kernel’s virtual memory map K, the attacker can create a fresh array A of 256 objects whose width is known to be the page size of the cache. The contents of A don’t matter, only that initially no member has been read from, hence not cached. The attacker then submits the following code using a process fork or try-catch block:
object Q;                    // loaded into chip memory
byte b = 0;
while (b == 0) {
    b = K[x];                // violates privilege---so raises an exception
}
Q = A[b];                    // should not be executed but usually is
// continue process after subprocess dies or exception is caught:
int T[256];
for (int i = 0; i < 256; i++) {
    T[i] = the time to find A[i];
}
if T has a clear minimum T[i], output i; else output 0.
What happens? Let’s first suppose the secret byte K[x] is nonzero. By “spec” the while-loop exits and the read from A[b] happens before the exception kills the child. This read generally causes the contents of A[b] to be read into the cache and causes the system to note this fact about the index b. This note survives when the second part of the code is (legally) executed and causes the measured time T[i] to be markedly low only when i = b, because only that page is cached. Thus the value b = K[x] is manifested in the attacker’s code, which can merrily continue executing to get more values.
The reason for the special handling of zero is that a “race condition” exists whose outcome can zero out b or leave its initial value untouched. The while-loop keeps trying until the race is won. If the secret value really is zero then the loop will either raise the exception or iterate until a segmentation fault occurs. The latter causes Q = A[0] not to be executed, but then the initial condition that no page of A is cached still holds, so no time T[i] is markedly lower, so 0 is returned after all.
A second key point is that Intel processors allow the speculatively executed code the privilege needed to read K[x]. Again, the Meltdown paper has all the details expressed properly, including how cache memory is manipulated and how to measure each T[i] without dislodging A[b] from its cache. The objects need not be as big as the page size—they only need to be spaced that far apart in A. This and some other tunings enabled the Meltdown paper’s experiment to read over 500,000 supposedly-protected bytes per second.
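The timing-recovery step is easy to mimic in ordinary code. Here is a toy Python simulation of ours, not from the paper: the “cache” is a set, the latencies are invented constants, and the privileged read is just a function call, but the scan over A’s indices is the same flush-and-measure logic as in the attack code.

```python
# Toy model of the Meltdown covert channel (a simulation, not an exploit):
# the "speculative" read deposits A[b] in a simulated cache, and the
# attacker recovers b purely by comparing access times.
SECRET = 0x2A                       # stands in for the protected byte K[x]
CACHED_NS, UNCACHED_NS = 10, 200    # invented illustrative latencies

cache = set()                       # indices of A currently cached

def speculative_read():
    """The transiently executed Q = A[b]: rolled back architecturally,
    but its cache footprint survives."""
    cache.add(SECRET)

def probe_time(i):
    return CACHED_NS if i in cache else UNCACHED_NS

speculative_read()
times = [probe_time(i) for i in range(256)]
recovered = min(range(256), key=times.__getitem__)
print(hex(recovered))               # recovers SECRET
```

The recovery loop is exactly the T[i] scan above: one index stands out by being markedly faster.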
The Spectre attack combines “spec” with the old bugaboo of buffer overflows. Enforcing array bounds is not only for program correctness but also for securing boundaries between processes. The attacker uses an array B with a size bound s and an auxiliary array A of 256 spaced-out entries. The attacker needs to discover some facts about the victim’s code and arrange that addresses for overflowing B will map into the victim’s targeted memory. The first idea is to induce enough accesses with valid x < s to train the branch predictor to presume compliance, so that it will execute in advance the body of the critical line of code:
if (x < s) { y = A[256*B[x]]; }
To create the delay during which the body will execute speculatively, the attacker next thrashes the cache so that s will likely be evicted from it. Finally, the attacker chooses some x >= s and re-enters the critical line. The bounds check causes a cache miss so that the "spec" runs. Not only will it deliver the targeted byte B[x], it will cache it at the spaced-out location A[256*B[x]] (the page size is not involved). Then the value of B[x] is recovered much as with Meltdown.
Spectre is more difficult to exploit but what makes it scarier is that now the out-of-bounds access is not treated as guarded by a higher privilege level: it involves the attacker’s own array A. Even if the attacker has limited access to A, there is a way: Call the critical line with random valid x' a few hundred times. There is a good chance that a value B[x'] that equals B[x] will be found. Since A[256*B[x]] is in the cache, the line y = A[256*B[x']] will execute faster than the other cases. Detecting the faster access again leaks B[x].
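The mistraining step can likewise be caricatured. In this toy Python model (ours, with invented values), a one-bit predictor for the bounds test is trained by in-bounds calls; a single out-of-bounds call then runs the body “speculatively” and deposits B[x] in the simulated cache, from which the timing scan recovers it.

```python
# Toy model of Spectre-style mistraining (pure simulation, no real
# speculation): a one-bit branch predictor is trained on in-bounds
# accesses, then an out-of-bounds x leaks B[x] through the cache.
S = 16
B = list(range(S)) + [0x5A]           # B[16] models the victim's secret byte
cache = set()                         # which indices of A are cached
predict_taken = False                 # predictor state for "x < s"

def critical_line(x):
    global predict_taken
    if predict_taken and not (x < S): # misprediction: speculative body runs
        cache.add(B[x])               # transient access caches A[256*B[x]]
    predict_taken = (x < S)           # predictor learns the real outcome

for x in range(S):                    # training phase: all in bounds
    critical_line(x)
cache.clear()                         # attacker thrashes the cache
critical_line(S)                      # out-of-bounds access under "spec"
leaked = min(range(256), key=lambda i: 10 if i in cache else 200)
print(hex(leaked))                    # recovers B[16]
```

Everything numeric here is an assumption for illustration; the real attack needs the careful cache and predictor manipulation the paper describes.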
The paper concludes with variants that allow similar timing attacks to operate under even weaker conditions. One does not even need to manipulate the cache beforehand. The last one is titled:
Leveraging arbitrary observable effects.
That is, let O(y) stand for “something using y that is observable when speculatively executed.” Then the following lines of code are all it takes to compromise the victim’s data:
if (x < s) { y = A[256*B[x]]; O(y); }
We’ve talked about timing attacks, but can we possibly devise a concept of timing defenses? On first reflection the answer is no. Ideas of scrambling data in memory space improve security in many cases, but scrambling computations in time seems self-defeating. Changing reported timings randomly and slightly in the manner of differential privacy is useless because the timing difference of the cached item is huge. Computers need to assure fine-grained truth in timing anyway.
Besides timing, there are physical effects of power draw and patterns of electric charge and even acoustics in chips that have been exploited. Is there any way the defenders can keep ahead of the attackers? Can the issues only be fixed by a whole new computing model?
[some word and format changes, qualified remark on software nature in intro and linked more related posts]
Muhammad Afzal Upal is Chair of the Computing and Information Science Department at Mercyhurst University. He works in machine learning and cognitive science, most specifically making inferences from textual data. In joint papers he has refined a quantitative approach to the idea of postdiction originally suggested by Walter Kintsch in the 1980s.
Today we review some postdictions from 2017 and wish everyone a Happy New Year 2018.
In a 2007 paper, the postdictability of a concept is defined as “the ease with which a concept’s inclusion in the text can be justified after the textual unit containing that concept has been read.” This contrasts with “the ease with which the occurrence of the concept can be predicted prior to the concept having been read.” The main equation defines the extent to which the concept—or event, I may add—is memorable in terms of the prior likelihood of the concept emerging and a constant scale factor. It says that the concept is most memorable if you couldn’t have predicted it but after you see it you slap your forehead and say, “Ah, of course!” It relates to what makes ideas “stick.”
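To make the shape of the idea concrete, here is one plausible instantiation in Python. The exact formula is in the cited papers, so treat the functional form, the argument names, and the constant below purely as assumptions for illustration.

```python
# One plausible instantiation of the memorability idea (NOT Upal's exact
# formula): memorability rises with postdictability and falls with the
# prior predictability p, scaled by a constant c.
def memorability(p, postdictability, c=1.0):
    return c * (1.0 - p) * postdictability

# A concept that was unpredictable but easy to justify afterward
# scores highest:
print(memorability(0.1, 0.9) > memorability(0.9, 0.9))   # True
print(memorability(0.1, 0.9) > memorability(0.1, 0.1))   # True
```

Any formula with this monotone behavior captures the forehead-slap effect; the papers pin down the specific one.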
Mercyhurst is in Erie, Pennsylvania. Erie had lots of snow this past week. Record–breaking snow. More than Buffalo usually gets. We had several relatives and friends who had to drive through it on their way to Michigan and Pittsburgh and points further south. Was that karma? coincidence with this post? memorable in a way that fits the framework?
And how about the Buffalo Bills making the playoffs after a miracle touchdown by the Cincinnati Bengals on fourth-and-12 from midfield in the final minute knocked out the Baltimore Ravens? In designing a two-week unit on data science for my department’s new team-taught Freshman Seminar on “The Internet,” I had the foresight to use years-since-a-team’s-last-playoff-win (not last playoff appearance) as the definition of “playoff drought” in activity examples. Hence—unless the Bills upset the Jacksonville Jaguars next Sunday—the local “nudge” of my materials will work equally well for next fall’s offering. Can one quantify my prescience as prediction? postdiction? Let’s consider some more-germane examples.
Last January we did not do a predictions or year-in-review post as we had done in all seven previous years. We were caught up in questions over László Babai’s graph isomorphism theorem and other matters. Several predictions were recurring, so let’s suppose we made them also for 2017:
Since some of our perennial questions have entered a steady state, it is time to find new categories. A week-plus ago we noticed that Altmetric publishes a top-100 list based on their “Altmetric Attention Score” every November 15. So it is natural to suppose we postulated:
The answer with regard to the 2017 list is “yes” but the reason is unfortunate—it is Norbert Blum’s incorrect paper coming in at #38. Blum gave a formal retraction and subsequent explanation, which we added as updates to our own item on the claim. The only (other) paper labeled “Information and Computer Sciences” is the AlphaGo Zero paper at #74. Actually, Blum’s paper was tagged “Research and Reproducibility.”
AlphaGo Zero and most recently AlphaZero spring to mind. With much swallowing of pride from my having started out as a chess-player in the early 1970s when computers were minimal, I’m not sure that games of perfect information should ultimately be regarded as “human-centric.” Based on my current understanding of the AlphaZero paper and comments by Gwern Branwen in our previous post, what strikes me as the most stunning fact is the following:
Chess can be encoded into a ‘kernel’ of well under 1GB such that the kernel + small search comprehensively outperforms an almost 1,000x larger search.
More on the human-centric side, however—and allowing supervision—the most surprise and attention seems to have gone to the 2017 Stanford study adapting off-the-shelf facial-analysis software to distinguish sexual orientation from photographs with accuracy upwards of 90%, compared to human subjects at 52% from a balanced sample, which is barely better than random. For utility we would nominate LipNet, which achieves over 95% accuracy in lip-reading from video data, but the paper dates to December 2016.
The lip-reading success may be the more predictable. The extent to which it and the “gaydar” application are postdictable appears to be the same as our reaching a community understanding of what deep neural nets are capable of—which does not require being able to explain how they work. Setting up grounds beforehand for the justification by which Upal and his co-authors define postdiction might be a fair way of “giving credit for a postdiction.”
Per Lance Fortnow in his own 2017 review, the complexity result of the year is split between two papers claiming to prove full dichotomy for nonuniform CSPs—where dichotomy means that they are either in P or NP-complete. Meanwhile we have devoted numerous posts to Jin-Yi Cai’s work on dichotomy between P and #P-completeness, including recently. So can we get some credit for prediction? or for postdiction? Anyway, we make it a prime prediction for 2018 that there will be notable further progress in this line.
We specify quantum supremacy to mean building a physical device that achieves a useful algorithmic task that cannot be performed in equal time by classical devices using commensurate resources. The words “useful” and “commensurate” are subjective, but the former rules out stating the task as “simulating natural quantum systems” and furthers John Preskill’s emphasis in his original 2012 supremacy paper that the quantum device must be controllable. The latter rules out using whole server farms to match what a refrigerator-sized quantum device can do. The notion involves concrete rather than asymptotic complexity, so we are not positing anything about the hardness of factoring, and intensional tasks like Simon’s Problem don’t count—not to mention our doubts on the fairness of its classical-quantum comparison. Aram Harrow and Ashley Montanaro said more about supremacy requirements in this paper.
Our “postdiction” gets a yes for 2017 from the claims in this Google-led paper that 50-qubit devices would suffice to achieve supremacy and are nearly at hand, versus this IBM-led rebuttal showing that classical computers can emulate a representative set of 50-qubit computations. The notion of emulation allows polling for state information of the quantum circuit computation being emulated, so this is not even confronting the question of solving the task by other means—or proving that classical resources of some concrete size cannot solve all length-n cases of the task at all. Recent views on the controversy are expressed in this November 14 column in Nature, which links this October 22 post by Scott Aaronson (see also his paper with Lijie Chen, “Complexity-Theoretic Foundations of Quantum Supremacy Experiments”) and this December 4 paper by Cristian and Elena Calude which evokes the Google-IBM case.
That is, the notions of algorithm and protocol will fuse into greater structures with multiple objectives besides solving a task. In 2016 we noted Noam Nisan’s 2012 Gödel Prize-winning paper with Amir Ronen titled “Algorithmic Mechanism Design.” Noam’s 2016 Knuth Prize citation stated, “A mechanism is an algorithm or protocol that is explicitly designed so that rational participants, motivated purely by their self-interest, will achieve the designer’s goals.” In November we covered mechanisms for algorithmic fairness. There is a nicely accessible survey titled “Algorithms versus mechanisms: how to cope with strategic input?” by Rad Niazadeh who works in this area. It alloys techniques from many veins of theory and has a practical gold mine of applications. What we are watching for is the emergence of single powerful new ideas from this pursuit.
We see some sign of this in the dichotomy-for-CSPs result, but we have thoughts that we will talk more about in a later post.
What concepts do you think will have the highest memorability in 2018?
[some word and spacing changes]
YouTube 2015 lecture source
David Silver is the lead author on the paper, “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm,” which was posted twelve days ago on the arXiv. It announces an algorithm called AlphaZero that, given the rules of any two-player game of strategy and copious hardware, trains a deep neural network to play the game at skill levels approaching perfection.
Today I review what is known about AlphaZero and discuss how to compare it with known instances of perfect play.
AlphaZero is a generalization of AlphaGo Zero, which was announced last October 18 on the Google DeepMind website under the heading “Learning From Scratch.” A paper in Nature, with Silver as lead author, followed the next day. Unlike the original AlphaGo, whose victory over the human champion Lee Sedol we covered, AlphaGo Zero had no input other than the rules of Go and some symmetry properties of the board. From round-the-clock self-play it soon acquired as tutor the world’s best player—itself.
The achievements in Go and Shogi—the Japanese game whose higher depth in relation to Western chess we discussed three years ago—strike us as greater than AlphaZero’s score of 28 wins, 72 draws, and no losses against the champion Stockfish chess program. One measure of greatness comes from the difference in Elo ratings between the machine and the best human players. AlphaGo Zero’s measured rating of 5185 is over 1,500 points higher than the best human players on the scale used in Go. In Shogi, the paper shows AlphaZero zooming toward 4500 whereas the top human rating shown here as of 11/26/17 is 2703, again a difference of over 1,500. In chess, however, as shown in the graphs atop page 4 of the paper, AlphaZero stays under 3500, which is less than 700 ahead of human players.
Although AlphaZero’s 64-36 margin over Stockfish looks like a shellacking, it amounts to only 100 points difference on the Elo scale. The scale was built around the idea that a 200-point difference corresponds to about 75% expectation for the stronger player—and this applies to all games. Higher gains become multiplicatively harder to achieve and maintain. This makes the huge margins in Go and Shogi all the more remarkable.
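Both numbers in that paragraph follow from the standard logistic form of the Elo expectancy curve; here is a quick check (the official tables differ slightly at the extremes):

```python
import math

def expected_score(diff):
    """Logistic Elo formula: expectation for a player rated
    `diff` points above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

print(round(expected_score(200), 2))   # 0.76, the classic ~75% benchmark

def rating_gap(score):
    """Invert the formula: rating difference implied by a match score."""
    return 400.0 * math.log10(score / (1.0 - score))

print(round(rating_gap(0.64)))         # AlphaZero's 64-36 margin -> 100 points
```

The same inversion shows why each further 100-point gain demands a multiplicatively more lopsided score.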
There has been widespread criticism of the way Stockfish was configured for the match. Stockfish was given 1 minute per move regardless of whether it was an obvious recapture or a critical moment. It played without its customary opening book or endgame tables of perfect play with 6 or fewer pieces. The 64 core threads it was given were ample hardware but they communicated position evaluations via a hash table of only one gigabyte, a lack said to harm the accuracy of deeper searches. However hobbled, what stands out is that Stockfish still drew almost three-fourths of the games, including exactly half of the games it played as Black.
I have fielded numerous queries these past two weeks about how this affects my estimate that perfect play in chess is rated no higher than 3500 or 3600, which many others consider low. Although the “rating of God” moniker is played up for popular attention, it really is a vital component in my model: it is the y-intercept of regressions of player skill versus model parameters and inputs. I’ve justified it intuitively by postulating that slightly randomized versions of today’s champion programs could score at least 10–15% against any strategy. I regard the ratings used for the TCEC championships as truer to the human scale than the CCRL ratings. TCEC currently rates the latest Stockfish version at 3226, then 3224 for Komodo and 3192 for the Houdini version that won the just-completed 10th TCEC championship. CCRL shows all of Houdini, Komodo, and an assembly-coded version of Stockfish above 3400. Using the TCEC ratings and the official Elo “p-table” implies that drawing 2 or 3 of every 20 games holds the stronger player to the 3500–3600 range.
Of course, the difference from Go or Shogi owes to the prevalence of draws in chess. One ramification of a post I made a year ago is that the difference is not merely felt at the high end of skill. The linear regressions of Elo versus player error shown there are so sharp that the y-intercept is already well determined by the games of weaker players alone.
Overall, I don’t know how the AlphaZero paper affects my estimates. The Dec. 5 paper is sketchy and only 10 of the 100 games against Stockfish have been released, all hand-picked wins. I share some general scientific caveats voiced by AI researcher and chess master Jose Camacho-Collados. I agree that two moves by AlphaZero (21. Bg5!! and 30.Bxg6!! followed by 32.f5!! as featured here) were ethereal. There are, however, several other possible ways to tell how close AlphaZero comes to perfection.
One experiment is simply to give AlphaZero an old-fashioned examination on test positions for which the perfect answers are known. These could even be generated in a controlled fashion from chess endgames with 7 or fewer pieces on the board, for which perfect play was tabulated by Victor Zakharov and Vladimir Makhnichev using the Lomonosov supercomputer of Moscow State University. Truth in those tables is often incredibly deep—in some positions the win takes over 500 moves, many of which no current chess program (not equipped with the tables) let alone human player would find. Or one can set checkmate-in-n problems that have stumped programs to varying degrees. The question is:
With what frequency can the trained neural network plus Monte Carlo Tree Search (MCTS) from the given position find the full truth in the position?
The trained neural network supplies original probabilities for each move in any given position. AlphaZero plays games against itself using those probabilities and samples the results. It then adjusts parameters to enhance the probabilities of moves having the highest expectation in the sample, in a continuous and recursive manner for positions encountered in the search from the given position. The guiding principle can be simply stated as:
“Nothing succeeds like success.”
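As a caricature of that principle, here is a tiny Python sketch of ours, using a three-move bandit in place of real self-play: weights of moves that succeed in sampled play get boosted, and the strongest move comes to dominate the policy. Nothing here is AlphaZero’s actual update rule; every number is an assumption for illustration.

```python
import random

# A toy of the "nothing succeeds like success" loop: moves that win in
# sampled play have their selection weights boosted. The game is a
# three-armed bandit stand-in for real self-play.
random.seed(0)
win_prob = [0.2, 0.5, 0.8]      # hidden quality of three candidate moves
weights = [1.0, 1.0, 1.0]       # the "network's" move priors

for _ in range(2000):
    move = random.choices(range(3), weights=weights)[0]
    if random.random() < win_prob[move]:   # sampled "self-play" outcome
        weights[move] *= 1.01              # reinforce success
    else:
        weights[move] *= 0.99              # discourage failure

print(max(range(3), key=weights.__getitem__))  # the best move pulls ahead
```

Multiplicative boosting makes winning moves get sampled ever more often, which is the self-tutoring flavor of the real algorithm.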
We must pause to reflect on how clarifying it is that this single heuristic suffices to master complex games—games that also represent a concrete face of asymptotic complexity insofar as their size n-by-n generalizations are polynomial-space hard. A famous couplet by Piet Hein goes:
Problems worthy of attack
prove their worth by hitting back.
It may be that we can heuristically solve some NP-type problems better by infusing an adversary—to make a PSPACE-type problem that hits back—and running AlphaZero.
As seekers of truth, however, we want to know how AlphaZero will serve as a guide to perfection. We can regard a statement of the form, “White can win in 15 moves” (that is, 29 moves counting both players) as a theorem for which we seek a proof. We can regard the standard alpha-beta search backbone as one proof principle and MCTS as another. Which ‘logic’ is more powerful and reliable in practice?
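For concreteness, here is the alpha-beta “proof principle” in its textbook form, sketched in Python over an explicit game tree. The tree and its leaf scores are invented for illustration; real programs generate the tree from a position on the fly.

```python
# Bare-bones alpha-beta search over an explicit game tree: inner nodes
# are lists of children, leaves are integer scores from the viewpoint
# of the player to move at the root.
def alphabeta(node, alpha=float("-inf"), beta=float("inf"), maximizing=True):
    if isinstance(node, int):          # leaf: static score
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:          # cutoff: this line is refuted
                break
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True))
        beta = min(beta, value)
        if alpha >= beta:
            break
    return value

# Classic three-branch example: the minimax value is 6, and the last
# branch is pruned after its first leaf.
tree = [[3, 5], [6, 9], [1, 2]]
print(alphabeta(tree))   # 6
```

A claim like “White can win in 15 moves” corresponds to the root value here being a forced win; alpha-beta certifies it by refuting every alternative, whereas MCTS accumulates statistical rather than exhaustive evidence.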
A second way to test perfection is to take strategy games that are just small enough to solve entirely, yet large enough that stand-alone programs cannot play perfectly on-the-fly. One candidate I offer is a game playable with chess pawns or checkers on a board with 5 rows and a number n of columns, where perhaps n can be set to achieve the small-enough/large-enough balance. I conceived this 35 years ago at Oxford when a smaller setting seemed right for computers of the day. The starting position is:
Each player’s pawns move one square forward or may “hop” over an opposing piece straight forward or diagonally forward. If some hop move is legal then the player must make a hop move. The hopped-over piece remains on the board. If a pawn reaches the last row it becomes a king and thereupon moves or hops backwards. No piece is ever captured.
The goal is to make your opponent run out of legal moves. If a king reaches the player’s first row it can no longer move. This implies a fixed horizon on a player’s count of moves. The trickiest rules involve double-hops: If a single hop and double hop are available then the double hop is not mandatory, but if a pawn on the first row begins a double hop it must complete it. Upon becoming a king after a hop, however, making a subsequent return hop is optional, except that a king that makes the first leg of a returning double hop must make the second leg. A final optional rule is to allow a king to move one cell back diagonally as well as straight.
From the starting position, White can force Black to hop four times in four moves by moving to a3, a4, d3, and d4. Then White still has the initiative and can either make a king or force another hop; the latter, however, forces White to hop diagonally in return. This seems like a powerful beginning but the subsequent Black initiative also looked strong. My playing partners at Oxford and I found that positional considerations—making the opponent’s pieces block themselves—mattered as much as the move-count allowance. This made it challenging and fun enough for casual human play, but we knew that computers should make quick work of it.
The point of using this or some other solved game would be to compare the strategy produced by the AlphaZero procedure against perfection—and against programs that use traditional search methods.
What do you think are the significances of the AlphaZero breakthrough?
An old result put a new way (in a now-fixed-up post)
Albert Meyer knows circuit lower bounds. He co-authored a paper with the late Larry Stockmeyer that proves that small instances of the decision problem of a certain weak second-order logical theory require Boolean circuits with more gates than there are atoms in the observable universe. The instances almost fit into two Tweets using just the Roman typewriter keys.
Today Ken and I discuss a simple but perhaps overlooked connection between P=NP and circuit lower bounds.
Albert recently co-authored, with Eric Lehman of Google and his MIT colleague Tom Leighton, the textbook Mathematics for Computer Science. It looks familiar to us because it uses the same MIT Press fonts and layout package as our quantum computing book. They say the following in their Introduction:
Simply put, a proof is a method of establishing truth. Like beauty, “truth” sometimes depends on the eye of the beholder, and it should not be surprising that what constitutes a proof differs among fields.
We would say that the game is not only about making truth evident through a proof but also through the way a theorem statement is expressed. This post uses an old result of Albert’s as an example.
Well, any criticism of Albert for how his theorem was stated is really criticism of myself, because Dick Karp and I were the first to state it in a paper. Here is exactly how we wrote it in our STOC 1980 version:
There was no paper by Albert to cite. In our 1982 final version we included a proof of his theorem:
As you can see, our proof—Albert’s proof?—used completeness for EXP and some constructions earlier in the paper. In a preliminary section we wrote that our proofs about classes such as EXP involved showing inclusions
“where the set of strings involved is complete in the class with respect to an appropriate reducibility.”
But in this case the proof does not need completeness for EXP. I came up with this realization on Wednesday and Ken found essentially the same proof in these lecture notes by Kristoffer Hansen:
This proof uses small circuits not only for the language L itself but also for its tableau language. For the latter it suffices to have the tableau language polynomial-time reduce to L. This follows so long as L is complete for some class to which the tableau language also belongs.
Suppose L runs in deterministic time t(n). Then both L and its tableau language belong to deterministic time polynomial in t(n), the latter because on a given input we can run L’s machine for t(n) steps. With minimal assumptions on the function t, this time class has complete sets, and then the needed reduction of the tableau language to L follows from completeness. So we can state the theorem more generally:
In fact we would get into even smaller classes, but that’s going beyond our simple point.
Our point comes through if we think of a concrete case like deterministic time 2 to the polylog(n), called QP for quasipolynomial time. So we have:
Corollary 2 If QP ⊆ P/poly then QP ⊆ Σ₂ᵖ.
Now mentally substitute QP for EXP (and ‘quasipolynomial’ for ‘exponential’) in the way Karp and I summarized the final implication in our paper:
What you get after contraposing and using the hierarchy theorem for QP is:
Corollary 3 If P = NP then QP ⊄ P/poly.
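Spelling out the contrapositive as a chain, with QP standing for quasipolynomial time (our sketch of the standard argument):

```latex
\begin{aligned}
&\text{Assume } \mathsf{P} = \mathsf{NP} \text{ and, for contradiction, } \mathsf{QP} \subseteq \mathsf{P/poly}.\\
&\mathsf{QP} \subseteq \mathsf{P/poly} \implies \mathsf{QP} \subseteq \Sigma_2^p
  \quad\text{(the Meyer-style collapse)},\\
&\mathsf{P} = \mathsf{NP} \implies \Sigma_2^p = \mathsf{P},
  \quad\text{hence } \mathsf{QP} \subseteq \mathsf{P},\\
&\text{contradicting the time hierarchy theorem } \mathsf{P} \subsetneq \mathsf{QP}.
\end{aligned}
```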
The point is that we can also do this for smaller time bounds and even smaller proper super-classes of P. What follows is:
Any attempt to prove P = NP entails proving strong nonuniform circuit lower bounds on languages that are arbitrarily close to being in P.
Again in the case of EXP this implication too has been variously noted. Scott Aaronson mentions it in one sentence of his great recent 121-page survey on the P versus NP question (p. 65):
“[I]f someone proved [P = NP], that wouldn’t be a total disaster for lower bounds research: at least it would immediately imply [circuit lower bounds for EXP].”
Maybe I (Dick) considered this in terms of EXP in weighing my thoughts about P versus NP. But that it applies to QP in place of EXP gives me pause. This greatly amplifies idle thoughts about the irony of how proving P = NP yields the same type of lower bounds against P/poly that are involved in the “Natural Proofs” barrier against proving P ≠ NP. Ryan Williams had to combine many ideas just to separate NEXP from nonuniform ACC⁰—not even getting EXP on the left nor full P/poly on the right. (Incidentally, we note this nice recent MIT profile of Ryan.) So having such lower bounds for QP just drop from the sky upon P = NP seems jarring.
So I’m rethinking my angle on P versus NP. I’ve always propounded that good lower bounds flow like ripples from new upper bounds, but the wake of P = NP seems a tsunami. We wonder if Bill Gasarch will do a 3rd edition of his famous poll about P versus NP. Ken and I offset each other with our votes last time, but maybe not this time.
We also wonder whether Theorem 1 can be given even stronger statements in ways that are useful. In the original version of this post we overlooked a point noted first by Ryan Williams here and thought we had a stronger statement. To patch it, call a language L in EXP “reflective” if there is a TM M running in exponential time such that L = L(M) and M’s “tableau” language (defined above) polynomial-time reduces to L. The complete sets mentioned above for classes within EXP are reflective. Restricting to the subclass of reflective languages, we can say:
Note that per Lance Fortnow’s comment here, sparse languages are candidates for being non-reflective: the tableau language, which we would wish to polynomial-time Turing reduce to them, is generally dense.
Is this realization about P = NP and strong circuit lower bounds arbitrarily close to P really new? Can our readers point us to other discussions of it?
Is the notion of “reflective” known? useful?
[fixed error in original Theorem 1 and surrounding text; added paragraph about it before “Open Problems”; moved query about “cosmological” formulas to a comment.]
Kurt Gödel is feeling bored. Not quite in our English sense of “bored”: German has a word Weltschmerz meaning “world-weariness.” In Kurt’s case it’s Überweltschmerz. We have tried for over a month to get him to do another interview like several times before, but he keeps saying there’s nothing new to talk about.
Today we want to ask all of you—or at least those of you into logic and complexity—whether we can find something to pep Kurt up. Incidentally, we never call him Kurt.
Gödel is of course famous, among many other things, for proving that Peano Arithmetic (PA) cannot prove its own consistency—unless PA is already inconsistent. In the latter case, it would be able to prove everything; and this would imply that PA is useless.
This time we want to talk about whether PA might be proved consistent in weaker senses. The senses would escape the result of Gödel, which is usually called his Second Incompleteness Theorem. A hope is that they can be connected to open questions in complexity theory.
Recall that PA is the first order theory of arithmetic with induction. Actually all we say today could be generalized to many other theories, but to help us focus let’s discuss only PA.
The meaning of consistency for PA can be encoded as follows: Let

$\mathrm{Proof}_{\mathrm{PA}}(p, \varphi)$

mean that $p$ encodes a proof in PA of the statement $\varphi$. Formally this means that consistency of PA is

$\mathrm{Con}(\mathrm{PA}) \;\equiv\; \forall p\;\neg\mathrm{Proof}_{\mathrm{PA}}(p, \ulcorner 0=1 \urcorner).$
Most mathematicians believe that PA is consistent. The best argument for consistency is probably that the axioms of PA all seem “obvious.” That is they seem to conform to our intuition about arithmetic. Even the powerful induction schema—which pumps out one axiom for each applicable formula—says something that seems clear:
Mathematical induction proves that we can climb as high as we like on a ladder, by proving that we can climb onto the bottom rung (the basis) and that from each rung we can climb up to the next one (the induction).
This is from page 3 of the book Concrete Mathematics.
Yet not everyone believes that PA is consistent, and it follows that not everyone believes that PA is robustly useful. We recently covered the late Vladimir Voevodsky’s doubts. Ed Nelson in 2015 wrote a freely-available book titled simply Elements to argue that PA is inconsistent. This work is flawed, but it is interesting that a world-class mathematician was seriously interested in showing something that few working mathematicians believe. The work ends with an afterword by Sam Buss and Terry Tao, in which they say:
We of course believe that Peano arithmetic is consistent; thus we do not expect that Nelson’s project can be completed according to his plans. Nonetheless, there is much new in his papers that is of potential mathematical, philosophical and computational interest. For this reason, they are being posted to the arXiv. Two aspects of these papers seem particularly useful. The first aspect is the novel use of the “surprise examination” and Kolmogorov complexity; there is some possibility that similar techniques might lead to new separation results for fragments of arithmetic. The second aspect is Nelson’s automatic proof-checking via TeX and qea. This is highly interesting and provides a novel method of integrating human-readable proofs with computer verification of proofs.
Nelson’s criticism of PA is well summarized in this talk by Buss, while it was Tao who articulated the flaw in Nelson’s particular Kolmogorov complexity argument for inconsistency, which we also covered here.
Our idea is to examine consistency from a computer science viewpoint and use this to make a weaker notion: a notion that can be proved while avoiding the Gödel limit. The idea is the following:
Can we prove that PA is consistent at least for any proof that we are likely to ever see?
We can make this precise via the following simple notion, $\mathrm{Con}_{\mathrm{PA}}(n)$:

$\mathrm{Con}_{\mathrm{PA}}(n) \;\equiv\; \forall p\,\big(|p| \le n \,\rightarrow\, \neg\mathrm{Proof}_{\mathrm{PA}}(p, \ulcorner 0=1 \urcorner)\big).$

Note that we mean the length of the proof in symbols, not steps; we could alternately treat $p$ as a number and state the bounded quantifier as $p \le 2^n$.
Clearly for any $n$ we can determine whether or not $\mathrm{Con}_{\mathrm{PA}}(n)$ is true. Of course as $n$ grows the cost of checking all proofs of length at most $n$ in binary explodes. Yet there is no Gödel limit on this checking. We want to understand the following question:

Can we check that $\mathrm{Con}_{\mathrm{PA}}(n)$ is true for large $n$?
Let’s take a look at this next.
Given a Boolean string $p$ of length $n$, we can check that it encodes a correct proof from PA in time nearly linear in $n$. We need only check that each step is either an instance of an axiom or follows from the usual rules of inference of first-order logic. For the rest suppose we assume that this checking can actually be done in linear time: we are throwing out log terms, which will just help us avoid technically more complex statements. This assumption will not change anything in an important way.

This assumption shows that $\mathrm{Con}_{\mathrm{PA}}(n)$ can be checked in time on the order of $2^n$. This means that we cannot hope to check this for even modest size $n$’s. But can we check it much faster? If we could check, for example, that $\mathrm{Con}_{\mathrm{PA}}(1{,}000{,}000)$ is true, then we would know that all proofs of at most 1,000,000 bits do not lead to contradictions. This covers all proofs that most of us will ever write down or even read. Note that the following is true:
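To make the cost concrete, here is a toy sketch of the brute-force check. This is our illustration, not a real PA verifier: the stand-in predicate `checks_out` would in reality be a linear-time verifier that a binary string encodes a PA proof of a contradiction.

```python
# Toy sketch of checking Con_PA(n) by brute force.  "checks_out" is a
# placeholder for a real linear-time proof verifier; here we pretend
# no string of these lengths encodes a proof of "0 = 1".
from itertools import product

def checks_out(candidate: str) -> bool:
    # Placeholder predicate: a real checker would verify each step is
    # an axiom instance or follows by a rule of inference.
    return False

def con_up_to(n: int) -> bool:
    """True if no binary string of length <= n passes the checker.
    The cost is Theta(2^n): we must touch every candidate string."""
    examined = 0
    for length in range(1, n + 1):
        for bits in product("01", repeat=length):
            examined += 1
            if checks_out("".join(bits)):
                return False
    assert examined == 2 ** (n + 1) - 2  # all nonempty strings up to length n
    return True

print(con_up_to(10))  # True, after examining 2046 strings
```

Even this toy loop makes the explosion vivid: each unit increase in $n$ doubles the work, which is why $n = 1{,}000{,}000$ is hopeless by enumeration.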
Theorem 1 PA can prove $\mathrm{Con}_{\mathrm{PA}}(n)$ for any fixed $n$.
Define $f(n)$ as the length of the shortest proof in PA of $\mathrm{Con}_{\mathrm{PA}}(n)$. Clearly, $f(n)$ is well defined—whether PA is consistent or not. A question is how slowly can $f(n)$ grow? Or looking at it another way, can there be short proofs that $\mathrm{Con}_{\mathrm{PA}}(n)$ is true? We can also ask this when $n$ is given as a function of another parameter, such as an exponential.
Note that one can prove theorems like this:
Theorem 2 If $\mathrm{P} = \mathrm{NP}$ then we can check $\mathrm{Con}_{\mathrm{PA}}(n)$ in time polynomial in $n$.
This would allow us, at least in principle, to check huge length proofs. There are other complexity open problems that would allow us to even check much larger proof lengths.
We want, however, to be very concrete. We really want to know what happens at concrete values, more than what happens for large $n$. Incidentally, Gödel used the same number 1,000,000 as a factor in his brief and cryptic paper (see this review) titled “On the Lengths of Proofs.” What he implicitly did there was apply the same method as in his incompleteness theorems to create formulas expressing,
There does not exist a PA proof of this very statement of length at most $n$.
He meant length to be the number of lines, but Wikipedia’s article on it speaks of symbol length. Interpreted either way, the truth of the formulas is obvious: they don’t have proofs in PA of length at most $n$. They have proofs of exponential length by exhaustive enumeration as above. If $\mathrm{P} = \mathrm{NP}$ then they have shorter proofs asymptotically—indeed this was the context of Gödel’s insight about $\mathrm{P}$ versus $\mathrm{NP}$ in his famous 1956 letter to John von Neumann. But again we want to be concrete.
Note that in a “meta” sense we already proved the statement for $n = 1{,}000{,}000$. We know how Gödel’s construction works in general from reading only a finite piece of his proof. Then as we said its truth is obvious. It takes many symbols to write the statement itself down, but we can ignore that. So this is really a finite proof, certainly well under 1,000,000 symbols.
This is exactly what Gödel meant about the length-saving effect. The rub however is that our “meta” proof is assuming the consistency of PA—that is, the shorter proof is in the stronger theory $\mathrm{PA} + \mathrm{Con}(\mathrm{PA})$. Gödel asserted (without proof) that the same effect can always be had by progressing to the next higher order of logic.
But going back to our sentences $\mathrm{Con}_{\mathrm{PA}}(n)$, clearly assuming $\mathrm{Con}(\mathrm{PA})$ is silly and we want to stick with first-order logic. We can consider changing the rules of PA but not that way. So our query becomes:
Is there a formal logic $L$, one that avoids the criticisms of PA referenced above and does not obviously entail the consistency of PA, such that $L$ can economically prove $\mathrm{Con}_{\mathrm{PA}}(n)$?
We can alternatively talk about programs that construct proofs and shift the question to the complexity—concrete or asymptotic—of verifying that such a program is correct. This could lead into questions about the (resource-bounded) Kolmogorov complexity of proofs. Assuming $\mathrm{Con}_{\mathrm{PA}}(1{,}000{,}000)$ really is true, the simple enumeration argument describes a proof with Kolmogorov complexity well under 1,000,000 symbols—but more than 1,000,000 steps would be needed to expand it. The verification angle, however, may even apply in cases of “insanely long proofs.” Other formal or “fromal” approaches might be considered.
There was a flurry of work in this area for a decade-plus after Sam Buss’s innovative work in the 1980s connecting complexity questions to proofs in bounded arithmetics—that is with restrictions on the PA induction axioms or the underlying logic. We talked about his work here and here. For connections to lengths of proofs Sam himself has written two nice surveys and here is another by Pavel Pudlák. The connections extend to the Natural Proofs barrier against circuit lower bounds.
But as with many approaches to core questions in complexity, progress seems to have slowed. Perhaps it is because we haven’t been following as closely. So we are asking for news and opinions on what is important. And this is also why we are talking here about changing the questions and the rules of the game.
We have turned up some recent work on questions like ours that changes the rules. Martin Fischer wrote a 2014 paper titled “Truth and Speed-Up” and another titled “The Expressive Power of Truth.” The object of the latter is to find “natural truth theories which satisfy both the demand of semantical conservativeness and the demand of adequately extending the expressive power of our language.” The former shows that other theories, including one called “$\mathrm{PT}_{tot}$,” while syntactically but not semantically conservative, give speedups in proofs.
We are not sure how to assess these results. The theory’s name stands for “Positive Truth with internal induction for total formulas” and it is studied further in this 2017 paper. The emphasis seems however to be philosophical, relating to the effect of allowing truth to be a predicate. We don’t know how to connect these ideas to complexity questions but they do show scope for further action.
There are several open problems. Can we show that complexity statements like $\mathrm{P} = \mathrm{NP}$ are equivalent to results about how far out we can check consistency? That is, if we could check that PA has no short contradictions, does this imply anything about complexity theory?
Another class of questions is: Is it useful to know that $\mathrm{Con}_{\mathrm{PA}}(n)$ is true for large concrete $n$? Does the fact that there is no short contradiction help make one believe in PA as a useful tool? I am not sure what to make of this. What do you think? What would Gödel think?
To give a Hilldale Lecture and learn about fairness and dichotomies
UB CSE50 anniversary source |
Jin-Yi Cai was kind enough to help get me, Dick, invited last month to give the Hilldale Lecture in the Physical Sciences for 2017-2018. These lectures are held at The University of Wisconsin-Madison and are supported by the Hilldale Foundation. The lectures started in 1973-1974, which is about the time I started at Yale University—my first faculty appointment.
Today Ken and I wish to talk about my recent visit, discuss new ideas of algorithmic fairness, and then appreciate something about Jin-Yi’s work on “dichotomies” between polynomial time and $\#\mathrm{P}$-completeness.
The Hilldale Lectures have four tracks: Arts & Humanities and Physical, Biological, and Social Sciences. Not all have a speaker each year. The Arts & Humanities speaker was Yves Citton of the University of Paris last April, and in Social Sciences, Peter Bearman of Columbia spoke on Sept. 28, three weeks before I. Last year’s speaker in the Physical Sciences track was Frank Shu of Berkeley and UCSD on the economics of climate change. Before him came my former colleagues Peter Sarnak and Bill Cook. Dick Karp was invited in 2004, and mathematicians Barry Mazur and Persi Diaconis came between him and Cook.
I am delighted and honored to be in all this company. We all may not seem to have much to do with the physical sciences but let’s see. I spoke on a subject for another time—let this post be about my hosts.
The other highlight of my visit was meeting with the faculty of the CS department and also with the graduate students. I hope they enjoyed our discussions as much as I did.
One topic that came up multiple times is the notion of fair algorithms. This is a relatively new notion and is being studied by several researchers at Wisconsin. The area has its own blog. The authors of that blog (one I know well uses his other blog’s name as his name there—are we to become blogs?) also wrote a paper titled, “On the (Im)Possibility of Fairness,” whose abstract we quote:
What does it mean for an algorithm to be fair? Different papers use different notions of algorithmic fairness, and although these appear internally consistent, they also seem mutually incompatible. We present a mathematical setting in which the distinctions in previous papers can be made formal. In addition to characterizing the spaces of inputs (the “observed” space) and outputs (the “decision” space), we introduce the notion of a construct space: a space that captures unobservable, but meaningful variables for the prediction. We show that in order to prove desirable properties of the entire decision-making process, different mechanisms for fairness require different assumptions about the nature of the mapping from construct space to decision space. The results in this paper imply that future treatments of algorithmic fairness should more explicitly state assumptions about the relationship between constructs and observations.
There is a notion of fairness in distributed algorithms but this is different. The former is about the allocation of system resources so that all tasks receive due processing attention. The latter has to do with due process in social decision making where algorithmic models have taken the lead. Titles of academic papers cited in a paper by three of the people I met in Madison and someone from Microsoft (see also their latest from OOPSLA 2017) speak to why the subject has arisen:
Also among the paper’s 29 references are newspaper and magazine articles whose titles state the issues with less academic reserve: “Websites vary prices, deals based on users’ information”; “Who do you blame when an algorithm gets you fired?”; “Machine bias: There’s software used across the country to predict future criminals. And it’s biased against blacks”; “The dangers of letting algorithms enforce policy.” Even a May 2014 statement by the Obama White House is cited.
Yet also among the references are papers familiar in theory: “Satisfiability modulo theories”; “Complexity of polytope volume computation” (by Leonid Khachiyan no less), “On the complexity of computing the volume of a polyhedron”; “Hyperproperties” (by Michael Clarkson and Fred Schneider), “On probabilistic inference by weighted model counting.” What’s going on?
What’s going on can be classed as a meta-example of the subject’s own purpose:
How does one formalize a bias-combating concept such as fairness without instilling the very kind of bias one is trying to combat?
We all can see the direction of bias in the above references. You might think that framing concepts to apply bias in the other direction might be OK but there’s a difference. Bias in a measuring apparatus is more ingrained than bias in results. What we want to do—as scientists—is to formulate criteria that are framed in terms apart from those of the applications in a simple, neutral, and natural manner. Then we hope the resulting formal definition distinguishes the outcomes we desire from those we do not and stays robust and consistent in its applications.
This is the debate—at ‘meta’ level—that Ken and I see underlying the two papers we’ve highlighted above. We blogged about Kenneth Arrow’s discovery of the impossibility of formalizing a desirable “fairness” notion for voting systems. The blog guys don’t find such a stark impossibility theorem but they say that to avoid issues with analyzing inputs and outcomes, one has to attend also to some kind of “hidden variables.” The paper by Madison people tries to ground a framework in formal methods for program verification, which it connects to probabilistic inference via polytope volume computations.
Many other ingredients from theory can be involved. The basic idea is determining sensitivity of outcomes to various facets of the inputs. The inputs are weighted for relevance to an objective. Fairness is judged according to how well sensitivity corresponds to relevance and also to how the distribution of subjects receiving favorable decisions breaks according to low-weight factors such as gender. Exceptions may be made by encoding some relations as indelible constraints—the Madison plus Microsoft paper gives as an example that a Catholic priest must be male.
Thus we see Boolean sensitivity measures, ideas of juntas and dictators, constraint satisfaction problems, optimization over polytopes—lots of things I’ve known and sometimes studied in less-particular contexts. My Madison hosts brought up how Gaussian distributions are robust for this analysis because they have several invariance properties including under rotations of rectangles. This recent Georgetown thesis mixes in even more theory ideas. The meta-question is:
Can all these formal ingredients combine to yield the desired outcomes in ways whose scientific simplicity and naturalness promote confidence in them?
Thus wading in with theory to a vast social area like this strikes us as a trial of “The Formal Method.” Well, there is the Hilldale Social Sciences track…
Did “hidden variables” bring quantum to your mind? We are going there next, with Ken writing now.
We covered Jin-Yi’s work in 2014 and 2012 and 2009. So you could say in 2017 we’re “due” but we’ll take time for more-topical remarks. All of these were on dichotomy theorems. For a wide class of counting problems, a dichotomy theorem says that every problem in the class either belongs to polynomial time or is $\#\mathrm{P}$-complete. There are no in-between cases—that’s the meaning of dichotomy.
Jin-Yi’s answer last month to a question that arose between Dick and me brought home to us both how wonderfully penetrating and hair-trigger this work is. Dick had added some contributions to the paper covered in the 2009 post and was included on the paper’s final 2010 workshop version. Among its results is the beautiful theorem highlighted at the end of that post:
Theorem 1 There is an algorithm that, given any $m$ and a formula for a quadratic polynomial $f(x_1,\dots,x_n)$ over $\mathbb{Z}_m$, computes the exponential sum

$Z(f) \;=\; \sum_{x \in \mathbb{Z}_m^n} \omega^{f(x)} \;=\; \sum_{j=0}^{m-1} N_j\,\omega^j$

exactly in time that is polynomial in both $n$ and $\log m$. Here $\omega = e^{2\pi i/m}$ and $N_j$ means the number of arguments on which $f$ takes the value $j$ modulo $m$.
That the time is polynomial in $\log m$, not $m$, is magic. We can further compute the individual weights $N_j$ by considering also the resulting expressions for $Z(cf)$, $c = 1,\dots,m-1$. Together with $\sum_j N_j = m^n$ they give us $m$ equations in the $m$ unknowns $N_j$ in the form of a Vandermonde system, which is always solvable. Solving that takes time polynomial in $m$, though, and we know no faster way of computing any given $N_j$.
When $m$ is fixed, however, polynomial in $n$ is all we need to say. So the upshot is that for any fixed modulus, solution counting for polynomials of degree $2$ is in $\mathrm{P}$. Andrzej Ehrenfeucht and Marek Karpinski proved this modulo primes and also that the solution-counting problem for degree $3$ is $\#\mathrm{P}$-complete even for $m = 2$. So the flip from $\mathrm{P}$ to $\#\mathrm{P}$-complete when the degree steps from $2$ to $3$ is an older instance of dichotomy. The newer one, however, is for polynomials of the same degree $2$.
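As a sanity check (by brute force, not the polynomial-time algorithm of Theorem 1), one can tally the counts $N_j$ for a small quadratic over $\mathbb{Z}_3$ and confirm that $\sum_j N_j\,\omega^j$ matches the direct exponential sum. The polynomial below is our arbitrary choice:

```python
# Brute-force tally of N_j for a quadratic over Z_3 and a check that
# Z(f) = sum_j N_j * omega^j equals the direct exponential sum.
# Theorem 1's algorithm avoids this m^n enumeration entirely.
import cmath

m = 3
omega = cmath.exp(2j * cmath.pi / m)

def f(x, y):                     # an arbitrary quadratic over Z_3
    return (x * x + 2 * x * y + y) % m

N = [0] * m                      # N[j] = #{(x, y) : f(x, y) = j mod m}
for x in range(m):
    for y in range(m):
        N[f(x, y)] += 1

Z_from_counts = sum(N[j] * omega ** j for j in range(m))
Z_direct = sum(omega ** f(x, y) for x in range(m) for y in range(m))
assert abs(Z_from_counts - Z_direct) < 1e-9
assert sum(N) == m ** 2          # the c = 0 equation of the Vandermonde system
print(N)  # [2, 5, 2]
```

Evaluating $Z(cf)$ for the other values of $c$ would give the remaining Vandermonde equations that pin down each $N_j$ individually.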
I wrote a post five years ago on his joint work with Amlan Chakrabarti for reducing the simulation of quantum circuits to counting solutions of polynomials over $\mathbb{Z}_m$. One motive is to identify which subclasses of quantum circuits might yield tractable cases of counting. The classic—pun intended—case is the theorem that all circuits of so-called Clifford gates can be simulated in classical polynomial time (not even randomized). I observed that such circuits yield polynomials over $\mathbb{Z}_4$ that are sums of terms of the form

$2x_i x_j \quad\text{and}\quad x_i^2.$
These terms are invariant on replacing $x_i$ by $x_i + 2$ or $x_j$ by $x_j + 2$ modulo $4$. Hence for such $f$ there is an exactly $2^n$-to-$1$ correspondence between solutions in $\mathbb{Z}_4^n$ and those in $\{0,1\}^n$. Since counting the former is in $\mathrm{P}$ by Theorem 1, the theorem for Clifford gates follows.
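A quick numeric check of this $2^n$-to-$1$ correspondence, using an illustrative polynomial built from such terms (our choice, not one from the papers):

```python
# Shifting any variable by 2 mod 4 leaves terms 2*x*y and x*x
# unchanged mod 4, so each binary point lifts to exactly 2^n points
# of Z_4^n taking the same value.
from itertools import product

def f(x, y):
    return (2 * x * y + x * x) % 4

count_z4 = [0] * 4               # value counts over all of Z_4^2
for x, y in product(range(4), repeat=2):
    count_z4[f(x, y)] += 1

count_bin = [0] * 4              # value counts over {0,1}^2 only
for x, y in product(range(2), repeat=2):
    count_bin[f(x, y)] += 1

n = 2
assert count_z4 == [2 ** n * c for c in count_bin]
print(count_bin, count_z4)  # [2, 1, 0, 1] [8, 4, 0, 4]
```

So dividing the $\mathbb{Z}_4$ counts by $2^n$ recovers the binary counts, which is exactly what the Clifford simulation needs.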
Adding any non-Clifford gate makes a gate set that is universal—i.e., has the full power of $\mathsf{BQP}$. The gates I thought of at the time of my post all bumped the degree up from $2$ to $3$ or more. A related but different representation by David Bacon, Wim van Dam, and Alexander Russell gives a dichotomy of linear versus higher degree. The controlled-phase gate, however, is non-Clifford but in my scheme produces polynomials as sums of terms of the form

$x_i x_j.$

Those are quadratic too, so Theorem 1 counts all the solutions in polynomial time. Does this make $\mathrm{P} = \mathsf{BQP}$? The hitch is that quantum needs counting binary solutions—and having the coefficient $1$ rather than $2$ defeats the above exact correspondence.
I thought maybe the counting problem for quadratic-and-binary could be intermediate—perhaps at the level of $\mathsf{BQP}$ itself. But Jin-Yi came right back with the answer that his dichotomy cuts right there: this 2014 paper with his students Pinyan Lu and Mingji Xia has a general framework for CSPs that drops down to say the problem is $\#\mathrm{P}$-complete. A more-recent paper of his with Heng Guo and Tyson Williams lays out the connection to Clifford gates specifically, proving an equivalence to the condition called “affine” in his framework which renders counting tractable. Thus the state of play is:
Thus the difference between easy and general quantum circuits hangs on that ‘$2$’—a coefficient, not an exponent—as does factoring (not?) being in $\mathrm{P}$. Of course this doesn’t mean quantum circuits are $\#\mathrm{P}$-complete—they are generally believed not even to be $\mathsf{NP}$-hard—but a scheme whose sums have terms of the form $x_i x_j$ as well as $2x_i x_j$ captures ostensibly more than $\mathsf{BQP}$ needs.
Might there be some other easily-identified structural properties of the polynomials produced (say) by circuits of Hadamard and controlled-phase gates that make the counting problem intermediate between $\mathrm{P}$ and $\#\mathrm{P}$-complete, if not exactly capturing $\mathsf{BQP}$? Well, the dichotomy grows finer and richer and stronger with each new paper by Jin-Yi and his group. This feeds in to Jin-Yi’s most subtle argument for believing $\mathrm{P} \neq \mathsf{NP}$ as we said it here, echoing reasons expounded by Scott Aaronson and others: if the classes were equal there would be no evident basis for our experiencing such fine and sharp distinctions.
Trying to make up for blog pause while we were busy with a certain seasonal event, we offer several open problems:
[“stabilizer theorem”->”theorem for Clifford gates”; added links to OOPSLA 2017 paper and May 2017 Cai-Guo-Williams paper]
New York Times obituary source |
Lotfi Zadeh had a long and amazing life in academics and the real world. He passed away last month, aged 96.
Today Ken and I try to convey the engineering roots of his work. Then we relate some personal stories.
Zadeh was a Fellow of the ACM, the IEEE, the AAAI, and the AAAS and a member of the NAE. But besides this alphabet soup of US-based academies, we are impressed with the one he co-founded: the Eurasian Academy. His founding partners were a historian, a neurosurgeon, a music composer, and a mathematician. They recently elected three other members: an actress-screenwriter-director, an actor-director-writer, and a physicist.
In any alphabet of his life, one letter stands out: the letter Z. The term “Fuzzy Set” has two of them. But Zadeh’s first widely noted work goes by just the bare letter.
Pierre-Simon Laplace discovered a relative of the Fourier transform that has similarly motivated applications and often better behavior. When applied to the density function $f$ of a random variable on $[0,\infty)$ or all of $\mathbb{R}$, it has the form

$F(s) \;=\; \int e^{-st} f(t)\,dt.$

Here $s$ can be a complex number. The function $F$ is holomorphic provided we are working on $[0,\infty)$. A neat trick is that we can jump from $F$ to the cumulative distribution $\Phi$ by

$\mathcal{L}\{\Phi\}(s) \;=\; \frac{F(s)}{s}.$
Can we get such nice properties for a discrete random variable on the integers? Zadeh’s advisor at Columbia, John Ragazzini, led him in showing the power of defining

$F(z) \;=\; \sum_{n} f[n]\, z^{-n},$

where again $z$ can be any complex number, and the domain of $f$ and the sum can be $\mathbb{N}$ or $\mathbb{Z}$. We note that the $Z$-transform is often defined as a function of $z^{-1}$, that a similar sign issue was discussed in reviewing the 1952 Ragazzini-Zadeh paper, and we’ve switched $z$ versus $z^{-1}$ from Wikipedia’s article on it to make it look more like the Laplace transform. With a positive exponent, $\sum_n f[n] z^n$ is the probability generating function of $f$.
How useful is this? Much of what we can say in a short space is the same as with Fourier: If we form the convolution

$(f * g)[n] \;=\; \sum_{k} f[k]\, g[n-k],$

then its $Z$-transform is just the product function:

$F_{f*g}(z) \;=\; F_f(z)\, F_g(z).$
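For finitely supported sequences the rule is easy to verify numerically: the $Z$-transform of a finite sequence is a polynomial in $z^{-1}$, and multiplying such polynomials is exactly convolution of their coefficient lists. A minimal sketch:

```python
# Numeric check that convolution becomes a product under the
# Z-transform, evaluated at an arbitrary point z.
def convolve(f, g):
    out = [0.0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out

def z_transform(seq, z):
    # F(z) = sum_n seq[n] * z^{-n} for a finitely supported sequence
    return sum(a * z ** (-n) for n, a in enumerate(seq))

f, g, z = [1, 2, 3], [4, 5], 1.7
lhs = z_transform(convolve(f, g), z)
rhs = z_transform(f, z) * z_transform(g, z)
assert abs(lhs - rhs) < 1e-9

# the delta sequence transforms to the constant 1
assert z_transform([1], z) == 1
```

The same check works at any nonzero $z$, which is the point: the identity holds as functions, not just at one value.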
Using products this way makes convolutions easier to work with. Many hard-to-handle functions become nicer under their $Z$-transforms. The Dirac delta function—$\delta[n] = 1$ if $n = 0$ and $0$ otherwise—is strange at face value, though it can be understood as the random variable whose outcome is always $0$. Under the $Z$-transform, however,

$F_\delta(z) \;=\; 1.$

Nothing can be nicer than the constant $1$. For explanation of where the $Z$-transform is more general than the discrete Fourier transform and relatives we defer to this beautiful page. All this grew out of ideas in the 1940s by others including Witold Hurewicz—another z—but Zadeh’s joint paper had the greatest influence in signal processing.
The art of the $Z$-transform is continuous functions forming a well-behaved nimbus around certain discrete entities. Suppose we try to do this for every discrete concept? Begin with the idea of a set $S$, namely a subset of some universe $U$. Instead, let us think of a fuzzy set $A$ as a function

$\mu_A : U \to [0,1],$

where the real number $\mu_A(x)$ is called the grade of membership of $x$ in $A$. The original set $S$ is the case $\mu(x) = 1$ if $x \in S$ and $\mu(x) = 0$ otherwise. The point is that we are now free to consider other functions that approximate this and are smoother and nicer to work with. We can consider whole ensembles of such functions.
From fuzzy sets it is a short step to fuzzy logic. This has an antecedent: the infinite-valued logic of Jan Łukasiewicz and others. A statement may have a truth value between 0 and 1. A common choice is to represent the value by a logistic curve of a main parameter. Here is a somewhat distorted curve for the statement “X is wealthy” parameterized by the net worth of X:
“Simulating Complexity” blog source |
The point for us is that logistic curves are natural to work with when modeling such predicates in a larger system. Here is a pertinent recent example for image processing. Further points are that a logical 0-1 assignment to “wealthy” would have an artificially sharp distinction somewhere and that the logistic curves are more faithful to neural-net models of how we think.
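A minimal sketch of such a membership function, with an illustrative midpoint and scale of our own choosing (nothing here is from Zadeh's papers):

```python
# A logistic membership grade for "X is wealthy", parameterized by
# net worth as in the curve above.  Midpoint and scale are
# illustrative choices; a crisp set is the special case where the
# grade is exactly 0 or 1.
import math

def wealthy_grade(net_worth, midpoint=1_000_000, scale=250_000):
    """Grade of membership in the fuzzy set 'wealthy', in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-(net_worth - midpoint) / scale))

assert 0.0 < wealthy_grade(0) < 0.05                 # clearly not wealthy
assert abs(wealthy_grade(1_000_000) - 0.5) < 1e-12   # borderline
assert wealthy_grade(5_000_000) > 0.95               # clearly wealthy
```

The smooth transition around the midpoint is exactly what a 0-1 assignment to “wealthy” cannot express: there is no single net worth at which the predicate suddenly flips.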
Zadeh’s original 1965 paper is one of the most cited science papers of all time. He confessed that:
“I knew that just by choosing the label ‘fuzzy’ I was going to find myself in the midst of a controversy… If it weren’t called fuzzy logic, there probably wouldn’t be articles on it on the front page of the New York Times. So let us say it has a certain publicity value. Of course, many people don’t like that publicity value, and when they see it in the New York Times, it doesn’t sit well with them.”
That controversy was real—see the next section. Zadeh in an acceptance speech for the 1989 Honda Foundation prize said
“The concept of a fuzzy set has had an upsetting effect on the established order.”
I (Dick) never understood why this generalization of sets created such push-back. Stuart Russell, a Berkeley professor who worked next door to Mr. Zadeh for many years, noted:
He always took criticism as a compliment. It meant that people were considering what he had to say.
The impact of his work has been recognized by a posthumous “Golden Goose” Award. The award’s name counters the stigma of the “Golden Fleece” awards given out in 1975–1988 by US Senator William Proxmire in half-jest to federally-funded research projects he deemed frivolous and wasteful. Zadeh drew attention from Proxmire as a potential “Golden Fleece” awardee. The Golden Goose citation, however, describes the “Clear Impact,” especially as seen by engineering-minded Japanese:
Part of this interest came from the fact that ‘fuzzy’ was not a pejorative term in Japanese, but instead a neutral or even positive one. Researchers there took his idea and ran, creating conferences and journals focused on making advances in fuzzy logic. To this day, the only country with more patents on fuzzy ideas and concepts than the United States is Japan.
In 1986, the first commercial application of fuzzy logic hit the shelves in Japan: a fuzzy shower head. Using fuzzy concepts of hot, cold, high pressure, low pressure, and others, the shower head could use fuzzy logic to control showers across the country. Within a few years, the market was overflowing with fuzzy consumer products. Vacuum cleaners, rice cookers, air conditioning systems, microwaves, everything was moving to fuzzy control. Even the entire subway system of Sendai in Japan was built with fuzzy logic controlling the motion of the trains.
Way back in the first month of this blog, I (Dick) quoted the following remarks by William Kahan. I was in the audience for Zadeh’s lecture too but let’s let Kahan speak:
My two favorite stories about him concern his tremendous candor. The first is about his ideas on “fuzzy sets” and the second is on “who should get tenure.” I will only tell the first one—to protect the innocent and the guilty. When I first arrived at the Computer Science Department at Berkeley, the faculty decided to have a new series of lectures that fall. The plan was to have short lectures by each faculty of the department—in this way new graduate students would learn each faculty’s research area.
One day Professor Zadeh was presenting his area of research—an area that he created called “fuzzy sets.” Fuzzy sets were then and still are today a controversial area. Some researchers do not think much of this area. However, the area is immensely popular to many others. There are countless conferences, books, and journals devoted completely to this area. Kahan was in the audience while Zadeh was speaking. Finally, at some point Kahan could take it no more. He stood up and Zadeh asked him what his question was. Kahan stated in the most eloquent manner that it might be okay to work on fuzzy sets in the privacy of your own basement (after all this was Berkeley), but there was no excuse for exposing young minds to this “stuff”—his term was stronger. We all were shocked. For a few seconds no one spoke. I wondered how in the world Zadeh could respond. Zadeh finally said, “thank you for your comments,” and went on with the rest of the talk, as if nothing had happened. The next year the faculty talks were cancelled.
I met Zadeh once, when he was the featured speaker at the 6th International Conference on Computing and Information (ICCI 1994), which was held at Trent University in Peterborough, Ontario, May 26–28, 1994. Jie Wang and I drove there from STOC which was held in Montreal that year. This small conference—not to be confused with ones having similar names and acronyms—lasted just a few more years. It is hard to find any information on the 1994 meeting now—just a few paper citations—and I have found no proof on the Internet that Zadeh was there. But he was—in a non-fuzzy but decidedly freezy setting.
There was a welcoming reception in the late afternoon of the 25th. It was slated to be outside in a wooded park on the university grounds. It was late May after all. But it was cold. I’ve known cold days in May in Buffalo, but none like that—biting wind and icy sleet. Only twenty or so of the registrants braved the weather. There was fortunately a round wooden structure, covered and enclosed and large enough to shelter us, but with no central heating. Instead it had a coal heat stove. We huddled around on chairs and stools and the part of the circular wall bench near the stove. Although over two hours of nominal daylight remained, the dark clouds and scant windows made it pitch night inside. If I recall correctly, the original intent of a cookout was shelved and replaced by a bulk order of sandwiches and potato chips and other picnic fare.
Nearest the stove sat the 73-year-old Zadeh wrapped in blankets. His face glowed orange as he regaled us in good humor with stories. I don’t think I kept any record of what he said. We felt in the presence of a great man but under surreal conditions—accentuated for Jie and me by our having had a hot lunch in the downtown Montreal hotel for STOC. Somewhere I do have notes of the keynote he gave the next morning before departing—in a modern and heated university lecture room—but I have not unpacked my boxes of old notebooks since my department’s move to a new building six years ago.
His birthplace Baku has been on my mind because I’ve recently read Thomas Reiss’s The Orientalist, a biography of Lev Nussimbaum, who wrote under the pseudonyms Essad Bey and Kurban Said. Nussimbaum had at least a hand in the writing and production of the 1937 romance Ali and Nino, which is considered the national novel of Azerbaijan. Baku juts into the Caspian Sea and calls itself the easternmost European city as demarked by the Urals and Asia Minor extended east. I wish it had occurred to me to ask about his upbringing and the history between the wars.
We convey our profound appreciation and regrets to his family and friends.
Two more tragic losses coming before a greater tragedy
Composite of crops from src1, src2 |
Michael Cohen and Vladimir Voevodsky were in different stages of their careers. Cohen was a graduate student at MIT and was visiting the Simons Institute in Berkeley. He passed away suddenly a week ago Monday on a day he was scheduled to give a talk. Voevodsky won a Fields Medal in 2002 and was a professor at the Institute for Advanced Study in Princeton. He passed away Saturday, also unexpectedly.
Today we join those grieving both losses.
We are writing this amid the greater horror in Las Vegas. Dick and I speak our condolences and more, but the condolences that two of us can give seem to fade—they do not “scale up.” Hence we feel that the best we can do is talk about Cohen’s and Voevodsky’s roles in our scientific communities and some of what they contributed. That is a gesture of peace and serenity. It may not overcome the darkness, but something like it seems needed so that we all might do so.
Michael Cohen had already worked with a wide variety of people in over twenty joint papers. He had two papers all by himself: a paper at SODA 2016 titled, “Nearly Tight Oblivious Subspace Embeddings by Trace Inequalities,” and a paper at FOCS 2016 titled, “Ramanujan Graphs in Polynomial Time.”
A common theme through much of this work was wizardry with special kinds of matrices. They included Laplacian matrices, in which every column sums to zero and only the diagonal entries can be positive. You can get one from a directed graph by negating the entries of its adjacency matrix and putting the in-degrees on the diagonal. One can further demand that the rows sum to zero, which happens for our graph if each node’s in-degree equals its out-degree. This is automatic for undirected graphs. As noted in this paper:
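The Laplacian recipe just described is easy to sketch in code. Here is a minimal illustration (my own, not from the paper), using a small Eulerian digraph so that the rows sum to zero as well as the columns:

```python
import numpy as np

# Directed graph on 4 nodes with edges 0->1, 1->2, 2->0, 2->3, 3->2.
# A[i, j] = 1 when there is an edge from node i to node j.
A = np.array([
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
])

# Laplacian: negate the adjacency matrix and put the in-degrees
# (the column sums of A) on the diagonal. Off-diagonal entries are
# then <= 0, so only the diagonal entries can be positive.
in_degrees = A.sum(axis=0)
L = np.diag(in_degrees) - A

# Every column sums to zero by construction.
assert np.all(L.sum(axis=0) == 0)
# Here each node's in-degree equals its out-degree (the graph is
# Eulerian), so the rows sum to zero too.
assert np.all(L.sum(axis=1) == 0)
```

For an undirected graph the same recipe applied to the symmetric adjacency matrix makes the rows sum to zero automatically.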
While these recent algorithmic approaches have been very successful at obtaining algorithms running in close to linear time for undirected graphs, the directed case has conspicuously lagged its undirected counterpart. With a small number of exceptions involving graphs with particularly nice properties and a line of research in using Laplacian system solvers inside interior point methods for linear programming […], the results in this line of research have centered almost entirely on the spectral theory of undirected graphs.
The paper, titled “Faster Algorithms for Computing the Stationary Distribution, Simulating Random Walks, and More,” was joint with Jonathan Kelner, John Peebles, and Adrian Vladu of MIT, Aaron Sidford of Stanford, and Richard Peng of Georgia Tech, and also came out at FOCS 2016. In the case of symmetric matrices L, needing only that L = L^T, he was part of a bigger team including Peng and Gary Miller of CMU that found the best-known time for solving Lx = b. That paper came out at STOC 2014.
Thus from early on he was working with a great many people in the community. This has been noted in tribute posts by Scott Aaronson, by Sébastien Bubeck, by Luca Trevisan, by former colleagues at Microsoft Research where Cohen spent this past summer, and by Lance Fortnow. The post by Scott includes communications from Cohen’s parents and information about memorials and donations.
We’ll talk about Cohen’s paper on Ramanujan graphs in a train of thought that will lead into aspects of Voevodsky’s work. Of course we know Srinivasa Ramanujan was a brilliant Indian mathematician who also died tragically young.
In mathematics we sometimes prove the existence of objects without knowing how to construct them. Sometimes we can prove that a random object works. This is often helpful, but one downside comes from cases where we would want different people, given the same problem parameters, to obtain the same object. Randomized algorithms do not usually have a single output that is arrived at with high probability. What we really want is an algorithm that constructs the object.
This has been the story for a long time with expander graphs. They were proved to exist long ago via the probabilistic method. The zig-zag product was a watershed in constructing some kinds of them. The goal is to get these objects constructively with the same parameters or close to them.
A Ramanujan graph is a particular kind of expander with a maximum dose of the spectral-gap condition for expansion. The adjacency matrix of a d-regular graph has d as its largest eigenvalue. It cannot have an eigenvalue less than -d, and -d itself occurs if and only if the graph is bipartite. The graph is Ramanujan if all other eigenvalues have absolute value at most 2√(d-1). This creates a quadratic spectral gap between d and the next-largest eigenvalue, and this is asymptotically the largest possible.
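As a sanity check of the definition (this is only a checker, nothing like Cohen's construction), one can compute the spectrum of a d-regular graph and test the eigenvalue bound. The complete graph on d+1 nodes is d-regular with nontrivial eigenvalues all equal to -1, so it passes easily:

```python
import numpy as np

def is_ramanujan(A, d):
    """Test whether a d-regular graph with adjacency matrix A is Ramanujan:
    every eigenvalue other than +/-d has absolute value at most 2*sqrt(d-1)."""
    eigs = np.linalg.eigvalsh(np.asarray(A, dtype=float))
    bound = 2 * np.sqrt(d - 1)
    # Discard the trivial eigenvalues +/-d, then check the bound.
    nontrivial = [x for x in eigs if not np.isclose(abs(x), d)]
    return all(abs(x) <= bound + 1e-9 for x in nontrivial)

# K_5 is 4-regular; its eigenvalues are 4 and -1 (multiplicity 4),
# and 1 <= 2*sqrt(3), so it is Ramanujan.
K5 = np.ones((5, 5)) - np.eye(5)
assert is_ramanujan(K5, 4)
```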
Again, a randomly chosen d-regular n-node graph will almost certainly be a Ramanujan graph, for any fixed d and nontrivial n. Adam Marcus, Dan Spielman, and Nikhil Srivastava (MSS) proved in 2013 that such graphs exist for all d and n, even when required to be bipartite. But can we build one for any d and n? This was not known in deterministic polynomial time until Cohen’s paper. The main advance was to use a beautiful concept from MSS of trees of polynomials with interlacing roots and improve it so that the requisite trees have polynomial rather than exponential maximum branch length, which governs the time of the algorithm. The paper well rewards further reading.
What this does is put bipartite Ramanujan graphs onto the list of structures that we can apprehend and use in deterministic polynomial-time algorithms. Thus Cohen added his name to the honor roll of those constructing good expanders and making random objects concrete.
Voevodsky’s work is set against a backdrop where mathematicians do the following over and over again. They start by knowing how to build certain kinds of algebraic structures on, say, differential manifolds or curves. They then want to carry this structure over to more general settings.
Voevodsky won his Fields Medal for this kind of work. He showed how to carry over topological ideas of homotopy from differential manifolds to algebraic manifolds—that is, any manifold that is the zero set of a polynomial. We discussed homotopy and its computational relevance in our own terms here. To quote his 2002 Fields review by Christophe Soulé:
It is quite extraordinary that such a homotopy theory of algebraic manifolds exists at all. In the fifties and sixties, interesting invariants of differentiable manifolds were introduced using algebraic topology. But very few mathematicians anticipated that these “soft” methods would ever be successful for algebraic manifolds. It seems now that any notion in algebraic topology will find a partner in algebraic geometry.
Voevodsky’s medal was also for his proof of a noted conjecture by John Milnor: that a structure of algebraic groups he built on a field k of characteristic other than 2, with the algebra taken mod 2, would be isomorphic to an étale cohomology of k with coefficients mod 2. Voevodsky overcame difficulty with tools from algebraic K-theory by developing and systematizing prior ideas of motivic cohomology that, as the review says, “turned out to be more computable.” He later proved the general conjecture for moduli other than 2, drawing on work by others in the meantime.
In the most ambitious cases of such “carry-overs,” however, mathematicians are able to prove that the objects needed for such structure exist but not concretely. It’s not just that the objects cannot be apprehended, but that these proofs are often not subject to being algorithmically checked.
To remedy this, Voevodsky delved deeper into constructive mathematics, which aims not to limit knowledge but rather to streamline and solidify it. He built up homotopy type theory (HoTT), which we talked about here. His ideas were programmed in the software system Coq, which grew out of Thierry Coquand’s “calculus of constructions” in partnership with Gérard Huet. Thus he was led to consider the foundations of mathematics as deeply as David Hilbert did a century ago.
The term “foundations,” which lives in the names of conferences such as FOCS and MFCS, tends to be spoken as an umbrella term for “theory.” We have argued that it ought to mean continued and concerted attention to the core problems in our field like P versus NP—notwithstanding that many of them are “like” P versus NP in the sense of not having budged for decades. But when Voevodsky talked about foundations, he really meant the foundations: how do we know the whole edifice we have built out of proofs—all kinds of proofs—won’t collapse?
We have blogged about Ed Nelson’s attempts to show that Peano Arithmetic is inconsistent. Voevodsky took this possibility seriously. In memorials to Voevodsky on the HoTT Google Group, André Joyal contributed the following:
My first contact with Vladimir and his ideas was at a meeting in Oberwolfach in 2011. He gave a series of talks on constructive mathematics and homotopy theory, framed as a tutorial with the proof assistant Coq. His notion of a contractible object and of an equivalence were striking. I had a hard time understanding his ideas, because they were described very formally. He apparently distrusted informal expressions of mathematical ideas. One evening, he expressed the opinion that Peano arithmetic was inconsistent! He later came to distrust the applications of his ideas to homotopy theory!
Voevodsky indeed gave a talk at IAS titled, “What If Current Foundations of Mathematics are Inconsistent?” Very controversially, it tries to turn the understanding of Kurt Gödel’s Second Incompleteness Theorem on its head as a vehicle for possibly proving the inconsistency of certain classical first-order theories. In contrast, he concluded:
In constructive type theory, even if there are inconsistencies, one can still construct reliable proofs using the following “workflow”:
- A problem is formalized.
- A solution is constructed using all kinds of abstract concepts. This is the creative part.
- An algorithm which verifies “reliability” is applied to the constructed solution (e.g., a proof). If this algorithm terminates then we know we have a good solution of the original problem. If not, then we may have to start looking for another solution.
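Here is a toy instance of that workflow, machine-checked in Lean rather than Coq (my illustration, not Voevodsky's code): the problem is formalized as a statement, a solution is constructed as a proof term, and the kernel's checking algorithm either accepts it or not:

```lean
-- (1) Formalize the problem: addition of natural numbers is commutative.
-- (2) Construct a solution: a proof term, here built from a library lemma.
-- (3) Verify: Lean's kernel algorithmically checks the term against the
--     statement; acceptance means a reliable solution of the original problem.
theorem sum_comm (m n : Nat) : m + n = n + m :=
  Nat.add_comm m n
```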
Work on this effort will continue. The IAS announcement notes that a memorial workshop is being planned and more information will be available soon. Update 10/7: The IAS posted a full obituary, and there are also obituaries in today’s New York Times and Washington Post.
Again we express our condolences to their families, loved ones, and colleagues, and the same to everyone affected by the horror in Las Vegas.
This is the 750th post on this blog. We were holding onto two other ideas for marking this milestone, while busy with papers and much else these past two weeks ourselves. Those will still come out in upcoming weeks.
Kathryn Farley obtained her PhD from Northwestern University in performance studies in 2007. After almost a decade working in that area, she has just started a Master’s program at New York University in a related field called drama therapy (DT).
Today, I thought I would talk about the math aspects of DT.
Okay so why should I report on DT here? It seems to have nothing in common with our usual topics. But I claim that it does, and I would like to make the case that it is an example of a phenomenon that we see throughout mathematics.
So here goes. By the way—to be fair and transparent—I must say that I am biased about Dr. Farley, since she is my wonderful wife. So take all I say with some reservations.
The whole point is that understanding what DT is is hard, at least for me. But when I realized that it related to math, it became much clearer to me, and I hope that it may even help those in DT to see what they do in a new light. It’s the power of math applied not to physics, not to biology, not to economics, but to a social science. Perhaps I am off and it’s just another example of “when you have a hammer, the whole world looks like a nail.” Oh, well.
I asked Kathryn for a summary of DT and here it is:
Drama therapy uses methods from theatre and performance to achieve therapeutic goals. Unlike traditional “talk” therapy, this new therapeutic method involves people enacting scenes from their own lives in order to express hidden emotions, gain valuable insights, solve problems and explore healthier behaviors. There are many types of DT, but most methods rely on members of a group acting as therapeutic agents for each other. In effect, the group functions as a self-contained theatre company, playing all the roles that a performance requires—playwright, director, actors, stagehands, and audience. The therapist functions as a producer, setting up the context for each scene and soliciting feedback from the audience.
Kathryn’s summary of DT is clear and perhaps I should stop here and forget about linking it to math. But I think there is a nice connection that I would like to make.
Since Kathryn is a student again, and students are assigned readings—there is a lot of reading in DT—you may imagine that she has been sharing with me a lot of thoughts on her readings and classes. I have listened carefully to her, but honestly it was only the other day, in a cab going to Quad Cinema down on 13th St., that I had the “lightbulb moment.” I suddenly understood what she is studying. Perhaps riding in a cab helps one listen: maybe that has been studied before by those in cognitive studies.
What I realized during that cab ride is that DT is an example of a generalization of another type of therapy. If the other therapy involves 2 people—including the therapist—then DT is the generalization to 3 or more. We see this all the time in math, but it really helped me to see that the core insight—in my opinion—is that DT has simply moved from 2 to 3 or more.
We see this type of generalization all the time in math. For example, in communication complexity the basic model is two players sending each other messages. The generalization to more players creates very different behavior. Another example is the rank of a matrix. This is a well understood notion: easy to compute and well behaved. Yet simply changing from a two-dimensional matrix to a three-dimensional tensor changes everything. Now the behavior is vastly more complex and the rank function is no longer known to be easy to compute.
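To make the contrast concrete, here is a small sketch (my own example, in Python with NumPy): computing matrix rank is a single well-behaved library call, while for 3-dimensional tensors no such routine exists, since computing tensor rank is NP-hard in general:

```python
import numpy as np

# Rank of a matrix: easy to compute and well behaved.
M = np.array([
    [1, 2, 3],
    [2, 4, 6],   # twice the first row, so the rank drops to 2
    [1, 0, 1],
])
assert np.linalg.matrix_rank(M) == 2

# A "three-dimensional matrix" is a tensor. NumPy can hold one...
T = np.arange(8).reshape(2, 2, 2)
# ...but np.linalg has no rank routine for it: the minimum number of
# rank-one terms summing to T is not known to be easy to compute.
```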
Here is an example of how DT could work—it is based on a case study Kathryn told me about.
Consider Bob, who is seeing Alice, his therapist. Alice is trained in some type of therapy that she uses via conversations with Bob to help him with some issue. This can be very useful if done correctly.
What DT is doing in letting 2 become 3 or more is a huge step. We see this happen all the time in mathematics—finally the connection. Let’s look at Bob and Alice once more. Now Alice is talking with Bob about an issue. To be concrete let’s assume that Bob’s issue is this:
Bob has been dating two women. His dilemma is, which one should he view as a marriage prospect? He thinks both would go steady with him but they are very different in character. Sally is practical, solid, and interesting; Wanda is interesting too but a bit wild and unpredictable. Whom should he prefer?
The usual talk therapy would probably have Alice and Bob discuss the pros and cons. Hopefully Alice would ask the right question to help Bob make a good decision.
The DT approach would be quite different. Alice would have at least one other person join them to discuss Bob’s decision. This would change the mode from direct “telling” to a more indirect story-line. In that line it might emerge that Bob’s mother is a major factor in his decision—even though she passed away long ago. It might come out that his mom divorced his dad when he was young because he was too staid and level-headed. Perhaps this would make it clear to Bob that his mother was really the reason he was even considering Wanda, the wild one.
What is so interesting here is that by using more than just Bob, by setting the number of people to 3, Alice can make the issues much more vivid for Bob.
The more I think about it, the idea of 3 people involved is the root. Naturally anything with more than two people transits from dialogue to theater. So the aspect of ‘drama’ is not primordial—it is emergent. Once you say 3, what goes down as Drama Therapy in the textbooks flows logically and sensibly—at least it does to me now.
This is accompanied by a phase change in complexity and richness. As such it parallels ways we have talked about mathematical transitions from the case of 2 to 3 on the blog before. Maybe DT even implements a strategy I heard from Albert Meyer:
Prove the theorem for n = 3 and then let 3 go to infinity.
Does this connection help? Does it make any sense at all?
It was just Ken’s birthday
Kenneth Regan’s birthday was just the other day.
I believe I join all in wishing him a wonderful unbirthday today.
The idea of an unbirthday is due to Lewis Carroll in his Through the Looking-Glass, and it is set to music in the 1951 Disney animated feature film Alice in Wonderland. Here is the song:
MARCH HARE: A very merry unbirthday to me
MAD HATTER: To who?
MARCH HARE: To me
MAD HATTER: Oh you!
MARCH HARE: A very merry unbirthday to you
MAD HATTER: Who me?
MARCH HARE: Yes, you!
MAD HATTER: Oh, me!
MARCH HARE: Let’s all congratulate us with another cup of tea A very merry unbirthday to you!
MAD HATTER: Now, statistics prove, prove that you’ve one birthday
MARCH HARE: Imagine, just one birthday every year
MAD HATTER: Ah, but there are three hundred and sixty four unbirthdays!
MARCH HARE: Precisely why we’re gathered here to cheer
BOTH: A very merry unbirthday to you, to you
ALICE: To me?
MAD HATTER: To you!
BOTH: A very merry unbirthday
ALICE: For me?
MARCH HARE: For you!
MAD HATTER: Now blow the candle out my dear And make your wish come true
BOTH: A merry merry unbirthday to you!
Ken is best known for work in theory and in particular in almost all aspects of complexity theory. But I wanted—in the spirit of an unbirthday—to point out that Ken is quite active in many other areas of computer science research. Here is one example that is joint with Tamal Biswas: Measuring Level-K Reasoning, Satisficing, and Human Error in Game-Play Data. We discussed it before here.
The problem is that Ken and Tamal want to be able to study levels of play in chess but are currently stalled by issues Ken raised last May in this blog. I wish them well in making strides toward modeling game play in chess in a way that captures the notion of levels.
The following game created by Ayala Arad and Ariel Rubinstein really helps me understand the kind of thing Ken and Tamal are interested in capturing.
You and another player are playing a game in which each player requests an amount of money. The amount must be (an integer) between 11 and 20 shekels. Each player will receive the amount he requests. A player will receive an additional amount of 20 shekels if he asks for exactly one shekel less than the other player. What amount of money would you request?
The point is there are levels of thinking that a player can naturally go through. Here is a quote from their paper that should give the flavor of what is going on:
The choice of 20 is a natural anchor for an iterative reasoning process. It is the instinctive choice when choosing a sum of money between 11 and 20 shekels (20 is clearly the salient number in this set and “the more money the better”). Furthermore, the choice of 20 is not entirely naive: if a player does not want to take any risk or prefers to avoid strategic thinking, he might give up the attempt to win the additional 20 shekels and may simply request the highest certain amount.
Read the paper for how Arad and Rubinstein analyze the game. The trouble is that if you take a risk and select 19 then you at least have a chance to get the bonus 20: if you reason that your opponent is playing safe, that is a great play. Of course if they reason the same way, then you lose one shekel. This type of “levels” of playing is central to many games, including chess.
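The iterative reasoning can be simulated in a few lines. The sketch below is my own reading of level-k play in the 11-20 game, not code from Arad and Rubinstein's paper: level-0 anchors on the salient, risk-free 20, and each higher level best-responds to the level below it:

```python
def payoff(mine, theirs):
    """Your request, plus the 20-shekel bonus for undercutting by exactly one."""
    return mine + (20 if mine == theirs - 1 else 0)

def best_response(their_choice):
    """The request in 11..20 that maximizes payoff against a fixed opponent."""
    return max(range(11, 21), key=lambda a: payoff(a, their_choice))

# Level-0 chooses the anchor 20; level k best-responds to level k-1.
levels = [20]
for _ in range(4):
    levels.append(best_response(levels[-1]))

# Each level undercuts the one below: 20, 19, 18, 17, 16.
assert levels == [20, 19, 18, 17, 16]
```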
An example is that a move may have one refutation that the computer can spot at high depth but otherwise brings higher returns than playing it safe with the computer’s “best” move. How can we judge when such moves can be expected to pay off? Risky opening ‘novelties’ have been tried many times in chess, and in one famous game where Frank Marshall had saved up a gambit for nine years, the human player José Capablanca did find the refutation at the board.
We all wish that Ken has many more birthdays and unbirthdays. We also hope he makes progress on his open problems about depth of thinking and levels of play. What should you select in the simple coin game?