Baku Olympiad source—note similarity to this
Magnus Carlsen last week retained his title of World Chess Champion. His match against challenger Sergey Karjakin had finished 6–6 after twelve games at “Standard” time controls, but he prevailed 3–1 in a four-game tiebreak series at “Rapid” time controls. Each game took an hour or hour-plus under a budget of 25 minutes plus 10 extra seconds for each move played.
Today we congratulate Carlsen and give the second half of our post on large data being anomalous.
According to my “Intrinsic Performance Ratings” (IPRs), Carlsen played the tiebreak games as trenchantly as he played the standard games. I measure his IPR for them at 2835, though with wider two-sigma error bars of +- 250 versus the +- 135 on the 2835 which I measured for the twelve standard games. Karjakin, however, played the rapid games at a clip of 2315 +- 340, significantly below his mark of 2890 +- 125 for the regular match. The combined mark was 2575 +- 215, against 2865 +- 90 for the match. It must be said that of course faster chess should register lower IPR values. My preliminary study of the famous Melody Amber tournaments, whose Rapid sections had closely similar time controls, finds an overall dropoff slightly over 200 Elo points. Thus the combined mark was close to the expected 2610 based on the average of Carlsen’s 2853 rating and Karjakin’s 2772, less the dropoff. That Carlsen beat his 2650 expectation, modulo the error bars, remains the story.
Carlsen finished the last rapid game in style. See if you can find White’s winning move—which is in fact the only move that avoids losing:
The win that mattered most, though, was on Thanksgiving Day when Carlsen tied up the standard match 5–5 with a 75-move war of attrition. The ChessGames.com site has named it the “Turkey Grinder” game. On this note we resume talking about some bones to pick over “Big Data.”
If you viewed the match on the official Agon match website, you saw a slider bar giving the probability for one side or the other to win. Or rather—since draws were factored in—the slider stands for the points expectation, which is the probability of winning plus half the probability of drawing. This is computed as a function of the value of the position from the player’s side. The beautiful fact—which we have discussed before in connection with a 2012 paper by Amir Ban—is that this relation is an almost perfect logistic curve. Here is the plot for all available (AA) games at standard time controls in the years 2006–2015 with both players within 10 Elo points of the Elo 2000 level:
The “SF7d00” means that the chess program Stockfish 7 was run in Multi-PV mode to a variable depth between 20 and 30 ply. My scripts now balance the total number of positions searched so that endgame positions with fewer pieces are searched deeper. “LREG2” means the generalized logistic curve with two parameters. Using Wikipedia’s notation, I start with

Y(x) = A + (K − A)/(1 + e^(−Bx))

and fix K = 1 − A to symmetrize. Then A is basically the chance of throwing away a completely winning game—and by symmetry, also of winning a desperately lost game.
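As a quick illustration, the symmetrized curve can be coded in a few lines; A and B are the two fitted parameters, with the upper asymptote fixed at 1 − A, and the values used here are hypothetical:

```python
import math

def expectation(x, A, B):
    """Symmetrized two-parameter logistic: points expectation as a
    function of position value x in pawns, with the upper asymptote
    fixed at 1 - A so the curve is symmetric about (0, 0.5)."""
    return A + (1 - 2 * A) / (1 + math.exp(-B * x))

A, B = 0.02, 1.5  # hypothetical parameter values for illustration
# Symmetry: the two sides' expectations always sum to 1...
assert abs(expectation(0.75, A, B) + expectation(-0.75, A, B) - 1.0) < 1e-12
# ...and A is the residual chance at the winning/losing extremes.
assert abs((1 - expectation(50, A, B)) - A) < 1e-6
```

The symmetry check is the point of fixing K = 1 − A: whatever one player's expectation is at value x, the opponent's at −x makes the two sum to 1.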
Chess programs—commonly called engines—output values in discrete units of 0.01 called centipawns (cp). Internally they may have higher precision, but their outputs under the standard UCI protocol are always whole numbers of cp, which are converted to decimal for display. They often used to output a special large value for checkmate, but announcing “mate-in-n” has become standard. I still use fixed cutoff values and divide the x-axis into “slots.”
Positions of value beyond the cutoff belong to the end slots. Under a symmetry option, a position of value v goes into both the slot for v for the player to move and the slot for −v for the opponent. This is used to counteract the “drift” phenomenon discovered in this paper with my students: the player to move has a 2–3% lower expectation across all values—evidently because that player has the first opportunity to commit a game-chilling blunder.
The “b100” means that adjacent slots with fewer than 100 moves are grouped together into one “bucket” whose value is the weighted average of those slots. Larger slots are single buckets rather than divided into buckets of 100. The end slots and zero (when included) are single buckets regardless of size. Finally, the number after “sk” for “skedasticity” determines how buckets are weighted in the regression as I discuss further on.
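The slot-and-bucket grouping can be sketched as follows; the triples and the 100-move threshold follow the text, everything else is a simplification:

```python
def bucket_slots(slots, min_size=100):
    """slots: (value, points_scored, moves) triples in value order.
    Adjacent slots with fewer than min_size moves are grouped into one
    bucket whose value is the moves-weighted average of its slots."""
    buckets, cur = [], []
    for slot in slots:
        cur.append(slot)
        if sum(n for _, _, n in cur) >= min_size:
            n = sum(n for _, _, n in cur)
            x = sum(v * m for v, _, m in cur) / n   # weighted value
            y = sum(p for _, p, _ in cur) / n       # expectation
            buckets.append((x, y, n))
            cur = []
    return buckets  # any undersized tail is simply dropped in this sketch

# Two 60-move slots merge into one 120-move bucket; the 150-move slot
# stands alone, as in the text:
demo = [(0.10, 40.0, 60), (0.20, 45.0, 60), (0.30, 90.0, 150)]
print(bucket_slots(demo))
```

A real run would also handle the end slots and the zero slot as single buckets regardless of size; this sketch only shows the grouping rule itself.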
The y-value of a bucket is the sum of wins plus half of draws by the player enjoying the value (whose turn it might or might not be to move), divided by the size of the bucket. This is regressed to find the A and B giving the closest fit. The slope of the curve at zero is B(1 − 2A)/4. Evaluating the fitted curve at a value of 1.00 gives the expectation when a full pawn ahead—figuratively the handicap at pawn odds. Note how close this is to 70% for players rated 2000.
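A toy version of this regression, with the symmetrized curve and a crude grid search standing in for the real optimizer (all numbers are hypothetical):

```python
import math

def f(x, A, B):
    """Symmetrized logistic with upper asymptote fixed at 1 - A."""
    return A + (1 - 2 * A) / (1 + math.exp(-B * x))

def fit_logistic(buckets):
    """Crude weighted least-squares fit of (A, B) by grid search over
    (value, expectation, size) triples; a real run would use a proper
    optimizer, but a grid suffices to illustrate the idea."""
    best = None
    for A in [a / 1000 for a in range(0, 101)]:      # A in [0, 0.10]
        for B in [b / 100 for b in range(50, 301)]:  # B in [0.5, 3.0]
            sse = sum(n * (y - f(x, A, B)) ** 2 for x, y, n in buckets)
            if best is None or sse < best[0]:
                best = (sse, A, B)
    return best[1], best[2]

# Synthetic buckets generated from A=0.03, B=1.40 (hypothetical values):
data = [(x, f(x, 0.03, 1.40), 400) for x in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)]
A, B = fit_logistic(data)
print(A, B, f(1.0, A, B))  # f(1.0) is the "pawn-odds" expectation
```

On this noiseless toy data the grid recovers the generating parameters exactly, and f(1.0) lands in the 70–80% range the text describes.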
The fit is amazingly good—even after allowing that the r² value, so astronomically close to 1, is benefiting from the correlation between positions from the same game, many having similar values. Not only does it give the logistic relationship the status of a natural law (along lines we have discussed), but Ban also argues that chess programs must conform to it in order to maximize the predictive power of the values they output, which transmutes into playing strength. The robustness of this law is shown by this figure from the above-linked paper—being rated higher or lower than one’s opponent simply shifts the curve left or right:
This is one of several reasons why my main training set controls by limiting to games between evenly-rated players. (The plots are asymmetric in the tail because they grouped buckets from one end upward rather than coming in from both ends as the present ones do.)
Most narrowly to our goal, the B value determines the scale by which increases in value translate into greater expectation, more directly than the other fitted quantities do. Put simplistically, if a program values a queen at 10 rather than 9, one might expect its B to adjust by 9/10. Early versions of Stockfish were notorious for their inflated scale. The goal is to put all chess programs on a common scale by mapping all their values to points expectations—and Ban’s dictum says this should be possible. By putting sundry versions of Stockfish, Komodo, and Houdini (which placed second to Stockfish in the just-concluded ninth TCEC championship) on the same scale as my earlier base program Rybka 3, I should be able to carry over my model’s trained equations to them in a simple and direct manner. Here is the plot for Komodo 10’s evaluations of the same 100,000+ game positions:
The fit is just as fine. The A values are small and equal to within the error bars, so they can be dismissed. The B value for Komodo against that for Stockfish gives a ratio of about 1.046. The evaluations for 70% expectation by Komodo and by Stockfish have almost the same ratio to three decimal places. So we should be able to multiply Komodo’s values by 1.046 and plug them into statistical tests derived using Stockfish, right?
The error bars on Komodo’s B, which are two-sigma (a little north of “95% confidence”), give some pause because they allow about 2% of wiggle. This may seem small, but recall the also-great fit of the linear regression from (scaled) player error to Elo rating in the previous post. Under that correspondence, 2% error translates to 2 Elo points for every 100 below perfection—call that 3400. For Carlsen and Karjakin flanking 2800 that means only 12 Elo, but it grows to 28 for 2000-level players. Here is a footnote on how the “bootstrap” results corroborate these error bars and another data pitfall they helped avoid.
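Reading the 2%-to-2-Elo rule as linear, the arithmetic behind this caveat can be checked directly (3400 is the nominal rating of perfection from the regression):

```python
def elo_wiggle(rating, scale_error_pct, perfection=3400):
    """Rule of thumb from the text, read linearly: each 1% of error in
    the value scale costs 1 Elo point per 100 rating points below the
    nominal rating of perfect play."""
    return (perfection - rating) / 100 * scale_error_pct

print(elo_wiggle(2800, 2))  # 12.0, around the Carlsen-Karjakin level
print(elo_wiggle(2000, 2))  # 28.0, at the Elo-2000 level
```

The same rule with a 5% gap reproduces the 70-point difference for Elo-2000 players quoted further below.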
But wait a second. This error-bar caveat is treating Komodo’s B as independent from Stockfish’s B. Surely they are completely systematically related. Thus one should just be able to plug one into the other with the conversion factor and get the same proportions everywhere, right? The data is huge, and both the logistic and ASD-to-Elo regressions this touches on have r² near 1 and the force of natural law. At least the “wiggle” can’t possibly be worse than these error bars say, can it?
Here are side-by-side comparison graphs with Stockfish and Komodo on the same set of positions played by players within 10 Elo points of 1750.
Now the Komodo B is lower. Here is a plot of the B-values for Komodo and Stockfish over all rating levels, together with the Komodo/Stockfish ratio:
The ratio waddles between 0.96 and 1.06 with a quick jag back to parity for the 2700+ elite players. This uncertainty bespeaks a gap of 5 Elo points for every 100 under perfection, which makes a considerable 70-point difference for Elo-2000 players.
Well, we can try clumping the data into huger piles. I threw out data below 1600 and the 2800 endpoint—which has lots of Carlsen but currently excludes Karjakin since his 2772 is below 2780. I combined blocks of four levels at 1600–1750, 1800–1950, up to 2600–2750, and quadrupled the bucket size to match. Here is the plot for 2200–2350, with a move-weighted average of 2268:
With over 500,000 data points, mirrored to over a million, can one imagine a more perfect fit to a logistic curve? With Stockfish the r² value even prints as unity. And yet, this is arguably the worst offender in the plot of B over these six piles:
The point for 2600–2750 goes down. It is plotted at 2645 since there are far more players in the 2600s than the 2700s, and it must be said that the 2400–2550 pile has its center at 2488, north of 2475, because 2550 included all years whereas the 2000–2500 range starts in the year 2006. But the data point for 2200–2350 is smack in the middle of its range. Why is it so askew that neither regression line comes anywhere near the error bars for the data taken with the respective engine?
Getting a fixed value for the ratio is vital to putting engines on a common scale that works for all players. The above is anything but—and I haven’t even told what happens when Rybka and Houdini enter the picture. It feels like the engines diverge not based on their evaluation scales alone but on the differences in their values for inferior moves that human players tend to make, differences that per the part-I post correspond almost perfectly to rating. Given Amir Ban’s stated imperative to conform any program’s values to a logistic scale in order to maximize its playing strength, and the incredible fit of such a scale at all individual rating levels, how can this be?
I get similar wonkiness when I try to tune the ratio internally in my model, for instance to equalize IPRs produced with Komodo and Stockfish versions to those based on Rybka 3. There is also an imperative to corroborate results obtained via one engine in my cheating tests by executing the same process with test data from a different engine. This has been analogized to the ‘A’ and ‘B’ samples in doping tests for cycling, though those are taken at the same time and processed with the same “lab engine.”
I had hoped—indeed expected—that a stable conversion factor would enable the desirable goal of using the same model equations for both tests. I’ve become convinced this year that instead it will need voluminous separate training on separate data for each engine and engine version. A hint of why comes from just looking at the last pair of Komodo and Stockfish plots. All runs skip the bucket for an exact 0.00 value, which by symmetry always maps to 0.50. Its absence leaves a gap in Komodo’s plot, meaning that Komodo’s neighboring nonzero values carry more weight of imbalance in the players’ prospects than do values like 0.01 or -0.02 coming from Stockfish. The data has 48,693 values of 0.00 given by Komodo 10 to only 43,176 given by Stockfish 7. By contrast, Komodo has only 42,350 values in the adjacent ranges -0.10 to -0.01 and +0.01 to +0.10 (before symmetrizing) to 47,768 by Stockfish. The divergence in plot results may be amplified by the “firewall at zero” phenomenon I observed last January. The logistic curves are dandy but don’t show the cardinalities of buckets, nor other higher-moment effects.
In the meantime I’ve been using conservative ratios for the other engines relative to Rybka. For example, my IPRs computed in such manner with Komodo 10 are:
These are all 70–100 points lower than the values I gave using Rybka. Critics of the regular match games in particular might agree more with these than my higher official numbers, but this needs to be said: When I computed the Rybka-based IPR for the aggregate of moves in all world championship matches since FIDE’s adoption of Elo ratings in 1971, and compared it with the move-weighted average of the Elo ratings of the players at the time of each match, the two figures agreed within 2 Elo points. Similarly weighting the IPRs for each match in my compendium gives almost the same accuracy.
That buttresses my particular model, but the present trouble happens before the data even gets to my model. Not even the scaling stage discussed in the last post is involved here. This throws up a raw existential question.
Much of data analytics is about “extracting the signal from the noise” when there is initially a lot of noise. Multiple layers of standard filters are applied to isolate phenomena. But here we are talking about raw data—no filters. All we have observed are the smooth linear correspondence between chess rating and average loss of position value and the even more perfect logistic relation between position value and win/draw/loss frequency. All we did was combine these two relations. The question is:
How did I manage to extract so much noise from such nearly-perfect signals?
Can you see an explanation for this wonkiness in my large data? What caveats for big-data analytics does it suggest?
The chess answer is that Carlsen played 50.Qh6+!! and Karjakin instantly resigned, seeing that 50…Kxh6 allows 51.Rh8 mate, while after 50…gxh6 the other Rook delivers 51.Rxf7 mate.
A head-scratching inconsistency in large amounts of chess data
Slate source
Benjamin Franklin was the first American scientist and was sometimes called “The First American.” He also admired the American turkey, counter to our connotation of “turkey” as an awkward failure.
Today I wonder what advice Ben would give on an awkward, “frankly shocking,” situation with my large-scale chess data. This post is in two parts.
A common myth holds that Franklin advocated the turkey instead of the bald eagle for the Great Seal of the United States. In 1784, two years after the Great Seal design was approved over designs that included a bird-free one from Franklin, he wrote a letter to his daughter saying he was happy that the eagle on an emblem for Revolutionary War officers looked like a turkey. Whereas the eagle “is a Bird of bad moral Character [who] does not get his Living honestly” and “a rank coward,” the turkey is “in Comparison a much more respectable Bird, […and] though a little vain & silly, a Bird of Courage.” The Tony-winning 1969 musical 1776 cemented the myth by moving Franklin’s thoughts back eight years.
More to my point is a short letter Franklin wrote in 1747 at the height of his investigations into electricity. In his article “How Practical Was Benjamin Franklin’s Science?,” I. Bernard Cohen summarizes it as admitting “that new experimental data seemed not to accord with his principles” and quotes (with Franklin’s emphasis):
“In going on with these Experiments, how many pretty systems do we build, which we soon find ourselves oblig’d to destroy! If there is no other Use discovered of Electricity, this, however, is something considerable, that it may help to make a vain Man humble.”
My problem, however, is that the humbling from data comes before the stages of constructing my system. Cohen moves on to the conclusion of a 1749 followup letter and ascribes it to Franklin’s self-deprecating humor:
“Chagrined a little that we have been hitherto able to produce nothing in this way of use to mankind; [yet in prospect:] A turkey is to be killed for our dinner by the electrical shock, and roasted by the electrical jack, before a fire kindled by the electrified bottle: when the healths of all the famous electricians in England, Holland, France, and Germany are to be drank in electrified bumpers, under the discharge of guns from the electrical battery. ”
My backbone data set comprises all available games compiled by the ChessBase company in which both players were within 10 points of the same century or half-century mark in the Elo rating system. Elo ratings, as maintained by the World Chess Federation (FIDE) since 1971, range from Magnus Carlsen’s current 2853 down to 1000 which is typical of novice tournament players. National federations including the USCF track ratings below 1000 and may have their own scales. The rating depends only on results of games and so can be applied to any sport; the FiveThirtyEight website is currently using Elo ratings to predict NFL football games. Prediction depends only on the difference in ratings, not the absolute numbers—thus FiveThirtyEight’s use of a range centered on 1500 does not mean NFL teams are inferior to chess players. The linchpin is that a difference of 200 confers about a 75% expectation for the stronger player.
The difference of 81 to Sergey Karjakin’s 2772 gave Carlsen about a 61% points expectation, which FiveThirtyEight translated into an 88% chance of winning the match. This assumed that tiebreaks after a 6-6 tie—the situation we have today—would be a coinflip.
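For readers who want to check the 61% figure, the standard Elo expectation formula (public knowledge, not part of my model) suffices:

```python
def elo_expectation(diff):
    """Points expectation for the higher-rated player under the
    standard Elo logistic formula."""
    return 1 / (1 + 10 ** (-diff / 400))

print(round(elo_expectation(81), 3))   # Carlsen's 81-point edge
print(round(elo_expectation(200), 3))  # the 200-point "linchpin"
```

The 81-point difference lands a shade above 61%, and the 200-point difference lands just under 76%, matching the linchpin figure in the text.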
A chief goal of my work—besides testing allegations of human players cheating with computers during games—is to measure skill by analyzing the quality of a player’s moves directly rather than only by results of games. A top-level player may play only 100 games in a given year, a tiny sample, but those games will furnish on the order of 3,000 moves—excepting early “book” opening moves and positions where the game is all-but-over—which is a good sample. The 12 match games gave me an even better ratio since several games were long and tough: 517 moves for each player. My current model assesses Karjakin’s level of play in these games at 2890 +- 125, Carlsen’s at 2835 +- 135, with a combined level of 2865 +- 90 over 1,034 moves. The two-sigma error bars ward against concluding that Karjakin has outplayed Carlsen, but they do allow that Karjakin brought his “A-game” to New York and has played tough despite being on the ropes in games 3 and 4. No prediction for today’s faster-paced tiebreak games is ventured. (As we post, Carlsen missed wins in the second of four “Rapid” paced games; they are playing the third now still all-square.)
These figures are based on my earlier training sets from the years 2006–2013 on Elo century points 2000 through 2700, in which I analyzed positions using the former-champion Rybka 3 chess program. Rybka 3 is now far excelled by today’s two champion programs, called Komodo and Stockfish. I have more than quadrupled the data by adding the half-century marks and using all years since 1971, except that the range 2000-to-2500, with by far the most published games, uses the years 2006–2015. In all it has 2,926,802 positions over 48,416 games. The milepost radius is widened from 10 to 15 Elo points for the levels 1500–1750 and 2750, to 20 for 1400–1450 and 2800, and to 25 for 1050–1350. All levels have at least 20,000 positions except 1050–1150, while 2050–2300 and 2400 have over 100,000 positions each and 2550 (which was extended over all years) has 203,425.
One factor that goes into my “Intrinsic Performance Ratings” is the aggregate error from moves the computer judges were inferior. All major programs—called engines—output values in discrete units of 0.01 called centipawns. For instance, a move value of +0.48 leaves the player figuratively almost half a pawn ahead, while -0.27 means a slight disadvantage. If the former move is optimal but the player makes the latter move, the raw difference is 0.75. Different engines have their own scales—even human chess authorities differ on whether to count a Queen as 9 or 10—and the problem of finding a common scale is the heart of my turkey.
Here are my plots of average raw difference (AD) over all of my thirty-six rating mileposts with the official Komodo 10.0 and Stockfish 7 versions. Linear regression, weighted by the number of moves for each milepost, was done from AD to Elo, so that the rating of zero error shows as the y-intercept.
Having r² values this high means these are fantastic fits—although some “noise” is evident below the 1900 level and more below 1600, it straddles the fit line well until the bottom. Although the range for Elo 2000 through 2500 is limited to years after 2006, there is no discontinuity with neighboring levels which include all years. This adds to my other evidence against any significant “rating inflation”—apart from a small effect explainable by faster “standard” time controls since the mid-1990s, the quotient from rating to intrinsic quality of play has remained remarkably stable.
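A minimal sketch of the weighted AD-to-Elo fit, with made-up milepost numbers in place of the real thirty-six (the intercept plays the role of the “rating of zero error”):

```python
def weighted_linfit(xs, ys, ws):
    """Weighted least-squares line y = a + b*x, returning (a, b).
    With x = average raw difference (AD) and y = Elo, the intercept a
    is the extrapolated rating of zero average error."""
    W = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / W
    my = sum(w * y for w, y in zip(ws, ys)) / W
    b = (sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys)) /
         sum(w * (x - mx) ** 2 for w, x in zip(ws, xs)))
    return my - b * mx, b

# Made-up (AD, Elo, move-count) mileposts, contrived to lie on a line:
ad, elo = [0.20, 0.14, 0.10, 0.06], [1800, 2220, 2500, 2780]
moves = [50000, 100000, 100000, 20000]
a, b = weighted_linfit(ad, elo, moves)
print(a, b)  # intercept is the toy "rating of perfect play"
```

Because the toy points are exactly collinear, the weights do not change the answer here; with real, noisy mileposts they pull the line toward the heavily-populated rating levels.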
The scales between Komodo and Stockfish are also quite close. I am using Stockfish as baseline since it is open-source; to bring Komodo onto its scale here suggests multiplying its values by a single constant close to 1. The first portent of trouble from higher moments, however, comes from the error-bar ratio being tangibly different.
Komodo 10 and Stockfish 7 agree within their error bars on the -intercept, but both place the “rating of perfect play” at most 3200. This is markedly below their published ratings of 3337 and 3354 on the CCRL rating list for the 64-bit single-core versions which I used. This is for “semi-rapid” chess but is meant to carry over to standard time controls. The ratings of slightly later versions on TCEC for standard time controls are both about 3230. This is the source of my quip on the Game 7 broadcast about the appearance of “computers being rated higher than ‘God’.”
A second issue is shown by graphing the average raw error as a function of the overall position value (i.e., the value of an optimal move) judged by the program, for players at any one rating level. Here they are with Stockfish 7 for the levels 1400, 1800, 2200, and 2600 (a few low-weight high outliers near the -4.0 limit have been scrubbed):
If taken at face value, this would say e.g. that 2600-level players, strong grandmasters, play twice as badly (0.12 error) when they are 0.75 ahead as when the game is even (0.06). Tamal Biswas and I found evidence against a claim that this effect is rational. Hence it is a second problem.
What has most immediately distinguished mine from others’ work since 2008 is that I correct for this effect by scaling the raw errors. My scaling function applies locally to each move, using only its value and the overall position value of the best move. I regard it as important that the function is “oblivious” to any regression information hinting at the overall level of play. Here are the results of applying it for Stockfish at the Elo 1800 and 2200 levels:
The plot for Elo 2050 is almost perfect, while the flattening remains good, especially on the positive side, throughout the range 1600–2500, which has almost all the data. I call the modified error metric ASD for average scaled difference, which is in units of “PEPs” for “pawns in equal positions,” since the correction factor has value 1 when the position is equal. Details are as in my papers, except that upon seeing a “firewall” effect become more glaring with larger data, I altered its coefficients between the cases of positive and negative position value. Here are the resulting plots of ASD versus Elo with Komodo 10 and Stockfish 7:
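To be clear, my actual scaling function is given in my papers; here is only an illustrative stand-in that shows the idea of damping errors by the position value, with the factor equal to 1 in equal positions:

```python
def scaled_diff(raw_diff, best_value):
    """Illustrative damping only, NOT the actual function from the papers:
    shrink the raw error by the magnitude of the overall position value,
    so mistakes in lopsided positions count less. At best_value == 0 the
    factor is 1, hence the unit "pawns in equal positions" (PEPs)."""
    return raw_diff / (1 + abs(best_value))

print(scaled_diff(0.75, 0.0))  # unchanged when the game is level
print(scaled_diff(0.75, 2.0))  # damped when one side is far ahead
```

Any function with this shape flattens the curves above; the particular coefficients are what the training determines.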
The fits are even better, with the previous “noise” at lower Elo levels optically much reduced. The y-intercepts are now near 3400. This removes the previous conflict with ratings of computers but still leaves little headroom for improving them—an issue I discussed a year ago.
Most to the point, however, is that choosing to do scaling made a hugely significant difference in the intercept. The scaling is well-motivated, but the AD-to-Elo fit was already great without it. I could have stopped and said my evidence—from very large data—pegs perfection at 3200. This prompts us to ask:
How often does it happen that data-driven results have such a degree of hidden arbitrariness?
I am sure Ben Franklin would have had some wise words about this. But we haven’t even gotten to the real issue yet.
What lessons for “Big Data” are developing here? To be continued after the match playoff ends…
[made figure labels more consistent with text, updated data size figures]
Ken and I wish to thank all who read and follow us. May you have a wonderful day today all day.
But we would like to pose a basic question about teaching complexity theory: Theorems vs. Proofs.
Because today is a national holiday in the US, I am not teaching my class on complexity theory, nor is Ken teaching his. I like the class, but I do enjoy the time off from lecturing. Still it seems like a time to reflect on a simple question about teaching.
Today is, of course, Thanksgiving Day in the US. We watch parades, really mainly the Macy’s Thanksgiving Day Parade; we watch football, that is, NFL-style football; and we watch our waistlines expand as we eat too much wonderful food—my favorite is the turkey, covered in gravy and served with mashed potatoes.
So while you are enjoying your day, let Ken and me ask you a simple question.
What we are interested in is this: Is it as important to know the statement of a theorem as it is to know the proof of the theorem?
When teaching, I think we almost always follow this paradigm:
Thus our question is: Can we skip presenting the proof? Do students still learn something important if they know only the statement of a theorem, but not the proof—or even an outline of a proof? I have wondered over the years of teaching, especially a course like complexity theory, whether we must give both theorem statements and proofs.
There are of course many situations in math where we know the theorem but not the proof. Perhaps the most famous example is the classification of finite simple groups. This theorem gets used by theory papers, but I believe that almost no one applying it knows the proof. You could argue that this is an extreme example, but many others come to mind: the famous regularity theorem of Endre Szemerédi can, I believe, be used without knowing the proof. This makes me wonder whether it would be worthwhile to increase the material I present in class by proving only a small subset of the theorems.
I (Ken) am teaching our graduate theory of computation course. This course was until recently required of all PhD students. I still teach it for non-specialists and with emphasis on how to craft a technical argument and write an essay answer—skills for thesis writing in general.
I present some proofs in full and skip or “handwave” others. My full proofs highlight algorithmic ideas and logical structure. For instance, I explain how the proof that nondeterministic space is contained in deterministic time embodies breadth-first search, while nondeterministic time being in deterministic space can be treated as depth-first search. I fold together the proofs of the deterministic space and time hierarchy theorems while diagramming the offline universal simulation they embody. In proving the PSPACE-completeness of TQBF I highlight how re-using variables turns a double-branch recursion into a single branch, and state what I call a “modified proverb of Lao-zi”:
A journey of a thousand miles has a step that is exactly 500 miles from the beginning and 500 miles from the end.
I skip, however, most of the proof of the simulation of a k-tape Turing machine M by a two-tape oblivious Turing machine M′. What I show is the division of the first tape into numbered blocks of cells
and the following sequence of “jags”:
Each “jag” for a number j begins at cell 0, goes out to cell 2^j, then crosses 0 on the way to cell −2^j, and returns to 0. I explain that each jag simulates one step of M, and finally show or state that the total number of steps by M′ up to the t-th jag is O(t log t).
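Assuming the jags follow a binary “ruler” schedule, with the j-th-order jag spanning 2^j cells each way (a simplification for illustration; the construction in lecture may differ in details), the O(t log t) step count can be checked numerically:

```python
def jag_order(i):
    """Order of the i-th jag (i >= 1) under an assumed binary "ruler"
    schedule: the number of trailing zeros of i in binary."""
    return (i & -i).bit_length() - 1

def total_steps(t):
    """A jag of order j walks from cell 0 out to 2**j, across to -2**j,
    and back: roughly 4 * 2**j head moves."""
    return sum(4 * 2 ** jag_order(i) for i in range(1, t + 1))

t = 1024
print(total_steps(t))  # grows like t * log t, far below t * t
```

For t = 1024 the total is 24,576 head moves, versus over a million for a quadratic simulation, which is the whole point of the oblivious construction.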
I prove the theorem of Walter Savitch that nondeterministic space s(n) is contained in deterministic space O(s(n)²), but only state the Immerman–Szelepcsényi theorem that nondeterministic space is closed under complements. That proof I would reserve for an advanced graduate course. Overall I like to highlight a “message” in each proof, such as “software can be efficiently burned into hardware” for the simulation of Turing machines by circuits. This sets up the circuit-based version of the NP-completeness of SAT, which illustrates formal verification of hardware, and subsequent NP-completeness theorems as showing how many combinatorial mechanisms embody formal logic in turn.
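A sketch of the midpoint recursion at the heart of Savitch's proof, phrased for a small explicit graph of configurations (the graph here is hypothetical):

```python
def reach(graph, u, v, steps):
    """Savitch-style midpoint recursion: can v be reached from u in at
    most `steps` edges? Recursion depth is O(log steps), the source of
    the squared-space bound; time is recomputed wastefully, as allowed."""
    if steps <= 1:
        return u == v or (steps == 1 and v in graph.get(u, ()))
    half = steps // 2
    return any(reach(graph, u, w, half) and reach(graph, w, v, steps - half)
               for w in graph)  # try every node as the midpoint

g = {1: [2], 2: [3], 3: [4], 4: []}
print(reach(g, 1, 4, 4), reach(g, 4, 1, 4))
```

Each stack frame stores only the endpoints and the step budget, so for a machine with 2^O(s) configurations the depth-times-frame-size product gives the O(s²) space bound.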
Enjoy today. If you have a moment between watching the games and eating and other activities please let us know about your thoughts on theorems vs. proofs.
Dick and I will be on Sunday’s game telecast
Business Insider source
Magnus Carlsen of Norway and Sergey Karjakin of Russia are midway through their world championship match in New York City. The match is organized by Agon Limited in partnership with the World Chess Federation (FIDE).
Tomorrow, Sunday—early today as I post—at 2pm ET is Game 7 with the match all square after six hard-fought draws. Dick and I are in New York City and will be on the telecast streamed by the sponsoring website, WorldChess.com. A one-time $15 charge brings access to that and all remaining games.
The match is being covered by major media. The movie documentary “Magnus” opened yesterday. I was also struck by game-by-game coverage on the FiveThirtyEight website, including a post Saturday titled, “Are Computers Draining the Beauty Out Of Chess?”
With us on the gamecast will be Murray Campbell of IBM Watson. He was one of the creators of the machine Deep Blue, which famously defeated Garry Kasparov in early 1997. Since then no human player has battled a computer on even terms, while both software and hardware have improved to the point that Kasparov would probably lose to his phone. That is why I have helped draft rules against smartphones in tournament halls and much else as an official consultant of a FIDE Commission to combat cheating, whose chair Israel Gelfer shared lunch with Dick and Kathryn Farley and me earlier today. I will be wearing my deep-blue dress shirt in tribute.
The players occupy a cubicle behind a partition from the main audience. Ever since the 2006 championship reunification match in which Veselin Topalov accused Vladimir Kramnik of getting computer help, and mindful of past whispers about signals, FIDE has reserved the option of forestalling any possible audience input. Cameras show the on-board action. Expert commentators give running analysis for those onsite and the Internet audience. The broadcast team is anchored by Judit Polgar, who in 2005 was the first woman to compete in a round-robin tournament for the FIDE world title, and television journalist Kaja Snare, who previously worked for Norway’s TV2 network.
The games start at 2pm. Each player has a budget of 100 minutes for the first 40 moves plus 30 seconds “increment” after each move played, so four hours may elapse before the game reaches move 40. Then 50 minutes plus the increment are allotted until move 60, then a final 15 minutes plus the increment for the rest of the game. Although 40 is a typical game length, the six draws have averaged 55 moves per game. Games 3 and 4 saw Karjakin hold out for 78 and 92 moves in positions that at times were desperate. Those games were said to have kept Norwegian government ministers up until 3am and slowed the country. Friday’s game, however, finished early—and it must be said as caveat that a short game could cut the time for any of us on the broadcast.
Carlsen is rated 2853 on the Elo rating system, which is 2 points above the record high previously held by Kasparov but about 30 below Carlsen’s own peak. Karjakin is at 2772, which makes him a slight but definite underdog. Arpad Elo designed his rating system in 1960 for the United States Chess Federation and it was adopted by FIDE in 1970. Only relative numbers matter: a linchpin is that a 200-point difference reflects and predicts the stronger player taking about 75% of the points.
The change in one’s Elo rating after a tournament or match depends only on one’s win-draw-loss record and the ratings of one’s opponents. This simplicity makes it easily adaptable to other sports, and FiveThirtyEight uses Elo for their in-house predictions of football games and baseball series among other games. My own work, however, gauges a player’s performance on the Elo scale directly by analysis of the moves he or she played—within a deeper analysis of the moves not played. On that scale I have Carlsen and Karjakin playing dead-even at a very high level, though with considerable 95% confidence error bars:
Carlsen 2880 +- 165; Karjakin 2875 +- 170.
This is reflected also in a less-intensive “screening run” I have devised for quick assessment of large tournaments. It produces a value I call ROI for “Raw Outlier Index” on a 0–100 scale, where 50 is the expected agreement with a particular computer program given one’s rating. My tests using the Stockfish 7 and Komodo 10.2 programs both give the players a combined ROI of 51, with Stockfish giving them 51 apiece. I look forward to explaining how one can design a model that gets things yea-close.
Who will win? Will either one win tomorrow’s game? We welcome you to catch the action.
Update 11/21: As it happened, the game was drawn 15 minutes into the segment where I was appearing—just when I was affirming an opinion by Judit Polgar about human-computer teamwork by pointing to my joint results on “Freestyle” chess. Dick and Murray did not appear. It was still a great experience.
Update 11/22: Karjakin sensationally won yesterday’s game to take a 4.5–3.5 lead with 4 games to play. Updated IPR figures: Karjakin 2845 +- 160, Carlsen 2760 +- 180.
Update 11/25: Carlsen evened the match to stand 5–5 after 10 games. IPRs are now Carlsen 2825 +- 145, Karjakin 2875 +- 135, combined 2850 +- 100.
[added caveat about short games, note on broadcast team]
Cropped from source |
Nate Silver has gone out on a limb. Four years ago we posted on how the forecast of his team at FiveThirtyEight jived with polls and forecasts by other poll aggregators. This year there is no jive.
Today, Election Day in the USA, we discuss the state of those stating the state of the election.
FiveThirtyEight has the election much closer than most of the other forecasters do. But Silver is no “nut”—last election, in 2012, he was right about the winner of all 50 states and the District of Columbia.
As of their Tuesday morning update, they gave Donald Trump almost a 30% chance of winning, against 70% for Hillary Clinton. For contrast, the Princeton Election Consortium site of Sam Wang and Julian Zelizer has had Clinton over 99% probability in both its “random drift” and “Bayesian” measures, and the Huffington Post gave her 98.2%. Nate Cohn’s New York Times Upshot model put Trump with a 16% chance, but that is still only half what FiveThirtyEight has been giving him. The next-higher numbers in forecasts compared here gave Trump 12% and 11%. Senate forecasts have had similar disparity.
This past weekend, Silver was called out by Ryan Grim in a Huffington article titled, “Nate Silver Is Unskewing Polls—All Of Them—In Trump’s Direction.” The term “unskewing polls” means altering assumptions about the makeup of polling samples to correct perceived bias. In 2012 the complaints of bias in the data used by Silver came mainly from the Republican side and were proved wrong by the results. This year the thunder about numbers seems all on the left.
The main difference cited by Silver is the higher number of voters telling pollsters they are undecided or supporting third-party candidates compared to 2012. There is also greater uncertainty about the effects of news developments such as releases by Wikileaks, the FBI investigation into Clinton’s e-mail server, Obamacare premium hikes, and scandalous past behavior by various people.
There have also been greater movements in polls. Here is the graph of Silver’s forecasts from 2012, when FiveThirtyEight was a blog of the New York Times:
The one counter-trend came after Barack Obama’s poor performance in the first debate with Mitt Romney. There is no evidence that Hurricane Sandy had any effect at the end of October 2012. Now here is the current graph of FiveThirtyEight’s odds over the past few months:
The first sharp movement was registered the week after FBI Director James Comey’s July 5 press conference characterizing Clinton’s e-mail use as “reckless” but not indictable. That brought FiveThirtyEight’s model to parity on July 30, two days after the end of the Democratic convention, but polls completed the next week shot back and continued amid Trump’s unseemly tangling with Khizr and Ghazala Khan. A long trend back to parity, perhaps accelerated by Clinton’s “bad weekend” of Sept. 9–11, bounced again following the first debate on Sept. 26th. The past four weeks have seen a rounding turn into a slide correlated with the Oct. 28 FBI letter re-opening the e-mail investigation of Clinton, and just in the past two days a 7-point jag. The New York Times shows similar movements but not as sharp:
Others have similar graphs. What go into these aggregate models are the polls, and by and large the polls have shown similar movements. Hence I think the key this time is not unskewing the polls but rather the electorate.
I’ve been musing on the possible relevance of freighted phenomena I’ve found while extending my chess model since spring. Heretofore I’ve focused on projecting the best moves; now I want to refine accurate projections for all the moves in a given position. Doing so will confer authority on statistical tests for whole categories of moves—such as captures, moves with Knights, moves that advance or retreat, and moves within a given range of inferiority.
A year ago I reported on work with my student Tamal Biswas, who is now on the University of Rochester faculty after defending his dissertation in July, on implementing a parameter for “depth of thinking.” Computer chess programs all work in rounds of increasing depth of search, and this furnishes an axis of time for human players thinking in the same positions.
Our papers linked from that post show that swings in a program’s value for a given move as the search progresses correlate mightily with the frequency of the human players choosing (or having chosen) that move. For instance, we noted that even for the world’s best players, the frequency can range from 30% to 70% depending only on a numerical measure of the swing formulated by Tamal, with the ultimate value of the move in relation to values of alternative moves being held equal. The swing measure also perfectly numerically explains a puzzling “law” which I posted about four years ago.
Last year’s post, however, also reported extreme difficulties with modeling a depth-of-thinking parameter directly. Hence we’re trying a simpler tack of fitting a multiplier $h$ on the swing quantity. The ‘h’ is for “heave” by analogy with a ship riding above or below the water line. My usage is not quite “nautically correct”: a ship will heave to for stability in wavy seas, whereas my $h$ measures the tendency to be carried away by them. But my modeling supports the following interpretation:
A value h > 1 means that the player(s) are influenced more strongly by swings in values than by the ultimate objective values themselves.
Where previously I had a term relating the difference in value between a move and the machine’s best move to my model’s “sensitivity” parameter $s$, now I have analogous terms involving $h$ as well. The swing measure is formulated as an average of values over all depths of search, so I am confident that its units support the interpretation. There are further wrinkles according to whether the overall position value and/or the swing values are negative, and they are all immersed in only-halfway-better forms of the above-mentioned fitting difficulties, so anything I say now is preliminary. But what I am seeing seems consistent enough to report the following:
For chess players of all Elo ratings from novice levels 1050, 1100, 1150, 1200, … to the world championship standard of 2800, the h values are by-and-large all in the range 1.3 to 2.3, and concentrated in 1.5 to 2.0.
I can’t even yet say that I have a regular progression by rating, even though outside the levels 2000 through 2500 (which are most heavily populated among the millions of anthologized games), my training sets have all available games between players at each level (within 10-to-25 Elo points depending), giving tens to hundreds of thousands of data points for each level.
My original model has neatly linear progressions in $s$ and in a second “consistency” parameter $c$. A second indication that the “high-heave” phenomenon is real is that the three-parameter fits which I obtained in August make the $s$ progression steeper and throw the $c$ progression into retrograde as a damper. This unwelcome latter fact is a prime reason for tinkering further, besides the fitting landscape being no longer benign.
Thus I believe my model is currently being mathematically inconvenienced by people’s tendency to play moves on impulse and react to (changes in) trend. The measure ticks up when a move suddenly looks better at depth $d$ than it did at depth $d-1$. Results in the papers with Tamal so far support the idea that humans considering such moves experience a corresponding uptick in their estimation. From my own games I recall times I’ve played a move when it suddenly “improved,” then regretted not thinking more on whether it was really better than alternatives.
To repeat, the chess work has not yet reached the point of fully substantiating the effect of swings in value. It is however enough to make me wonder when I see things like FiveThirtyEight’s graph of the race for party control of the Senate:
Are respondents being influenced more strongly by “political weather” than by a prior valuation of their candidates? Note especially the inflection after Comey’s Oct. 28 letter.
The polls are still open in many places as we post, and we have much less idea than we thought four and eight years ago of how things will shake out. Even after all votes are counted it may be hard to tell whether Silver was closer than the others. A strong Clinton win could be carried by the last-day upswing noted in FiveThirtyEight’s graph above, noting also its absence in the Senate graph. Let alone that the election might not be over by tomorrow, to judge by the squeaker in 2000, it will certainly take a long time to parse and “unskew” the election results.
How will we analyze the results of this election? And of course, who will win?
Update 11/9: As it shook out, Silver was merely the least wrong. The USC Dornsife / LA Times poll was distinctive in showing Trump ahead most of the time:
Likewise the Investor’s Business Daily / TechnoMetrica Market Intelligence poll. But even these need to be squared with Clinton’s evidently winning the popular vote. Update 11/10: Silver has a new article showing the effect of a 2% swing, meaning Trump’s share down 1% and Clinton’s up 1%.
[word changes, added links in intro, added update]
Head chopped from source |
Washington Irving was a famous writer of the early 1800’s who is best known for his short stories. The Legend of Sleepy Hollow was based on the folklore that each Halloween a decapitated Hessian soldier, killed in the American Revolution, rises as a ghost, a nasty ghost, who searches for his lost head.
Today is Halloween and while Ken and I are not searching for any lost heads, we do believe it is a good day to think about scary stories.
It’s Halloween—variously Allhalloween, All Hallows’ Eve, or All Saints’ Eve—and we thought we would share some really scary results with you. It is the beginning of Allhallowtide, which includes All Saints’ Day on Nov. 1 and All Souls’ Day on Nov. 2.
Here are some of the top scariest results we can imagine happening on Halloween. May they not happen—may you and we get treats, not tricks.
A New Simple Group
A group of physicists at CERN have been working on a string theory in 1,729 dimensions. In using higher-order amplituhedra to remove infinities, they discovered a new large symmetric structure. They noticed that the group of this structure seemed interesting. And it was. Group theorists at Caltech have verified that it is not in the current list of simple groups.
Group theorists have divided into two “groups”: those looking for where the error occurred in the current proof, and those checking which applications of the classification still are correct theorems.
A Special Even Number
A teen in high school for her science project studied the curious family of numbers
$n$ where $n$ is even and both $n-1$ and $n+1$ are primes. She gave an analytical proof that if there are infinitely many such numbers then not all of them can be the sum of two primes. The proof is not elementary but is clever and seems correct. Number theorists are of course upset because this implies that either the Twin Prime conjecture or the Goldbach conjecture is false—and the proof doesn’t tell which.
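Taking the family to be the even numbers flanked by twin primes (an assumption on my part about the formula), a tiny brute-force sketch shows why the conjectures collide: if both Twin Prime and Goldbach hold, every member of the infinite family must be a sum of two primes, which her result forbids. The helper names are illustrative:

```python
# Illustration only: even n with both n-1 and n+1 prime, checked
# against Goldbach for small n. Helper names are hypothetical.
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def is_goldbach_sum(n: int) -> bool:
    """Is even n expressible as a sum of two primes?"""
    return any(is_prime(p) and is_prime(n - p) for p in range(2, n // 2 + 1))

family = [n for n in range(4, 200) if is_prime(n - 1) and is_prime(n + 1)]
print(family[:6])  # [4, 6, 12, 18, 30, 42]
print(all(is_goldbach_sum(n) for n in family))  # True
```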
Quasi-Gems
An infinite sequence of integers has been found such that the polynomials it defines recursively each have a surprisingly large number of distinct integer roots within a bounded range. For the slight disruption this could cause see this post.
Sam Bonwit
SIGACT just accepted a paper in advance for the 2017 STOC conference. The sole author is named Sam Bonwit. It extends several recent FOCS papers on theoretical aspects of machine learning (ML), beginning with a short proof of a previous very difficult theorem in ML. The committee has just discovered that the paper was created completely by a deep learning algorithm, with no human intervention.
A Complexity Result
The class L, logspace, has just been shown to be equal to NP. The proof seems right and of course solves the P=NP question. Besides the shock, the theory community is trying to see what can be saved from the past, since many conditional theorems are now gone. A very bad trick.
Nobel Less Oblige?
A group of physicists not at CERN have been working on a non-string theory in 4 spacetime dimensions. They have proved that for any universe in which the cosmological constant Lambda is not exactly zero, spacetime explodes with the intensity of quadrillions of hydrogen bombs per cubic nanometer per nanosecond. It is a consequence of the wave-particle duality for inflation. This has led other scientists to consider revising the statistical confidence for dark energy.
What are your scary results that you could imagine?
AIA source |
Louise Bethune was the first female professional architect in the United States, and possibly the world. She worked in Buffalo in the late 1800s through the early part of the 20th century.
Today we roll out ideas for an initiative on attracting women to computer science.
When Bethune worked there were undoubtedly no initiatives for attracting women into architecture. One imagines it was the opposite. Yet she designed the Hotel Lafayette, which is again going strong following a 2012 restoration to its original grandeur. She announced the formation of her own architectural firm shortly before marrying Canadian architect Robert Bethune, who partnered what became Bethune, Bethune & Fuchs. She was the first woman to be elected to the American Institute of Architects (AIA), first as an associate and a year later as a Fellow.
To judge by Wikipedia’s article on women in architecture, it is unclear whether any woman had a regular leading professional role in architecture before her. Nor do we know any stories from antiquity of female builders, say to compare with Hypatia in mathematics. Indeed if we broaden to all of engineering, the women listed here are first Hypatia, second—guess who?—and then the third and following ones are all contemporaries of Bethune.
I like to compare computing to architecture. In this post we hailed algorithmic tools as “erecting a New World [rather] than discovering one.” In departmental commencement speeches, I’ve said how Filippo Brunelleschi exemplified the ancient Greek maxim “Know Thyself” in his calculations of how his design for the dome of Florence’s cathedral would stand without buttresses—but how in our new kind of architecture it is more vital to “Document Thyself.” So I am happy to use Bethune as the local Buffalo face of my initiative, and the comparisons that follow are in no way meant to shade her.
I am writing under the prospect that America is about to elect our first woman president. Our Buffalo Bills this year hired Kathryn Smith as the first woman in a regular full-time NFL coaching position. This came a year after the NFL hired the first female referee and two years after Becky Hammon became the first female coach in the NBA.
I could go on with recent examples of “the first woman X.” What strikes me is that computing is singularly blessed with women who have been the first X, no qualifier:
Ada Lovelace, 1843: First Published Programming Paper, First Public Programs. As I wrote about her last year, she first translated a paper that had been written the previous year in French by Luigi Menabrea from his notes on Charles Babbage’s lectures in Turin on his Analytical Engine and example programs. Then she appended “Notes” that were twice as long and contained larger programs and analysis of them. The point of my article was to engage a scholarly consensus that shadows her contributions under Charles Babbage, by comparing it with the PhD advisor-student relationship today. After a detailed critical review I concluded that she deserves primacy on at least the completed Bernoulli numbers program (“Note G”) and on framing certain programming issues that still resonate today. This recently-updated fact-check chides those who call her “the first programmer”:
The problem is of course, that this version of the story omits Babbage’s programs written years before Ada’s similar, but more complex, program.
But it notes that in contrast to Babbage’s programs, hers as published are error-free. On balance it affirms my points, to which I’ll add her paper’s significance as the first example of open source. She is the “guess who” on the engineers list mentioned above.
The ENIAC programmers, 1940s: Betty Jean Bartik, Frances Holberton, Kathleen McNulty, Marlyn Meltzer, Frances Spence, and Ruth Teitelbaum. They were the programmers of the first publicly-known all-electronic computer. Holberton also developed one of the first automatic program generators, SORT/MERGE, around John von Neumann’s MergeSort algorithm, and Bartik joined the development of the BINAC I and UNIVAC computers. We can add Adele Goldstine, whose 1946 report on the ENIAC was the second computer manual after von Neumann’s famous EDVAC report the year before.
Grace Hopper, 1940s, 1950s, and later: First Compiler. Hopper started as a programmer on the classified Mark I machine in 1944, then in 1951 designed the first compiler for her A-0 programming system for the UNIVAC. This is linked with Holberton’s SORT/MERGE in Russell McGee’s recollection of early computers as “the germ of what would be some very important future developments” including the conception of COBOL. In 1952 she wrote a paper on her compiler, which was implemented by Margaret Harper and presaged an automatic program editor written by Adele Coss.
Mary Hawes, Jean Sammet, and Gertrude Tierney, c. 1960: Development of COBOL. They were all on the COBOL design committee and each took leading roles on stages of it. Sammet wrote the book Programming Languages: History and Fundamentals—in 1969.
Margaret Hamilton: First Software Engineer. There are others who might lay claim to that title, but she coined the term. We’d be interested to know whether her voluminous control code for the Apollo 11 moon mission became the first large-scale public demonstration of fault-tolerance when the code by her and Hal Laning handled an error condition from an incorrectly set switch during the landing phase.
Per this source, over 250,000 unique lines of code by Hamilton and team. |
Adele Goldberg, Smalltalk. She was an integral part of the Smalltalk-80 design committee. Shared firsts count too.
From related fields we could include Florence Nightingale for the first statistical inference from medical data in the 1850s—she is also counted as a founder of statistics—and Edith Clarke for the first patent of a graphical calculator, in 1925. We’ve also posted on Hedy Lamarr’s spread-spectrum invention. We’ll be happy to hear more readers’ favorites right up to the present day.
Indeed, many more female computing pioneers are listed here among other places. It is great to look up to them. However we need to look around us in the present on our campuses. What we see is women severely under-represented in the computing major. Whatever one’s opinions on why, the graphs showing women falling from 36% of CS majors in 1984 to 18% now are stunning, even after noting ups and downs in the CS major on the whole.
Discussions of why have ranged from the 1980s advent of home PCs being perceived as “boy toys” to failure to explain in childhood the career need of tech. But even the graphic for the latter’s argument shows a huge drop in high school and college. Thus there is a clear need for work at the college entry stage.
I side with those who decry the emphasis on gaming in intro CS courses. There seems a greater disparity between male and female participation in video gaming than I perceive in chess, at least at the top. For one instance, League of Legends—whose world championships will be watched by hundreds of thousands this weekend—had its first female pro player only last year, and she left earlier this year. One can find conflicting evidence regarding greater parity in the population overall, depending on level of self-identification and involvement. As with chess, this is set against a paradox. Computer gaming involves no physical attribute that segregates men’s and women’s sports, and there is no commanding evidence of separation in cognitive or reactive skills. Granting this means that other factors must be responsible for disparities.
We believe that the current diversification of computer science fields will enlarge the store of good examples. Game-building has always furnished good examples: it provides quick feedback and appreciation of results and a springboard for event-driven, real-time, objects-first, and even multi-threaded programming early on, besides procedural code. Recently we are seeing concerted efforts to diversify the examples, such as Google’s “Made With Code” and the approaches described at universities here. It still remains to connect the examples to career enthusiasm and opportunity. That’s where the theme of the initiative described here comes in.
The fluidity of tech makes leadership all the more a necessity—as implied when the ethic of the “startup of you” is widened to cover all college graduates. So I—and many of my colleagues—felt it would be helpful to invite female Distinguished Speakers leading up to our department’s 50th anniversary celebration next year who can show their pioneering work in several of these diverse fields. The idea is to showcase the many opportunities in our extraordinarily diversifying field for creative design and initiative, both of which imply being the first at something.
We are proud to have the following speakers and talks in our 50th anniversary academic year. All of the talks are free and open to the public, and are held on Thursdays, 3:30–4:30pm, in the UB Student Union Theater.
The series will be capped by Mary Jane Irwin, Penn State, speaking during our 50th anniversary celebration itself on September 28–30, 2017. She is renowned for computer architecture and its interface to software systems. One of her projects involving both is SPARTA, for “Simulation of Physics on a Real-Time Architecture.” Her National Academy of Engineering citation lists VLSI Architecture and Automated Design, and by this we will bring the analogy with building architecture full-circle.
Open Problems
Do you feel this approach will meld with the ones mentioned in our linked stories that are giving results? Do leadership and creative pioneering work as themes by which to bridge from the heroes we’ve listed to today’s workplace reality? Further suggestions are most welcome.
Some football wisdom from Dick Karp
Cropped from S.I. Kids source |
John Urschel is a PhD student in the Applied Mathematics program at MIT. He has co-authored two papers with his Penn State Master’s advisor, Ludmil Zikatanov, on spectra-based approximation algorithms for the NP-complete graph bisection problem. A followup paper with Zikatanov and two others exploited the earlier work to give new fast solvers for minimal eigenvectors of graph Laplacians. He also plays offensive guard for the NFL Baltimore Ravens.
Today Ken and I wish to talk about a new result by the front linesman of NP-completeness, Dick Karp, about football.
Karp—Dick—is a huge fan of various sports. Recently at Avi Wigderson’s birthday conference, held at Princeton’s IAS, he told me a neat new result on how to play football. It concerns a basic decision that a coach faces after his team scores a touchdown: kick for one extra point or attempt a riskier “two-point conversion” play.
We wonder whether players like Urschel might be useful for blocking out these decisions, more than blocking and opening “holes” for runners. By the way Urschel, who attended Canisius High School in Buffalo before his B.S. and M.S. plus a Stats minor at Penn State, is no stranger to IAS. Here he is with my friend Peter Sarnak there last year:
AMS Notices feature source |
Thus he is doing all the things we would advise a young theorist or mathematician to do: publish, circulate, talk about interesting problems, get on a research team, and open up avenues for deep penetration and advances by teammates.
After a touchdown is made the team scoring gets 6 points. It then has an option:

1. Kick for one extra point, or
2. Run a single play from close range, which scores 2 points if the ball is advanced into the end zone and nothing otherwise.
Traditionally the right call in most game situations is (1). Usually the kicker can make the ball go through the posts most of the time, while getting the ball into the end zone is much more difficult. Of course at the end of a game there may be reasons to try for the 2 points. If the game is about over and you need 2 points to tie, that is probably the best play.
Karp set up a basic model of this decision. His model is a bit idealized, but I expect that he can and will make it more realistic in the future. The version I heard over a wonderful dinner that Kathryn Farley, my wife, set up for a small group of friends was not the proper venue to go into various technical details. So I will just relay his basic idea.
In his model we make several assumptions:

1. The kick always succeeds and is worth exactly 1 point.
2. The two-point play succeeds with probability exactly 1/2.
3. The game continues indefinitely, with unboundedly many touchdowns to come.
The last clause means that we’re modeling the choice by an infinite walk. If you wish you may subtract 1 so that a kick gives 0, a successful play gives +1, but an unsuccessful one gives -1. Karp’s question is this:
What should the coach do? Always kick or sometimes go for two?
You might think about it before reading on for Karp’s answer.
His insight is this:
Theorem 1 (Fundamental Theorem of Football?) The optimal strategy is initially always to go for two. If after some number $n$ of tries you have succeeded $k$ times with $2k > n$, so that you are ahead of what kicking would have brought, switch over to kicking.
Ken’s first reaction was to note a difference from “gambler’s ruin” which means to double-down after every lost 50-50 bet. In football this would mean that after you missed one conversion play, the next try would bring +3 points on success but subtract 1 from your score if you missed. Next time you could go for 5 but failure would cost 3. If you think in the +1/-1 terms compared to 0 for kicking, then this is the classic martingale system of doubling the bet 1,2,4,8… until you win and net +1. The ruin for gamblers is the chance of swiftly going bankrupt—but in football you can only lose one game.
However, we are not allowing doubling the bet either. It’s the classic random walk situation: right or left along the number line with equal probability, except that you can elect to stop and stay where you are. With probability 1, such a walk starting at 0 will reach a stage in positive territory at +1 net, and then we stop.
A real game, of course, does not have unboundedly many touchdowns—though some college games I’ve watched have sure felt like it. So if you miss a few two-point tries and the game is deep into the second half, you’re left holding the bag of foregone points.
The question comes down to, what is the utility of nosing ahead by the extra point when you succeed, compared to being down more when you fail? How likely are game scenarios where that one extra point is the decider? To be concrete, suppose you score 3 touchdowns in the first three quarters. Following Karp’s strategy nets you an extra point 5/8 of the time: succeed the first time then kick twice, or fail the first time and succeed twice. Just 1/4 of the time you’ve lost a point, but 1/8 of the time you’re net -3 and need an extra field goal to get back to par.
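The 5/8, 1/4, and 1/8 figures can be checked by enumerating the eight equally likely success/failure sequences for three touchdowns under the switch rule; this sketch tracks points relative to always kicking (the function name is my own):

```python
from itertools import product
from collections import Counter

def karp_net(flips):
    """Net points versus always kicking, under Karp's rule:
    go for two until strictly ahead of the kicking baseline."""
    net = 0
    for success in flips:
        if net >= 1:                  # ahead: switch to kicking forever
            break                     # kicks match the baseline exactly
        net += 1 if success else -1   # +2 vs. the kick's +1, or 0 vs. +1
    return net

dist = Counter(karp_net(f) for f in product((False, True), repeat=3))
# Out of 8 equally likely sequences: +1 five times (5/8),
# -1 twice (1/4), and -3 once (1/8).
print(sorted(dist.items()))  # [(-3, 1), (-1, 2), (1, 5)]
```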
There are late-game situations where the extra point is worth so much that it pays to go for two even with a chance $p$ of success that is under 40%. Suppose you are down 14 points and score a touchdown with 4 minutes left. You have enough time to stop the other team and get the ball back again for one more drive, but that’s basically it. You have to assume you will score a touchdown on that drive too, so the only variable is the conversion decision. The issue is that if you kick now and kick again, you’ve only tied the game and have a 50% win chance in overtime. Whereas if you go for two you have a $p$ chance of winning outright based on this figuring, plus if you fail you can still tie and get to overtime after your next TD. Thus your win expectation is
$p + \tfrac{1}{2}p(1-p)$,
which crosses 50% when
$p = \tfrac{3-\sqrt{5}}{2} \approx 0.382$.
When $p = \tfrac{1}{2}$ you have a $\tfrac{5}{8}$ expectation by going for two. Yet for human reasons, this is not on the standard chart of game situations calling for a two-point try. The human bias is toward maximizing your chances of “staying in the game” which is not the same as maximizing winning. There was a neat analysis of a similar situation in chess last year.
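Under my reading of the figuring above (win outright with probability $p$, or else reach a coin-flip overtime by converting after the next touchdown), the break-even point can be computed directly; the quadratic comes from setting the expectation equal to 1/2:

```python
import math

def win_prob(p: float) -> float:
    """Win expectation when down 14: convert now and win (prob p),
    or miss now, convert next time, and take 50-50 overtime."""
    return p + 0.5 * p * (1 - p)

# Setting win_prob(p) = 1/2 gives p^2 - 3p + 1 = 0, whose relevant
# root is (3 - sqrt(5)) / 2, just under 40%.
threshold = (3 - math.sqrt(5)) / 2
print(round(threshold, 3))   # 0.382
print(win_prob(0.5))         # 0.625, i.e. 5/8
```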
The challenging question to deepen Karp’s insight is, how far can we sensibly broaden this kind of analysis? Does this observation apply in real games? It seems to call for a big modeling and simulation effort, guided by theory where we might vary Karp’s simple rule and adjust for different probabilities of success on both the two-point tries and the kicks. This would bring it into the realm of machine learning with high-dimensional data, and per remarks on Urschel’s MIT homepage, perhaps he is headed that way.
What do you think of the insight for football strategy? We could also talk about when (not) to punt…
Urschel missed this season’s first three games but is starting at left guard right now for the Ravens, who are beating up on my Giants 10-0 as we go to post. Oh well. Of course he reminds me I once took classes from an NFL quarterback who similarly “went for two” in football and mathematics. We wish Urschel all the best and will follow his careers with interest. Enjoy the games today and tomorrow night.
[de-LaTeXed numerals for better look]
Jamie Morgenstern is a researcher into machine learning, economics, and especially mechanism design.
Today Ken and I would like to discuss a joint paper of hers on the classic problem of matching schools and students.
The paper is titled, “Approximately Stable, School Optimal, and Student-Truthful Many-to-One Matchings (via Differential Privacy).” It is joint with Sampath Kannan, Aaron Roth, and Zhiwei Wu. Let’s call this paper (KMRW). She gave a talk on it at Georgia Tech last year.
There are various instances of matching-type problems, but perhaps the most important is the NRMP assignment of graduating medical students to their first hospital appointments. In 2012 Lloyd Shapley and Alvin Roth—Aaron’s father—were awarded the Nobel Prize in Economics “for the theory of stable allocations and the practice of market design.”
The original matching problem was that of marrying females and males, on which we just posted in connection with a recent joint paper of Noam Nisan. Here is a standard description of the stable matching problem (SMP):
Assume there are n men and n women. Further, each has a linear ranking of the members of the opposite sex. Find a way to marry all the men and all the women so that the assignment is stable. This means that no two people would both rather marry each other instead of the partners they are assigned to.
In 1962, David Gale and Shapley proved that every SMP instance has a solution, and even better, they gave a quadratic-time algorithm that finds it. SMP as stated is less practical than the single-party “roommates” problem that results if the men are also allowed to marry each other, likewise the women, with everyone ranking everyone else as partners. But in the same famous paper, Gale and Shapley solved an even more realistic problem:
The hospitals/residents problem—also known as the college admissions problem—differs from the stable marriage problem in that the “women” can accept “proposals” from more than one “man” (e.g., a hospital can take multiple residents, or a college can take an incoming class of more than one student). Algorithms to solve the hospitals/residents problem can be hospital-oriented (female-optimal) or resident-oriented (male-optimal). This problem was solved, with an algorithm, in the original paper by Gale and Shapley, in which SMP was solved.
Let’s call this the college admissions problem (CAP).
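The deferred-acceptance algorithm behind both SMP and CAP can be sketched in a few lines. Here is a minimal one-to-one version with our own illustrative names, assuming complete preference lists on both sides:

```python
def gale_shapley(men_prefs, women_prefs):
    """Men-proposing deferred acceptance; returns a stable man->woman matching.

    men_prefs[m] lists women in m's order of preference; women_prefs[w] likewise.
    Assumes complete preference lists on both sides.
    """
    # rank[w][m] = position of m on w's list, for O(1) comparisons
    rank = {w: {m: i for i, m in enumerate(prefs)} for w, prefs in women_prefs.items()}
    free = list(men_prefs)              # men not yet engaged
    next_proposal = {m: 0 for m in men_prefs}
    fiance = {}                         # woman -> man currently holding her
    while free:
        m = free.pop()
        w = men_prefs[m][next_proposal[m]]
        next_proposal[m] += 1
        if w not in fiance:
            fiance[w] = m
        elif rank[w][m] < rank[w][fiance[w]]:
            free.append(fiance[w])      # w trades up; her old fiance is free again
            fiance[w] = m
        else:
            free.append(m)              # w rejects m
    return {m: w for w, m in fiance.items()}

men = {"m1": ["w1", "w2"], "m2": ["w1", "w2"]}
women = {"w1": ["m2", "m1"], "w2": ["m1", "m2"]}
print(gale_shapley(men, women))  # {'m2': 'w1', 'm1': 'w2'} -- stable
```

Each man proposes at most once to each woman, so the loop makes at most n² proposals, which is the quadratic bound. The proposing side ends up with its optimal stable matching, which is the “pole” issue discussed below.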
Consider adding another condition to CAP. Stability is clearly an important requirement—without it there would be students and colleges that would prefer to swap their choices. Avoiding such a situation is quite nice: stability does not force the assignment to be perfect in any sense, but it does make it at least a “local” optimum. This is important, and it is used in real assignment problems today.
Yet there is a possibility that students, for example, could “game” the system. What if a student submitted a strange ordering of schools that they claim they wish to attend, when the list is not really what they want? Why would they do this? Simply because they may be able to lie about their preferences to ensure that they get their top choice.
The basic point seems easiest to illustrate in the non-gender, single-party case, so let’s say four people A, B, C, D need to form two pairs. They have circular first choices A→B→C→D→A and second choices in the opposite circle A→D→C→B→A:
Let’s say the pairing {A–B, C–D} happens. B and D got their second choices, but it is not to their advantage to room together. They could have had their first choices under the also-stable configuration {A–D, B–C}, but there was no way to force the algorithm to choose it. However, let’s suppose B and D lie and declare each other to be their second choices:
The algorithm given these preferences sees {A–B, C–D} as unstable since B would join with D, and the resulting {B–D, A–C} as unstable because A and D would prefer each other. The resulting stable {A–D, B–C} now gives B and D their first choices. Given what they may have known about A’s and C’s preference lists, there was no danger they’d actually have to pair up. Examples in the two-gender and student-college cases need six actors but the ideas are similar.
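The manipulation can be verified by brute force over the three possible pairings of four agents. The instance below is our own reconstruction, assuming four roommates with circular first choices A→B→C→D→A and second choices around the opposite circle:

```python
from itertools import combinations

# Preference strings list each agent's choices in order, best first.
TRUE = {"A": "BDC", "B": "CAD", "C": "DBA", "D": "ACB"}
# B and D lie, each declaring the other as their second choice:
LIED = dict(TRUE, B="CDA", D="ABC")

# The three perfect matchings on four agents:
MATCHINGS = [{"A": "B", "B": "A", "C": "D", "D": "C"},
             {"A": "C", "C": "A", "B": "D", "D": "B"},
             {"A": "D", "D": "A", "B": "C", "C": "B"}]

def stable(match, prefs):
    """Stable iff no two agents both prefer each other to their assigned partners."""
    for x, y in combinations("ABCD", 2):
        if (prefs[x].index(y) < prefs[x].index(match[x]) and
                prefs[y].index(x) < prefs[y].index(match[y])):
            return False
    return True

print([m for m in MATCHINGS if stable(m, TRUE)])  # {A-B, C-D} and {A-D, B-C}
print([m for m in MATCHINGS if stable(m, LIED)])  # only {A-D, B-C}
```

Under the true preferences two stable pairings exist; under the lie, only the one giving B and D their genuine first choices survives, so the algorithm is forced to pick it.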
The basic Gale-Shapley algorithm for the two-party case actually has two “poles” depending on which party has its preferences attended to first. The current NRMP algorithm favors the applicants first. In college admissions one can imagine the schools going first. Alvin Roth proved that the poles had the following impact on truthfulness:
Theorem 5. In the matching procedure which always yields the optimal stable outcome for a given one of the two sets of agents (i.e., for the men or for the women), truthful revelation is a dominant strategy for all the agents in that set.
Corollary 5.1. In the matching procedure which always yields the optimal stable outcome for a given set of agents, the agents in the other set have no incentive to misrepresent their first choice.
A dominant strategy here means a submitted list of preferences that the individual has no motive to change under any combination of preferences submitted by the other agents (on either side). The upshot is that the favored side has no reason to deviate from their true preferences, but the non-favored side has motive to spy on each other (short of colluding) and lie about their second and further preferences.
There has been much research on improving fairness by mediating between the “poles,” but this has not solved the truthfulness issue. Can we structure the algorithm so that it still finds stable configurations but works in such a way that both parties have an incentive to be truthful and use their real orderings?
Unfortunately, Roth’s same paper proved that there is no method to solve SMP or CAP that is both stable and truthful. Only the above conditions on one side or the other are possible.
Roth’s negative result reminds me of the famous dialogue toward the end of the 1992 movie A Few Good Men:
Judge Randolph: Consider yourself in contempt!
Kaffee: Colonel Jessup, did you order the Code Red?
Judge Randolph: You don’t have to answer that question!
Col. Jessup: I’ll answer the question!
[to Kaffee]
Col. Jessup: You want answers?
Kaffee: I think I’m entitled to them.
Col. Jessup: You want answers?
Kaffee: I want the truth!
Col. Jessup: You can’t handle the truth!
However, a recurring theme in theory especially since the 1980s is that often we can work around such negative results by relaxing them: allowing approximation and randomization.
Recall that in the college admissions case where the schools act first, the schools have no motive to do other than declare their true preferences, but the students might profit if they submit bogus second, third, and further preferences to the algorithm. The new KMRW paper states the following:
We present a mechanism for computing asymptotically stable school optimal matchings, while guaranteeing that it is an asymptotic dominant strategy for every student to report their true preferences to the mechanism. Our main tool in this endeavor is differential privacy: we give an algorithm that coordinates a stable matching using differentially private signals, which lead to our truthfulness guarantee. This is the first setting in which it is known how to achieve nontrivial truthfulness guarantees for students when computing school optimal matchings, assuming worst-case preferences (for schools and students) in large markets.
By “approximately stable” the KMRW paper means a relaxation of the third of three conditions on a function that maps students to colleges plus the “gap year” option. If a student leaves a college off his or her application list, that is read as the student preferring the gap year to that college. Symmetrically, a college may have an admission threshold that some students fall below.
The relaxation is that for some fixed α, colleges may hold off on admitting qualified students who might prefer them if they are within α of their capacity:
There is also an approximation condition on the potential gain from lying about preferences. This requires postulating a numerical utility u_s(c) for student s to attend college c that is monotone in s’s true preferences. Given η > 0, say the preference list P submitted by s is η-approximately dominant if for all other lists P′ by s—and all lists submitted by the other students, which induce the matchings μ with P and μ′ with P′—we have u_s(μ(s)) ≥ u_s(μ′(s)) − η.
In fact, KMRW pull the randomization lever by stipulating the bound only on the difference in expected values over matchings produced by the randomized algorithm they construct. Then this says that no student can expect to gain more than η utility by gaming the system.
Confining the utility values to [0,1] ensures that any η > 0 is larger than some differences if the number of colleges is great enough, so this allows some slack compared to strict dominance, which holds if η = 0. It also enables the utility values to be implicitly universally quantified in the condition by which α and η “mesh” with n, the number of students, and the minimum capacity of any college:
Theorem 1 Given α and η, there are universal constants and an efficient randomized algorithm that with high probability produces a feasible α-approximately stable solution in which submitting the true preferences is η-approximately dominant in expected value for each student, provided that the number of students and the minimum college capacity are sufficiently large relative to 1/α and 1/η.
This is an informal statement. The mechanism underlying the proof also works for α and η not fixed: provided the minimum capacity grows quickly enough with the number of students, the approximate stability and probable-approximate truth-telling dominance can be achieved with α and η both tending toward 0. It is neat that the approximations are achieved using concepts and tools from differential privacy, which we have posted about before. By analogy with PAC learning, we might summarize the whole system as being “probably approximately truthful.”
The notion of truth is fundamental. In logic the notion of truth as applied to mathematical theories is central to the Incompleteness Theorems of Kurt Gödel and the undefinability theorem of Alfred Tarski.
In economics the notion of truth is different but perhaps even more important. Imagine any set of agents that are involved in some type of interaction: it could be a game, an auction, or some more complex type of interaction. Typically these agents make decisions, which affect not only what happens to them, but also to the other agents. In our roommates example above, the further point is that “gaming the system” by the two liars lowered the utility for the other two agents. But those two could have tried the same game.
The effect on others highlights that this is a basic problem. It seems best when the interaction does not reward agents that essentially “lie” about their true interests. This speaks to our desire for a system that rewards telling the truth—and takes the subject into the area of algorithmic mechanism design, which we also featured in the post on Nisan. Indeed truth is addressed in his Gödel Prize-winning paper with Amir Ronen, for instance defining a mechanism to be strongly truthful if truth-telling is the only dominant strategy.
That paper follows with a sub-sub-section 4.3.1 on “basic properties of truthful implementations”—but what I’m not finding in these papers is a theorem that tells me why truthfulness is important in economic interactions. It sounds self-evident, but is it? There are many papers that show one cannot force agents to be truthful, and there are other results showing cases in which the agents’ individual best interests are to be truthful after all. I understand why a solution to a matching problem should be stable, but am not convinced that it needs to be truthful. In mathematics we can define a property that we add to restrict solutions to some problem, but we usually need to justify the property. If we are solving an equation, we may restrict the answers to be integers. The reason could be as simple as non-integer answers making no sense, such as buying 14.67 cans of Sprite for a party.
I get that being truthful does stop some behavior that one might informally dislike. What I feel as an outsider to the research into matching problems is simple: where is the theorem that shows that adding truthful behavior has some advantage? It is true in the analogous case of auctions that they can be designed so that truthful bidding is provably a dominant strategy, and plausibly this matters to competitors agreeing to and paying for the auction mechanism. Perhaps there is a “meta” game level where mechanism designs are strategies and there is a communal payoff function, in which truth-inducing mechanism designs may be optimal strategies. But overall I am puzzled. Perhaps this just shows how naive I am about this line of work.
What are your reactions on the importance of inducing truth-telling? To show at least dewy diligence on our part, here are a few good references. What would Gödel—who conveyed certain of his own mechanism design opinions to his friend Oskar Morgenstern (no relation)—say?
The winner of the 2016 ACM-IEEE Knuth Prize
Coursera source |
Noam Nisan has been one of the leaders in computational complexity and algorithms for many years. He has just been named the winner of the 2016 Donald E. Knuth Prize.
Today we congratulate Noam and highlight some of his recent contributions to algorithmic economics and complexity.
I (Dick) think that Noam missed out on proper recognition in the past. I am thrilled that he is finally recognized for his brilliant work. I do differ with the impression given by ACM’s headline. His game theory work is clearly important, but his early work on pseudorandom generators was of first order. And personally I wonder if that almost alone would be enough to argue that Noam is one of the great leaders in complexity theory—let alone the work on communication complexity and interactive proofs.
I (Ken) had forgotten that he was the ‘N’ in the LFKN paper with Carsten Lund, Lance Fortnow, and Howard Karloff proving that interactive proofs can simulate the polynomial hierarchy, which presaged the proof of IP = PSPACE. The first sentence of the ACM article does list these three complexity areas before mentioning game theory. There are others: Noam co-authored with Nathan Linial and Yishay Mansour a seminal 1993 paper that promoted Fourier analysis to study Boolean circuits, and with Mario Szegedy a 1994 paper on the polynomial method that impacted a host of topics including Boolean function sensitivity, decision trees, and quantum query lower bounds.
In what Lance once termed “walking away” from complexity, Noam in the mid-1990s became interested in algorithmic economics. Really we should say it is the interface between complexity and economics. We can say that theory—in particular mathematical game theory—frames much of this interface, but the business end of it is social. The genius we note in his seminal paper with Amir Ronen, titled “Algorithmic Mechanism Design,” is in mapping the social elements to structures in distributed computation and communication theory (routing and protocols as well as communication complexity) that were already developed. This paper was one of three jointly awarded the 2012 Gödel Prize. Regarding it, the Knuth citation says:
A mechanism is an algorithm or protocol that is explicitly designed so that rational participants, motivated purely by their self-interest, will achieve the designer’s goals. This is of paramount importance in the age of the Internet, with many applications from auctions to network routing protocols. Nisan has designed some of the most effective mechanisms by providing the right incentives to the players. He has also shown that in a variety of environments there is a tradeoff between economic efficiency and algorithmic efficiency.
The last part of this addresses a question of more general import to us in theory:
How much impact can complexity lower bounds and conditional hardness relationships have in the real world?
The impact is helped along when the lower bounds come from communication complexity. Bounds on the number of bits that must be communicated to achieve a common goal (with high probability) are generally more concrete and easier to establish than those in the closed system of classical complexity theory, whose internal nature makes the most central bound questions asymptotic.
Noam long ago co-authored the textbook Communication Complexity with Eyal Kushilevitz, and also co-edited the 2007 text Algorithmic Game Theory with Tim Roughgarden, Éva Tardos, and Georgia Tech’s own Vijay Vazirani. The whole package of great breadth as well as depth in foundational areas, imbued with computer science education, ably fits the Knuth Prize profile. But we are going to make a forward-looking point by exploring how some of his recent joint papers reach a synthesis that includes some implications for complexity.
“Public Projects, Boolean Functions, and the Borders of Border’s Theorem,” with Parikshit Gopalan and Tim Roughgarden, in the 2015 ACM Conference on Economics and Computation. Border’s Theorem is an instance where the feasible space of an exponential-sized linear program has a natural projection onto the space of a polynomial-sized one. The linear programs express allocations of bid-for goods to bidders according to their expressed and private valuations of those goods. The paper shows complexity obstacles to extending this nice exponential-to-polynomial property to other economic situations.
The overall subject can be read about in two sets of lecture notes by Roughgarden. We adapt an example from section 2 of this lecture to give the flavor: Suppose you have 2 items for sale and one prospective buyer such that for each item there is a 50-50 chance the buyer is willing to pay $1 for it, but you lose $1 for each item you fail to sell (call it shelving cost). If you price each item at $1, you will expect to make one sale, reaping $1 but also losing $1 on the unsold item, for zero expected net revenue. That’s no better than if you gave each away for $0. If you set the price in-between, say $0.50, you will expect a net loss—because the buyer’s probability distribution of value is discrete not smooth. But if you bundle them at 2-for-$1, you will expect to sell the bundle three-fourths of the time, for expected net revenue ($1 + $1 + $1 – $2)/4 = $0.25, which is positive.
Our change from Roughgarden’s prices and values of $1, $2 is meant to convey that problems about (random) Boolean functions and set systems are lurking here. Suppose we have n items; what is the optimal bundling strategy? Grouping four items as two $1-for-2 bundles expects to net the same $0.125 per item as above. But grouping $1-for-3 expects to reap $1 seven-eighths of the time and lose $3 one-eighth of the time, for $4/8 = $0.50 net for 3 items, which gives a slightly better $0.167 per item. Is this optimal? We could write a big linear program to tell—also in situations with other prices and distributions of buyer values and conditions.
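The bundling arithmetic is easy to check by brute force over all buyer-value profiles. This sketch assumes the model above: each item is independently worth $1 to the buyer with probability 1/2, there is a $1 shelving cost per unsold item, and the buyer takes a bundle exactly when its total value covers the price.

```python
from itertools import product

def expected_net_per_item(k, price, shelving=1.0):
    """Expected net revenue per item when k items are bundled at the given price.

    Each item is worth $1 to the buyer with probability 1/2; the buyer takes the
    bundle iff its total value is at least the price, and each unsold item costs
    `shelving` dollars.
    """
    total = 0.0
    for values in product([0, 1], repeat=k):   # all 2^k equally likely value profiles
        if sum(values) >= price:
            total += price                     # bundle sells
        else:
            total -= shelving * k              # whole bundle goes unsold
    return total / (2 ** k) / k

print(expected_net_per_item(2, 1))  # 0.125 per item ($0.25 net on 2 items)
print(expected_net_per_item(3, 1))  # ≈ 0.1667 per item ($0.50 net on 3 items)
```

Sweeping this function over bundle sizes and prices is exactly the kind of enumeration the big linear program would systematize.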
In several of these situations, they show that a Border-like theorem would put a #P-hard problem into the polynomial hierarchy and hence collapse the hierarchy. A technical tool for later results in the paper is the expectation of a Boolean function f over a random assignment and, for each variable, the expectation of f times that variable under the ±1 encoding; these constitute the zeroth- and first-degree coefficients of the Fourier transform of f. Certain #P-hard problems about vectors of these coefficients are related to economics problems for which smaller LPs would likewise trigger a collapse.
“Networks of Complements,” with Moshe Babaioff and Liad Blumrosen, in ICALP 2016. Imagine a situation where buyers will only buy pairs of items in bundles. We can make an undirected graph whose nodes are the items and each edge is weighted by the price the buyer is willing to pay. Graph theory supplies tools to analyze the resulting markets, while behavior for special kinds of graphs may yield insights about the structure of problems involving them.
“A Stable Marriage Requires Communication,” with Yannai Gonczarowski, Rafail Ostrovsky, and Will Rosenbaum, in SODA 2015. The input consists of n permutations giving the men’s preference orders over the n women and n permutations giving the women’s preference orders over the men. A bijection between the men and women is unstable if there exist a man and a woman who each prefer the other to their assigned partners—that is, the woman prefers the man to her own husband and the man likewise prefers the woman to his own wife; otherwise it is stable. There always exist “marriage functions” that are stable, but the problem is to find one.
This is a classic example of a problem whose best-known algorithms run in time order-of n² in the worst case, but order-of n log n in the average case for permutations generated uniformly at random. This random model may not reflect the “practical case,” however, so the worst-case time merits scrutiny for possible improvement, perhaps under some helpful conditions. One avenue of improvement is not to require reading the whole input, whose size when the permutations are listed out is already order-of n² (ignoring a log factor by treating elements of {1, …, n} as unit size), but rather to allow random access to selected bits. These queries can be comparisons or more general. The paper finds a powerful new reduction from the communication complexity of disjointness (if Alice and Bob each hold an n-bit binary string, verify there is no index where each string has a 1) into preference-based problems to prove the query number must be order-of n² even for randomized algorithms and even with certain extra helpful conditions.
“Smooth Boolean Functions are Easy: Efficient Algorithms for Low-Sensitivity Functions,” with Parikshit, Rocco Servedio, Kunal Talwar, and Avi, in the 2016 Innovations conference.
The sensitivity s = s(f) of (a family of) n-variable Boolean functions f is the maximum over inputs x of the number of y adjacent to x such that f(y) ≠ f(x). Here adjacent means x and y differ in one bit. The functions are smooth if s stays small as n grows, say s = O(log n) or, more liberally, s = polylog(n). One way to achieve sensitivity at most s is for f to be computed by a depth-s decision tree, since along each branch determined by an input x there are at most s bits on which the branch’s assigned value depends. Noam, first solo and then in the 1994 paper with Szegedy, conjectured a kind of converse: there is a universal constant c such that every f has a decision tree of depth s(f)^c.
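The definition is easy to exercise by brute force on small n; the example functions below are standard ones of our own choosing, not from the paper.

```python
from itertools import product

def sensitivity(f, n):
    """Max over inputs x of the number of Hamming neighbors y with f(y) != f(x)."""
    best = 0
    for x in product([0, 1], repeat=n):
        flips = sum(f(x[:i] + (1 - x[i],) + x[i + 1:]) != f(x) for i in range(n))
        best = max(best, flips)
    return best

n = 4
print(sensitivity(lambda x: int(any(x)), n))  # OR: sensitivity n, at the all-0 input
print(sensitivity(lambda x: sum(x) % 2, n))   # parity: sensitivity n everywhere
print(sensitivity(lambda x: x[0], n))         # dictator: sensitivity 1, tree depth 1
```

The dictator function shows the easy direction (decision-tree depth 1 forces sensitivity 1), while OR and parity show that full sensitivity can coincide with full decision-tree depth.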
Decision trees are broadly weaker in corresponding depth and size measures than Boolean circuits, which in turn are more powerful than formulas; hence upper bounds via decision trees are the strongest kind. Although belief in Noam’s conjecture has grown, no one had determined upper bounds for circuits and formulas. The paper gives circuits of size exponential in s, times a factor polynomial in n, and formulas of depth order-of s log n with somewhat laxer size. The sizes are scarcely polynomial, but they’re a start, and when s is polylog(n), the size is quasipolynomial and the depth is polylog. This most-recent paper will hopefully jump-start work on proving the conjecture and fundamental work on Boolean functions.
Of course we have given only a cursory overview intended to attract you to examine the papers in greater detail, even skipping some of their major points which are welcome in comments.
We congratulate Noam warmly on the prize and his research achievements.
We can announce one new thing: LEMONADE. No it is not Beyoncé’s latest album but rather a new podcast series by Dick and Kathryn on positive approaches to problematic issues in academia, culture, and life in general.
[fixed “size”->”depth” in statement of sensitivity conjecture]