UK Independent source—and “a gentle irony”
Roger Bannister is a British neurologist. He received the first Lifetime Achievement Award from the American Academy of Neurology in 2005. Besides his extensive research and many papers in neurology, his 25 years of revising and expanding the bellwether text Clinical Neurology culminated in his being added as co-author. Oh, by the way: he is that Bannister, the first person timed under 4:00 in a mile race.
Today I cover another case of “Big Data Blues” that has surfaced in my chess work, using a race-timing analogy to make it general.
Sir Roger also served as Master of Pembroke College, one of the constituent colleges of Oxford University. He was one of three august Rogers with whom I interacted about IBM PC-XT computers when the machines were installed at Oxford in 1984–1985. Sir Roger Penrose was among trustees of the Mathematical Institute’s Prizes Fund who granted support for my installation of an XT-based mathematical typesetting system there, a story I’ve told here. Roger Highfield and his secretary used an XT in my college’s office, and I was frequently called in to troubleshoot. While drafting this post last month, I received a mailing from Sir Martin Taylor saying that Dr. Highfield had just passed away—from his obit one can see that he, too, received admission to a royal order.
Dr. Bannister was interested in purchasing several XTs for scientific as well as general purposes at Pembroke. At the time, numerical performance required purchasing a co-processor chip, adding almost $1,000 to what was already a large outlay per machine by today’s standards. I wish I’d thought to say in a quick deadpan voice, “Let it run four minutes and it will give you a mile of data.” (Instead, I think the 1954 race never came up in our conversation.) Today, however, data outruns us all. How to keep control of the pace is our topic.
Roger Bannister 50-year commemorative coin. Royal Mint source.
As shown above in the commemorative coin’s design, the historic 3:59.4 time was recorded on stopwatches. We’ll stay with this older timing technology for our example.
Suppose you have a field of 200 milers. Suppose you also have a box of 50 stopwatches. For each runner you pick a stopwatch at random and measure his/her time. You get results that closely match the histogram of times that were recorded for the same runners in trials the previous day.
How good is this? You can be satisfied that the box of watches does not have a systematic tendency to be slow or to be fast for runners at that mix of levels. Projections based on such fields are valid.
The rub, however, is that you could have gotten your nice fit even if each individual watch is broken and always returns the same time. Suppose your field included Bannister, John Landy, Jim Ryun, and Sebastian Coe, with each in his prime. They would probably average close to 3:55. Hence if one of the 50 watches is stuck on 3:55, it will fit them well. It doesn’t matter if you actually draw the watch when measuring the last-place finisher. The point is that you expect to draw the watch 4 times overall and are fitting an aggregate.
Indeed, you only need the distribution of the (stopped) watches to match the distribution of the runners under random draws. You may measure a close fit not only in the quantiles but also the higher moments, which is as good as it gets. Your model may still work fine on tomorrow’s batch of runners. But at the non-aggregate level, what it did in projecting an individual runner was vapid.
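The stopwatch analogy can be made concrete in a short simulation. This is a sketch with invented numbers: the field of runners, the stuck-watch readings, and the random seed are all assumptions for illustration.

```python
import random
import statistics

random.seed(42)

# Hypothetical field of 200 milers (times in seconds)
runners = [random.gauss(285, 20) for _ in range(200)]

# 50 stopped watches, each stuck on a quantile of yesterday's distribution,
# so the readings match the field's histogram even though no watch runs
sorted_times = sorted(runners)
watches = [sorted_times[4 * i + 2] for i in range(50)]

# Each runner is "timed" by a randomly drawn stuck watch
readings = [random.choice(watches) for _ in runners]

# Aggregate statistics fit closely...
print(round(statistics.mean(runners), 1), round(statistics.mean(readings), 1))
print(round(statistics.stdev(runners), 1), round(statistics.stdev(readings), 1))

# ...but the per-runner projections are vapid
errors = [abs(r, ) if False else abs(r - m) for r, m in zip(runners, readings)]
print(round(statistics.mean(errors), 1))
```

The means and spreads agree well, while the typical error for an individual runner stays on the order of the field's whole standard deviation.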
Here is a hypothetical example from a predictive-analytics domain other than my chess work. Consider a model used by a home insurance company to judge the probabilities of damage by earth movement, wind, fire, or flood and price policies accordingly. I’ve seen policies with grainy risk-scale levels that apply to several hundred homes in a given area at one time. The company only needs good performance on such aggregates to earn its profit.
But suppose the model were fine-grained enough to project probabilities on individual homes. And suppose it did the following:
This is weird but might not be bad. If the risks average out over several hundred homes, a model like this might perform well—despite the consternation homeowners would feel if they ever saw such individual projections.
Of course, “real” models don’t do this—or do they? The expansion of my chess model which I described last Election Day has started doing this. It fixates on some moves but gives near-zero probability to others—even ones that were played—while giving fits 5–50x sharper than before. If you’ve already had experience with behavior like the above, please feel welcome to jump to the end and let us know in comments. But to see what lessons to learn from how this happens in my new model, here are details…
My chess model assigns a probability to every possible move at every game turn, based only on the values given to those moves by strong computer chess programs and on parameters denoting the skill profile of a formal player. The programs list move options in order of value for the player to move, so the difference between a move’s value and the best move’s value is its raw inferiority in chess-specific centipawn units.
The model asserts that the parameters can be used to compute dimensionless inferiority values, from which projected probabilities are obtained without further reference to either parameters or data. The old model starts with a function that scales down the raw difference according to the overall position value. Then it defines
Lower s and higher c both decrease the probability of playing a sub-optimal move by dint of driving the dimensionless inferiority higher. The effect of s is greatest when the value difference is low, so s is interpreted as the player’s “sensitivity” to small differences in value, whereas c governs the frequency of large mistakes and hence is called “consistency.” My conversion represents each probability as a power of the best-move probability, namely solving the equations
where is the number of legal moves in the position. The double exponential looks surprising but can be broken down by regarding as a “utility share” expressed in proportion to the best move’s utility , then . Alternate formulations can define directly from and the parameters, e.g. by , and/or simply normalize the shares by rather than use powers, but they seem not to work as well.
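Since the formulas above survive here only in outline, the following is a minimal sketch of the “normalize the shares” variant just mentioned. The share function exp(−(δ/s)^c), the function name, and the default parameter values are my assumed stand-ins, not the model’s actual formulas; s plays the role of “sensitivity” and c of “consistency.”

```python
import math

def move_probs(deltas, s=0.08, c=0.5):
    """Sketch of the share-normalization variant: each move gets a utility
    share that decays with its scaled inferiority delta (best move: delta=0,
    share=1), and probabilities are the shares normalized to sum to 1.
    The share function exp(-(delta/s)**c) is an assumed stand-in."""
    shares = [math.exp(-((d / s) ** c)) for d in deltas]
    total = sum(shares)
    return [u / total for u in shares]

# Best move first; probabilities decrease with inferiority
probs = move_probs([0.0, 0.16, 0.16, 0.24, 0.30])
print([round(p, 3) for p in probs])
```

Note that any variant of this shape is monotone: the best move always gets the highest probability, which is exactly the property discussed below.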
This “inner loop” defines the projected probabilities as an ensemble given any point in the parameter space. The “outer loop” of regression needs to determine the point that best conforms to the given data sample. The probabilities determine projections for the frequency of “matching” the computer’s first move and the “average scaled difference” of the played moves by:
The regression makes these into unbiased estimators by matching them to the actual values in the sample. We can view this as minimizing the least-squares “fitness function”
where the weights on the individual tests are fixed ad lib. In fact, my old model virtually always drives the fitness to zero, thus solving two equations in two unknowns. Myriad alternative fitness functions using other statistical tests and weights help to judge the larger quality of the fit and cross-validate the results.
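In outline, the outer loop can be sketched as a grid search for the parameter point minimizing the weighted least-squares mismatch on the two aggregate statistics. The inner share function and all numbers here are stand-ins, not the model’s actual formulas or data.

```python
import math

def project(deltas_by_turn, s, c):
    """Projected move-match frequency and average scaled difference."""
    mt_hat = ad_hat = 0.0
    for deltas in deltas_by_turn:
        # assumed stand-in share function, best move listed first
        shares = [math.exp(-((d / s) ** c)) for d in deltas]
        z = sum(shares)
        probs = [u / z for u in shares]
        mt_hat += probs[0]                                   # match frequency
        ad_hat += sum(p * d for p, d in zip(probs, deltas))  # expected difference
    n = len(deltas_by_turn)
    return mt_hat / n, ad_hat / n

def fitness(deltas_by_turn, s, c, mt_actual, ad_actual, w_mt=1.0, w_ad=1.0):
    mt_hat, ad_hat = project(deltas_by_turn, s, c)
    return w_mt * (mt_hat - mt_actual) ** 2 + w_ad * (ad_hat - ad_actual) ** 2

# Toy data: each inner list is one turn's scaled differences, best move first
data = [[0.0, 0.1, 0.3], [0.0, 0.2], [0.0, 0.05, 0.4, 0.6]]
best = min(((s / 100, c / 10) for s in range(2, 30) for c in range(3, 12)),
           key=lambda sc: fitness(data, sc[0], sc[1], 0.55, 0.06))
print(best)
```

A real fit would use a proper optimizer rather than a grid, but the grid makes the two-equations-in-two-unknowns structure plain.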
In my original model, all is good. My training sets for a wide spectrum of Elo ratings yield best-fit values that not only give a fine linear correspondence to the ratings with residuals small across the spectrum, but the individual sequences of fitted s and c values also give good linear fits to Elo. Moreover, for all rating classes and positions the projected probabilities derived from the fits have magnitudes that spread out over the reasonable moves.
My old model is however completely monotone in this sense: the best move(s) always have the highest projected probability, regardless of the parameter values. Moreover, an uptick in the value of any move increases its projected probability for every parameter setting. This runs counter to the natural idea that weaker players prefer weaker moves.
The new model postulates a mechanism by which weaker moves may be preferred by dint of looking better at earlier stages of the search. A new measure called “swing” is positive for moves whose high worth emerges only late in the search, and negative for moves that look attractive early on but end with subpar values. The latter moves might be “traps” set by a canny opponent, such as the pivotal example from the 2008 world championship match discussed here.
A player’s susceptibility to “swing” is modeled by a new parameter called h for “heave,” as I described last November. The basic idea is that adjusting a move’s value by its swing gives the “subjective value” of the move, so that the difference of subjective values represents the subjective difference in value. The idea I actually use applies swing to adjust the inferiority measure:
where is a fourth parameter and for negative is defined to be . Dropping from the second term and raising it to just the not power would be mathematically equivalent, but coupling the parameters makes it easier to try constraining and/or . (In fact, I’ve tried various other combinations and tweaks to the formulas for and , plus four other parameters kept frozen to default values in examples here. None so far has changed the picture described here.)
Note that the formulas for preserve the property for the first-listed move . When has equal-optimal value, that is , cannot be negative and is usually positive. That makes and hence reduces the share compared to . The first big win for the new model is that it naturally handles a puzzling phenomenon I identified years ago, for which my old model makes an ad-hoc adjustment.
The second big win is that can be negative even when —the swing term overpowers the other. This means the model projects the inferior move as more likely than the engine’s optimal move. This is nervy but in many cases my model correctly “foresees” the player taking the bait.
The third big win—but tantalizing—is that the extended model not only allows solving 2 more equations but often makes other fitness tests align like magic. The first of the following choices of extra equations makes an unbiased estimator for the frequency of playing a move of equal value to the engine’s first move, which became my third cheating test after its advocacy in this paper (see also the reply in this one):
A typical fit that looks great by all these measures is here. It has 26,450 positions from all 497 games at standard time controls with both players rated between Elo 2040 and Elo 2060 since 2014 that are collected in the encyclopedic ChessBase Big 2017 data disc. It shows for to , then and tests related to it, then is repeated between and , and finally come four cases of for 0.01–0.10, 0.11–0.30, 0.31–0.70, and 0.71–1.50, plus four with .
Only , , , and were fitted on purpose. All the other tests follow closely like baby ducks in a row, except for some like captures and advancing versus retreating moves where human peculiarities may be identified. The value of is 5–10x as sharp as what my old model typically achieves. The new model seems to be confirming itself across the board and fulfilling the goal of giving accurate projected probabilities for all moves, not just the best move(s). What could possibly be amiss?
The first hint of trouble comes from the fitted value of being . In my old model, players rated 2050 give between and , while even the best players give . Players rated 2050 are in amateur ranks and leaves no headroom for masters and grandmasters. The value of compounds the sharpness; together with , a slight value difference (say) gets ballooned up to , giving and , which shrinks near 1-in-5,000 when and below 1-in-650,000 when . This is weirdly small—and we have not even yet involved the effects of the swing term with .
Those effects show up immediately in the file. I skip turns 1–8, so White’s 9th move is the first item. Black has just captured a pawn and White has three ways to re-take, all of them reasonable moves according to the Stockfish 7 program. Here is how my new model projects them:
NRW Class1 1314;Germany;2014.02.02;6.4;Franke, Thomas;Doennebrink, Elmar;1-0
r1b1k2r/pp2bppp/2n1pn2/3q4/3p4/2PBBN2/PP3PPP/RN1Q1RK1 w kq - 0 9; c3xd4, engine c3xd4 Eval 0.24 at depth 21; swap index 1 and
spec AA2050SF7w4sw10-19: (InvExp:1), Unit weights with s = 0.0083, c = 0.3846, d = 12.5000, v = 0.0500, a = 0.9863, hm = 1.8024, hp = 1.0000, b = 1.0000:

M#  Rk  Move      RwDelta  ScDelta   Swing  SwDDep   SwRel  Util.Share  ProjProb'y
 1   1  c3xd4:       0.00    0.000   0.000   0.000   0.000  1           0.79527569
 2   2  Nf3xd4:      0.42    0.321  -0.035  -0.034  -0.034  0.144422    0.20472428
 3   3  Be3xd4:      0.55    0.395   0.008   0.005   0.005  0.00445313  0.00000001
That’s right—it gives zero chance of a 2050-player taking with the Bishop, even though Stockfish rates that only a little worse than taking with the Knight. True, human players would say 9.Bxd4 is a stupid move because it lets Black gain the “Bishop pair” by exchanging his Knight for that Bishop. Of 155 games that ChessBase records as reaching this position, 151 saw White recapture by 9.cxd4, 4 by 9.Nxd4, and none by 9.Bxd4. So maybe the extremely low projection—for 9.Bxd4 and all other moves—has a point. But to give zero? The 0.00445313 is the utility share, not the probability; the true projection is far smaller still, and the listed 0.00000001 is an imposed minimum. My original model—leaving out the swing term and fitting only s and c—spreads out the probability nicely, maybe even too much here:
M#  Rk  Move      RwDelta  ScDelta   Swing  SwDDep   SwRel  Util.Share  ProjProb'y
 1   1  c3xd4:       0.00    0.000   0.000   0.000   0.000  1           0.57620032
 2   2  Nf3xd4:      0.42    0.321  -0.035  -0.034  -0.034  0.280586    0.14018157
 3   3  Be3xd4:      0.55    0.395   0.008   0.005   0.005  0.241178    0.10168579
At Black’s 11th turn, however, the new model gives three clearly wrong “zero” projections:
NRW Class1 1314;Germany;2014.02.02;6.4;Franke, Thomas;Doennebrink, Elmar;1-0
r1b2rk1/pp2bppp/2nqpn2/8/3P4/P1NBBN2/1P3PPP/R2Q1RK1 b - - 0 11; Rf8-d8, engine b7-b6 Eval 0.11 at depth 20; swap index 1 and
spec AA2050SF7w4sw10-19: (InvExp:1), Unit weights with s = 0.0083, c = 0.3846, d = 12.5000, v = 0.0500, a = 0.9863, hm = 1.8024, hp = 1.0000, b = 1.0000:

M#  Rk  Move      RwDelta  ScDelta   Swing  SwDDep   SwRel  Util.Share  ProjProb'y
 1   1  b7-b6:      -0.00   -0.000   0.000   0.000   0.000  1           0.56792559
 2   2  Nf6-g4:      0.18    0.163  -0.001  -0.002  -0.002  0.0907468   0.00196053
 3   3  Rf8-d8:      0.18    0.163   0.042   0.046   0.046  0.00391154  0.00000001
 4   4  Bc8-d7:      0.21    0.187  -0.029  -0.030  -0.030  0.278053    0.13071447
 5   5  Nf6-d5:      0.28    0.241   0.047   0.050   0.050  0.00218845  0.00000001
 6   6  a7-a6:       0.30    0.256  -0.049  -0.051  -0.051  0.28777     0.14001152
 7   7  Qd6-c7:      0.31    0.264  -0.012  -0.012  -0.012  0.097661    0.00304836
 8   8  g7-g6:       0.37    0.306   0.015   0.017   0.017  0.00355675  0.00000001
 9   9  Qd6-d8:      0.39    0.320  -0.054  -0.051  -0.051  0.206264    0.06438231
10  10  Qd6-b8:      0.39    0.320  -0.037  -0.038  -0.038  0.158031    0.02787298
Owing to many other games having “transposed” here by a different initial sequence of moves, Big 2017 shows 911 games reaching this point. In 683 of them, Black played the computer’s recommended 11…b6. None played the second-listed move 11…Ng4, which reflects well on the model’s giving it a tiny probability. But the third-listed move 11…Rd8 gets a zero despite having been chosen by 94 players. Then 91 played the sixth-listed 11…a6, which actually gets the second-highest nod from the model, and 22 played 11…Bd7, which the new model considers third most likely. But 12 players chose 11…Nd5, four of them rated over 2300 including the former world championship candidate Alexey Dreev in a game he won at the 2009 Aeroflot Open. My old model’s fit of the same data gives 34.8% to 11…b6, 10.4% to 11…Ng4 and 7.5% to 11…Rd8 with the ad-hoc change for tied moves (would be 8.7% to both without it), and 5.1% to 11…Nd5, with eighteen moves getting at least 1%.
To be sure, this is a well-known “book” position. The 75% preference for 11…b6 doubtless reflects players’ knowledge of past games and even the fact that Stockfish and other programs consider it best. It is hard to do a true distributional benchmark of my model in selected positions because the ones with enough games are exactly the ones in “book.” Studies of common endgame positions have been tried then and now, but with the issue that the programs’ immediate complete resolution of these endgames seems to wash out much of the progression in thinking and differentiation of player skill that one would like to capture. (My cheating tests exclude all “book-by-2300+” positions and all with one side ahead more than 3.00.) Most to the point, the fitting done by my model on training data is supposed to be already the distributional test of how players of that rating class have played over many thousands of instances.
The following position is far from book and typifies the most egregious kind of mis-projection:
SVK-chT1E 1314;Slovakia;2014.03.23;11.6;Debnar, Jan;Milcova, Zuzana;1-0
2r4k/pp5p/2n5/2P1p2q/2R1Qp1r/P2P1P2/1P3KP1/4RB2 b - - 1 32; Qh5-g5, engine Qh5-g5 Eval 0.01 at depth 21; swap index 2 and
spec AA2050SF7w4sw10-19: (InvExp:1), Unit weights with s = 0.0083, c = 0.3846, d = 12.5000, v = 0.0500, a = 0.9863, hm = 1.8024, hp = 1.0000, b = 1.0000:

M#  Rk  Move      RwDelta  ScDelta   Swing  SwDDep   SwRel  Util.Share  ProjProb'y
 1   1  Qh5-g5:      0.00    0.000   0.129   0.137   0.000  0.0347142   0.00018527
 2   2  Rc8-d8:      0.00    0.000   0.026   0.025  -0.112  1           0.74206054
 3   3  a7-a6:       0.09    0.085  -0.008  -0.005  -0.142  0.117659    0.07922164
 4   4  Rc8-g8:      0.21    0.187  -0.092  -0.087  -0.225  0.0996097   0.05003989
 5   5  Rh4-h1:      0.25    0.219  -0.166  -0.165  -0.302  0.136659    0.11270482
This has two tied-optimal moves for Black in a position judged +0.01 to White, not a flat 0.00 draw value, yet the one that was played gets under a 1-in-5,000 projection. Here are the by-depth values that produced the high positive swing value:
         5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21
Qg5   -117 -002 -008 +000 -015 +008 +017 +032 +011 +007 +000 +004 +001 +000 +000 +006 +001
Rd8   -089 -058 -036 -032 -013 -025 +006 +000 -010 +000 -012 -013 +001 +014 +001 +001 +001
The numbers are from White’s view, so what happened is that 32…Rd8 looked like giving Black the advantage at depths 10, 13, and 15–16, whereas 32…Qg5 looked significantly inferior (to Stockfish 7) at depth 12 and nosed in front only at depth 20 just before falling into the tie. The swing computation begins at depth 10 to evade the Stockfish-specific strangeness I noted here last year, so in particular the “rogue” values at depth 5 (and below) are immaterial. The values and differences from depth 10 onward are all relatively gentle. Hence their amounting to a tiny utility share for 32…Qg5 versus the full share for 32…Rd8, and a microscopic projected probability, is a sudden whiplash.
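For illustration, here is one simple way to reduce such a by-depth record to a single “swing” number: the final value minus the average of the earlier values, taken from the mover’s (here Black’s) point of view, skipping depths below 10 as just described. This is a stand-in, not the model’s actual swing formula.

```python
# Sketch of a swing measure: final value minus average of earlier values,
# from the mover's (Black's) view, ignoring depths below start_depth.
# This is a stand-in for illustration, not the model's actual formula.

def swing(white_view_centipawns, first_depth=5, start_depth=10):
    # convert White-view centipawns to Black-view pawns and drop low depths
    vals = [-v / 100.0
            for depth, v in enumerate(white_view_centipawns, start=first_depth)
            if depth >= start_depth]
    earlier, final = vals[:-1], vals[-1]
    return final - sum(earlier) / len(earlier)

# By-depth values (depths 5..21) from the table above
qg5 = [-117, -2, -8, 0, -15, 8, 17, 32, 11, 7, 0, 4, 1, 0, 0, 6, 1]
rd8 = [-89, -58, -36, -32, -13, -25, 6, 0, -10, 0, -12, -13, 1, 14, 1, 1, 1]
print(round(swing(qg5), 3), round(swing(rd8), 3))
```

Even this crude measure gives 32…Qg5 the larger, positive swing, since its worth (for Black) emerged only late in the search.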
What I believe is happening to the fit is hinted by this last example giving the highest probability to the 2nd-listed move. Our first game above has two positions where the 9th-listed move gets the love. (The second, shown in full here, is notable in that the second-best move gets a zero though it is inferior by only 0.03 and was played by all three 2200+ players in the book.) This conforms to the goal of projecting when weaker players will prefer weaker moves.
This table shows that the new model quite often prefers moves other than , compared to how often they are played:
To be sure, the model is not putting 100% probability on these preferred moves, but when preferred they get a lot more probability than under my old model, which never prefers a move other than . Recall however that my old model’s fit was not too far off on these indices—and both models are fitted to give the same total probability to over all positions . Hence the probability on inferior moves is conserved but more concentrated.
Yes, greater concentration was the goal—so as to distinguish the most plausible inferior moves. But the above examples show a runaway process. The new model seems to be seizing onto properties of the distribution alone. For each we can define to be the move with the most negative value of . The also form a histogram over . The fitting process can grab it by putting all weight on plus at most a few other moves at each turn .
These few moves are the “stopped-watch reading” in my analogy. The moves given zero are the readings that cannot happen for a given runner/position. The fitting doesn’t care whether moves getting zero were played, so long as other turns fill in the histogram. If a high for —as with 32…Rd8 above—fills a gap, the fit will gravitate toward values of and that beat down all the moves with at such turns . In trials on other data, I’ve seen crash under while zooms aloft in a crazy race.
What can fix this? The maximum likelihood estimator (MLE) in this case involves minimizing the negated log-sum of the projected probabilities of the moves played at each turn. Adding it as a weighted component of the fitness function helps a little by inflating the probability of the moves that were played, but so far not a lot. Even more on-point may be maximum entropy (ME) estimation, which in this case means minimizing the sum of p·log(p) over the projected move probabilities p.
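As a sketch of the MLE idea, with invented probabilities and an assumed floor value mirroring the imposed minimum mentioned above:

```python
import math

# Sketch of adding a maximum-likelihood component to the fitness.
# probs_by_turn[t] holds projected probabilities at turn t; played[t] is the
# index of the move actually played. Minimizing the negated log-sum punishes
# near-zero projections for moves that were played. The floor is an assumed
# stand-in for the imposed minimum probability.

def neg_log_likelihood(probs_by_turn, played, floor=1e-8):
    return -sum(math.log(max(probs[i], floor))
                for probs, i in zip(probs_by_turn, played))

def combined_fitness(ls_fitness, probs_by_turn, played, w_mle=0.1):
    # weighted MLE term added to an existing least-squares fitness value
    return ls_fitness + w_mle * neg_log_likelihood(probs_by_turn, played)

# A projection that zeroes out a played move pays a large penalty:
print(neg_log_likelihood([[0.6, 0.3, 0.1]], [2]))   # modest
print(neg_log_likelihood([[0.8, 0.2, 1e-8]], [2]))  # large
```

The weight on the MLE term trades off aggregate unbiasedness against individual plausibility, which is exactly the tension in the examples above.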
There are various other ways to fit the model, including a quantiling idea I devised in my AAAI 2011 paper with Guy Haworth. In principle, and because the training data is copious, these ways ought to agree more closely than they do at present. Absent a lightning bolt that fuses them, I find myself locally tweaking the model in directions that optimize some “meta-fitness” function composed from all these tests.
Is this a known issue? Does it have a name? Is there a standard recipe for fixing it?
Do any deployed models have similar tendencies that aren’t noticed because there isn’t the facility for probing deeper into the grain that my chess model enjoys?
[added “at standard time controls”, a few other word changes]
Cropped from source
Bill Clinton was the 42nd President of the United States. He came close to becoming the first First Gentleman—or whatever we will call the husband of a female president. He is also a fan of crossword puzzles, and co-authored with Victor Fleming a puzzle for this past Friday’s New York Times.
Today we discuss an apparently unintended find in his puzzle. It has a Mother’s Day theme.
The puzzle was widely publicized as having a “secret message” or “Easter egg.” Many crossword puzzles have a theme constituted by the longer answers, but the Friday and Saturday NYT puzzles are usually themeless. They are also designed to be the hardest in a progression that begins with a relatively easy Monday puzzle each week. The online renditions are subscriber-only, but the Times opened this puzzle freely to the public, so you are welcome to try to solve it and find the “hidden” content before we give it away.
In a previous post we featured Margaret Farrar, the famous first crossword editor for the Times, and described how the puzzles look and work. Proper nouns such as CHILE the country, standard abbreviations, and whole phrases are fair game as answers, and they are rammed together without spaces or punctuation. For instance, the clue “Assistance for returning W.W. II vets” in Clinton’s puzzle produces the answer GIBILL. (My own father, returning from the occupation of Japan, completed his college degree under the G.I. Bill.) Some clues are fill-in-the-blank, such as “Asia’s ____ Sea” in the puzzle.
The intended hidden message is formed from three long answers symmetrically placed around the puzzle’s center. It is the signature line from a 1977 Fleetwood Mac song that Clinton has used since his 1992 presidential campaign. If you expected the puzzle to have a theme, these three lines would obviously be it.
An “Easter egg” is a side feature, usually small and local and often, as Wikipedia says, an inside joke. When I printed and did the puzzle over lunch on Friday, I missed the intended content because it wasn’t the kind I was looking for. But I did find something one can call an “Eester gee” involving the three shorter clues and answers mentioned above:
My eye had been drawn by finding Bill in his own puzzle. Winding through him is HILLAREE, indeed in three different ways but with EE in place of Y. Straining harder, one can extract CHEL- from CHILE and get -Sea from the clue for ARAL just underneath to find Chelsea, the Clintons’ only daughter.
Admittedly this is both stilted and cryptic, but it is singularly tied to the former First Family and appropriate just before Mother’s Day. Was this hidden by intent, or was it hiding by accident? Presuming the latter, what does this say about the frequency with which we can find unintended patterns? This matters not only to some historical controversies but also to cases of alleged plagiarism of writing and software code, and even to this investigation of song lyrics being planted in testimony.
Can we possibly judge the accidental frequency of such subjective patterns? Clinton’s puzzle allows us to experiment a little further. His only grandchild, Chelsea’s daughter, is named Charlotte. Can we find her in the same place?
Right away, CHILE and ARAL give us CHAR in a square, a promising start. There are Ls nearby, but no O. Nothing like “Lenya” or a ‘Phantom‘ reference is there to clue LOTTE. The THREE in our grid is followed by TON to answer the clue, “Like some heavy-duty trucks,” but getting the last four needed letters from there lacks even the veneer of defense of my using the I in CHILE as a connector. Is three tons a “lot”? No doting grandpa would foist that on a child. So we must reject the hypothesis that she is present.
We can attack the CHILE weakness in a similar manner. The puzzle design could have used CHELL, the player character of the classic video game Portal. HILLAREE would still have survived by using the I in Bill. However, the final L would have come below the N in the main-theme word THINKING, and it is hard to find natural answer words ending in NL. So our configuration has enough local optimality to preserve the contention that Chelsea is naturally present. Whether it is truly natural remains dubious, but it dodges this shot at refutation.
Going back, how should we regard the false-start on Charlotte? We should not be surprised that it got started. That she shares the first two letters with Chelsea may have been “correlated” if not expressly purposeful. Such correlations are a major hard-to-handle factor in cases of suspected plagiarism or illicit signaling, as both Dick and I can attest generally from experience.
Of course, this is more the stuff of potboilers and conspiracy theories than serious research. That hasn’t stopped it from commanding the input of some of our peers, however. The best-selling 1997 book The Bible Code, following a 1994 paper, alleges that sequences of Hebrew letters at fixed-jump intervals in the Torah—the first five books of the Hebrew Bible—form sensible prophetic messages to a degree far beyond statistical expectation.
The fact that Hebrew skips many vowels helps in forming patterns. For instance, arranging the start of Genesis into a 50-column crossword yields TORaH in column 6, and as Wikipedia notes here, exactly the same happens in column 8 at the start of Exodus. Even just among the consonants, some alleged messages have glitches and skips like ours with HILLAREE and CHILE. Where is the line between patching-and-fudging and true statistical surprise? Our friend Gil Kalai was one of four authors of a 1999 paper delving deep into the murk. They didn’t just critique the 1994 paper; they conducted various experiments. Some were akin to ours above with CHARLOTTE, some could be likened to trying to find unsavory Clinton associations in the same puzzle, and the largest replicated many of the same kind of finds in a Hebrew text of War and Peace.
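The fixed-jump search behind such claims, known as an equidistant letter sequence (ELS) search, is easy to sketch. The sample paragraph and target word below are my inventions for illustration.

```python
# Sketch of an equidistant letter sequence (ELS) search: look for a word
# spelled out by every k-th letter of a text, over a range of jump sizes.
# With enough text and enough allowed skips, hits in ordinary prose are
# unsurprising.

def els_hits(text, word, max_skip=50):
    word = word.upper()
    letters = [c for c in text.upper() if c.isalpha()]
    hits = []
    for skip in range(2, max_skip + 1):
        for start in range(len(letters)):
            end = start + skip * (len(word) - 1)
            if end >= len(letters):
                break
            if all(letters[start + j * skip] == word[j]
                   for j in range(len(word))):
                hits.append((start, skip))  # (starting index, jump size)
    return hits

sample = ("In the beginning the statistician arranged the letters of an "
          "ordinary English paragraph into rows and searched for names "
          "hidden at fixed intervals, and of course some always appear.")
print(els_hits(sample, "TEN"))  # even this short paragraph yields hits
```

Multiplying the allowed skips, starting points, and candidate words is what makes the expected number of “miraculous” finds so large.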
The controversy over the genesis of William Shakespeare’s plays has notoriously involved allegedly hidden messages, most famously stemming from the 1888 book The Great Cryptogram supporting Francis Bacon as their true author. Two other major claimants, Edward de Vere (the seventeenth Earl of Oxford) and Christopher Marlowe, are hardly left out. Indeed, they both get crossword finds in the most prominent place of all, the inscription on Shakespeare’s funerary monument in Stratford, England:
The inscription is singular in challenging the “passenger” (passer-by) to “read” who is embodied within the Shakespeare monument. His tomb proper is nearby in the ground. Supporters of de Vere arrange the six parts of the Latin preface into a crossword and find their man in column 2:
The leftover OL is a blemish but it might not be wasted—it could refer to “Lord Oxford” in like manner to how “Mr. W.H.” in the dedication to Shake-speares Sonnets plausibly refers to Henry Wriothesley, the Earl of Southampton, who was entreated to marry one of Oxford’s daughters throughout 1590–1593.
Supporters of Marlowe volley back in the style of a British, not American, crossword. Their answer construes this part of the inscription as a cryptic-crossword clue:
Whose name doth deck this tomb, far more, then cost.
The only name on Shakespeare’s tomb is Jesus, and the Oxford English Dictionary registers ley as an old word for a bill or tax, generically a cost. The answer to the monument’s riddle thus becomes CHRISTO-FAR MORE-LEY, which is within the convex hull of how Marlowe’s name was spelled in his lifetime. The subsequent SIEH, which is most simply explained as a typo for SITH meaning “Truly,” is constructed by modern cryptic-crossword convention as “HE IS returned,” in line with theories that Marlowe’s 1593 murder was actually staged to put him under deep cover in the Queen’s secret service.
What to make of these two readings? The only solid answer Dick and I have is the same as when we are sent a claimed proof of P=NP one week and a claimed proof of P≠NP the next:
They can’t both be right.
Or—considering that Marlowe has recently been credited as a co-author of Shakespeare’s Henry VI cycle, and that William Stanley, who completes Wikipedia’s featured quartet of claimants, wound up marrying the above-mentioned daughter of Oxford—perhaps they can.
Where do you draw the lines among commission, coincidence, and contrivance? Where does my Clinton crossword finding fall?
Happy Mother’s Day to you and yours as well.
[fixed description of Chell character, “seventh”->”seventeenth”, added ref. to song-lyrics case, some wording tweaks]
Alternate photo by Quanta
Thomas Royen is a retired professor of statistics in Schwalbach am Taunus near Frankfurt, Germany. In July 2014 he had a one-minute insight about how to prove the famous Gaussian correlation inequality (GCI) conjecture. It took one day for him to draft a full proof of the conjecture. It has taken several years for the proof to be accepted and brought to full light.
Today Ken and I hail his achievement and discuss some of its history and context.
Royen posted his paper in August 2014 with the title, “A simple proof of the Gaussian correlation conjecture extended to multivariate gamma distributions.” He not only proved the conjecture, he recognized and proved a generalization. The “simple” means that the tools needed to solve it had been available for decades. So why did it elude some of the best mathematicians for those decades? One reason may have been that the conjecture spans geometry, probability theory, and statistics, so there were diverse ways to approach it. A conjecture that can be viewed in so many ways is perhaps all the more difficult to solve.
Even more fun is that Royen proved the conjecture after he was retired and had the key insight while brushing his teeth—as told here. Ken recalls one great bathroom insight not in his research but in chess: In the endgame stage of the famous 1999 Kasparov Versus the World match, which became a collaborative research activity later described by Michael Nielsen in his book, Reinventing Discovery, Ken had a key idea while in the shower. His idea, branching out from the game at 58…Qf5 59. Kh6 Qe6, was the Zugzwang maneuver 60. Qg1+ Kb2 61. Qf2+ Kb1 62. Qd4!, which remains the only way for White to win.
Although solutions often come in a flash, the ideas they resolve often germinate from partial statements whose history takes effort to trace. One thing we can say is that the GCI does not originate with Carl Gauss, nor should it be considered named for him. A Gaussian measure on n-dimensional space (centered on the origin) is defined by having the probability density

    f(x) = exp(−x^T Σ^{−1} x / 2) / sqrt((2π)^n det Σ),

where Σ is a non-singular covariance matrix and x^T just means the transpose of x. Its projection onto any component is a usual one-variable normal distribution.
Suppose $I_1$ is a 90% confidence interval for a variable $x_1$ and $I_2$ a 90% confidence interval for another variable $x_2$. What is the probability that both variables fall into their intervals? If they are independent, then it is $0.9 \times 0.9 = 0.81$.
What if they are not independent? If they are positively correlated, then we may expect it to be higher. If they are inversely related, well…let’s also suppose the variables have mean $0$ and the intervals are symmetric around $0$: $I_1 = [-a, a]$, $I_2 = [-c, c]$. Do we still get at least $0.81$? This—extended to any subset of the variables with any smattering of correlations and to other shapes besides the products of intervals—is the essence of the conjecture.
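A quick Monte Carlo sanity check of this interval form is easy to run. The correlation value below is our own choice, and $1.6449$ is the usual two-sided 90% cutoff for a standard normal:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = -0.6                       # try an inverse correlation
cov = np.array([[1.0, rho], [rho, 1.0]])
x = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000)

a = c = 1.6449                   # ~90% symmetric intervals for N(0,1)
in_A = np.abs(x[:, 0]) <= a
in_B = np.abs(x[:, 1]) <= c

lhs = np.mean(in_A & in_B)       # P(both variables in their intervals)
rhs = np.mean(in_A) * np.mean(in_B)
print(lhs >= rhs)                # the conjecture predicts True
```

Even with the negative correlation, the joint probability comes out above the $0.81$ product, as the conjecture demands.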
Charles Dunnett and Milton Sobel considered some special cases, such as when the correlations have the product form $\rho_{ij} = b_i b_j$ for some vector $b$. Their 1955 paper is considered by some to be the source of GCI.
But it was Olive Dunn who first posed the problem in the above general terms, in a series of papers that have had other enduring influence. The first paper, in 1958, and the second, in 1959, bore the like-as-lentils titles “Estimation of the Means of Dependent Variables” and “Estimation of the Medians for Dependent Variables.”
These seem to have generated confusion. The former is longer, frames the confidence-interval problem, and is the only one to cite Dunnett-Sobel, but it does not mention a “conjecture.” The latter does discuss at the end exactly the conjecture of extending a case she had proved in low dimensions to arbitrary $n$, but relates a reader’s counterexample. Natalie Wolchover ascribed the conjecture to the 1959 paper in her article linked above, but Wikipedia and other sources reference the 1958 paper, while subsequent literature we’ve seen has instances of citing either—and never both.
Dunn became a fellow of the American Statistical Association, a fellow of the American Association for the Advancement of Science (AAAS), and a fellow of the American Public Health Association. In 1974, she was honored as the annual UCLA Woman of Science, awarded to “an outstanding woman who has made significant contributions in the field of science.” Her third paper in this series, also 1959, was titled “Confidence intervals for the means of dependent normally distributed variables.” Her fourth, in 1961, is known for the still-definitive form of the Bonferroni correction for joint variables. But in our episode of “CSI: GCI” it seems we must look later to find who framed the conjecture as we know it.
Not an ad. Amazon source. So is it an ad? |
Sobel came back to the scene as part of a 1972 six-author paper, “Inequalities on the Probability Content of Convex Regions for Elliptically Contoured Distributions.” They considered integrals of the form

$$P_\Sigma(A) = K \int_A g(x^T \Sigma^{-1} x)\, dx$$

for general functions $g$ besides the Gaussian kernel and for general positive definite $\Sigma$. GCI in this case then has the form $P_\Sigma(A) \geq P_I(A)$ for $A$ a product of symmetric intervals, where $I$ is the identity matrix. They call the density elliptically contoured provided $\int_{\mathbb{R}^n} g(x^T x)\, dx$ is finite. Writing about the history, they say (we have changed a few symbols and the citation style):
Inequalities of the form $P_\Sigma(A) \geq P_I(A)$ perhaps originate with special results of Dunnett and Sobel (1955) and of Dunn (1958), in which it is shown that the inequality holds for special forms of $\Sigma$ (with $\sigma_{ij} = b_i b_j$ for $i \neq j$) or for special values of $n$.
They mention also an inequality by David Slepian and what they termed “the most general result for the normal distribution” by Zbyněk Šidák, still with special conditions on $\Sigma$. Their main result is “an extension of Šidák’s result to general elliptically contoured densities [plus] a stronger version dealing with a convex symmetric set.” This is where the relaxation from products of confidence intervals took hold. At last, after their main proof in section 2 and discussion in section 3, we find the magic word “conjecture”:
This suggests the conjecture: if $X = (X_1, X_2)$ is a random vector (with $X_1$ of dimension $k$ and $X_2$ of dimension $n - k$) having an elliptically contoured density and if $A_1$ and $A_2$ are convex symmetric sets, then

$$P(X_1 \in A_1,\ X_2 \in A_2) \;\geq\; P(X_1 \in A_1)\, P(X_2 \in A_2).$$

Clearly by iteration this implies the inequality with regard to any number of blocks of coordinates. Here symmetric means just that $-x$ belongs whenever $x$ belongs. Any symmetric convex set can be decomposed into intersections of strips of the form $\{x : |\langle a, x \rangle| \leq c\}$ for fixed $a$ and $c$, which their generality set them up to handle, and proving the inequality for strips suffices. This is considered the modern statement of GCI. The rest of their paper—over half of it—treats attempts to prove it and counterexamples to some further extensions.
Finally in 1977, Loren Pitt proved the case $n = 2$, referencing the 1972 paper and Šidák but not Dunnett-Sobel or Dunn. Wolchover interviewed Pitt for her article, and this extract is revealing:
Pitt had been trying since 1973, when he first heard about [it]. “Being an arrogant young mathematician … I was shocked that grown men who were putting themselves off as respectable math and science people didn’t know the answer to this,” he said. He locked himself in his motel room and was sure he would prove or disprove the conjecture before coming out. “Fifty years or so later I still didn’t know the answer,” he said.
So as for framing GCI, whodunit? Royen ascribes it to the 1972 paper, which is probably what popularized it to Pitt. But Dunn’s orthogonal-intervals formulation spurred the intervening work, accommodates extensions noted as equivalent to GCI by Royen citing this 1998 paper, and still did not get solved until Royen. So we find these two sources equally “guilty.”
The 1972 form of GCI has a neatly compact statement and visualization:
For any symmetric convex sets $A$ and $B$ in $\mathbb{R}^n$ and any Gaussian measure $\mu$ on $\mathbb{R}^n$ centered at the origin,

$$\mu(A \cap B) \;\geq\; \mu(A)\,\mu(B).$$
That is, imagine overlapping shapes symmetric about the origin in some Euclidean space. Throw darts that land with a Gaussian distribution around the origin. The claim is that the probability that a dart lands in both shapes is at least the probability that it lands in one shape times the probability that it lands in the other shape.
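The dart picture can be simulated directly. The two shapes below, a centered square and a symmetric diagonal slab, are our own toy choices of convex symmetric sets:

```python
import numpy as np

rng = np.random.default_rng(1)
darts = rng.standard_normal((1_000_000, 2))   # Gaussian darts at the origin

# A: a centered square; B: the symmetric slab |x + y| <= 1.
in_A = (np.abs(darts[:, 0]) <= 1.0) & (np.abs(darts[:, 1]) <= 1.0)
in_B = np.abs(darts[:, 0] + darts[:, 1]) <= 1.0

both = np.mean(in_A & in_B)
product = np.mean(in_A) * np.mean(in_B)
print(both >= product)    # GCI predicts True
```

Any other pair of origin-symmetric convex shapes can be swapped in; the inequality should survive every such experiment.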
UK Daily Mail source |
George Lowther, in his blog “Almost Sure,” has an interesting post about early attempts to solve GCI. He notes the following partial results from the above-mentioned 1998 paper:
The first statement proves GCI in a “shrunken” sense, while the second makes that seem tantamount to solving the whole thing. Lowther explained, however:
Unfortunately, the constant $c$ in the first statement is strictly less than one, so the second statement cannot be applied. Furthermore, it does not appear that the proof can be improved to increase $c$ to one. Alternatively, we could try improving the second statement to only require the sets to be contained in the ball of radius $c\sqrt{n}$ for some $c < 1$ but, again, it does not seem that the proof can be extended in this way.
Royen did not use this idea—indeed, Wolchover quotes Pitt as saying, “what Royen did was kind of diametrically opposed to what I had in mind.” Instead she explains how Royen used a kind of smoothing between the original covariance matrix and one with its off-diagonal entries zeroed out, as a parameter $t$ varies from $0$ to $1$, taking derivatives with respect to $t$. For this he had tools involving Laplace transforms and other tricks at hand:
“He had formulas that enabled him to pull off his magic,” Pitt said. “And I didn’t have the formulas.”
Royen’s short paper does need the background of these tricks to follow, and the fact that the same tricks enabled a further generalization of GCI makes it harder to read. The proof was made more self-contained in this 2015 paper by Rafał Latała and Dariusz Matlak (final version) and in a 2016 project by Tianyu Zhou and Shuyang Shen at the University of Toronto, both focusing just on GCI and the cases closest to Dunn’s papers. Rather than go into proof details here, we’ll say more about the wider context.
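As a toy numerical illustration of the smoothing idea (and only that—Royen’s actual argument is analytic, via transform identities, not simulation), one can watch the joint probability of two symmetric intervals grow as the correlation is interpolated from zero to its full value. All parameters here are our own choices:

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.standard_normal((500_000, 2))   # shared base samples for every t
a = 1.6449                              # ~90% interval endpoint for N(0,1)
rho = 0.8

probs = []
for t in np.linspace(0.0, 1.0, 6):
    cov = np.array([[1.0, t * rho], [t * rho, 1.0]])
    x = z @ np.linalg.cholesky(cov).T   # samples with covariance cov
    inside = (np.abs(x[:, 0]) <= a) & (np.abs(x[:, 1]) <= a)
    probs.append(inside.mean())

# At t = 0 the joint probability is (nearly) the product 0.9 * 0.9 = 0.81;
# it should climb as t goes to 1.
print(all(q >= p - 1e-3 for p, q in zip(probs, probs[1:])))
```

Reusing the same base samples for every $t$ (a common-random-numbers trick) keeps the estimates comparable across the interpolation.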
Independent events are usually the best type of events to work with. Recall that if $A$ and $B$ are independent events, then

$$\Pr[A \cap B] \;\geq\; \Pr[A] \cdot \Pr[B].$$

Of course actually more is true: $\Pr[A \cap B] = \Pr[A] \cdot \Pr[B]$. But we focus on the inequality, since it can hold even when $A$ and $B$ are not independent. In general, without some assumption on the events $A$ and $B$, the inequality is not true: consider the event $A$ that a fair coin lands heads and the event $B$ that it lands tails. Then the inequality becomes $0 \geq 1/4$, which is false.
Since independence is not always true for two events, it is of great value to know when $\Pr[A \cap B] \geq \Pr[A] \cdot \Pr[B]$ still holds. Even an approximation is of great value. Note, a simple case where it still holds is when $A \subseteq B$: then the inequality is trivial, since $\Pr[A \cap B] = \Pr[A] \geq \Pr[A] \cdot \Pr[B]$.
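Both the coin counterexample and the trivial $A \subseteq B$ case can be checked with exact rational arithmetic; the die-roll sample space below is just our own toy illustration:

```python
from fractions import Fraction

# Counterexample: A = "fair coin lands heads", B = "same flip lands tails".
pA = pB = Fraction(1, 2)
p_both = Fraction(0)             # the events are disjoint
print(p_both >= pA * pB)         # False: 0 < 1/4

# Trivial case A subset of B, on the sample space of one fair die roll.
A = {2}                          # "roll a two"
B = {2, 4, 6}                    # "roll an even number"
pA = Fraction(len(A), 6)
pB = Fraction(len(B), 6)
p_both = Fraction(len(A & B), 6)
print(p_both >= pA * pB)         # True: 1/6 >= 1/12
```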
GCI reminds us of another inequality that intuitively cuts very fine and was difficult to prove: the FKG inequality. Ron Graham wrote a survey of FKG that begins with a discussion of Chebyshev’s sum inequality, named after the famous Pafnuty Chebyshev.
Chebyshev’s sum inequality states that if

$$a_1 \geq a_2 \geq \cdots \geq a_n$$

and

$$b_1 \geq b_2 \geq \cdots \geq b_n,$$

then

$$\frac{1}{n} \sum_{k=1}^n a_k b_k \;\geq\; \left(\frac{1}{n} \sum_{k=1}^n a_k\right) \left(\frac{1}{n} \sum_{k=1}^n b_k\right).$$
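A quick exact check of the sum inequality, on randomly generated sequences sorted into the same (descending) order:

```python
from fractions import Fraction
from random import Random

rng = Random(0)
n = 8
# Two similarly ordered integer sequences.
a = sorted((rng.randint(0, 100) for _ in range(n)), reverse=True)
b = sorted((rng.randint(0, 100) for _ in range(n)), reverse=True)

lhs = Fraction(sum(x * y for x, y in zip(a, b)), n)      # average of products
rhs = Fraction(sum(a), n) * Fraction(sum(b), n)          # product of averages
print(lhs >= rhs)   # True: guaranteed by Chebyshev's sum inequality
```

Sorting one sequence ascending instead flips the inequality, which is the negative-correlation side of the same intuition.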
Wikipedia’s FKG article says how the relevance expands to other inequalities:
Informally, [FKG] says that in many random systems, increasing events are positively correlated, while an increasing and a decreasing event are negatively correlated.
An earlier version, for the special case of i.i.d. variables, … is due to Theodore Edward Harris (1960) … One generalization of the FKG inequality is the Holley inequality (1974) below, and an even further generalization is the Ahlswede-Daykin “four functions” theorem (1978). Furthermore, it has the same conclusion as the Griffiths inequalities, but the hypotheses are different.
We wonder whether the new results on GCI will spur an over-arching appreciation of all these inequalities involving correlated variables. We also wonder if in the complex case there is any connection between Royen’s smoothing technique and the process of purifying a mixed quantum state.
The amazing personal fact is that a retired mathematician solved the problem and did it with a relatively simple proof. What does this say about our core conjectures in theory? I am near retirement from Georgia Tech—does that mean I will solve some major open problem? Hmmmmmmm.
Also, which of you have had key insights come in the bathroom?
Boaz Barak and Michael Mitzenmacher are well known for many great results. They are currently working not on a theory paper, but on a joint “experiment” called Theory Fest.
Today Ken and I want to discuss their upcoming experiment and spur you to consider attending it.
There are many pros and some cons in attending the new Theory Fest this June 19-23. One pro is where it is being held—Montreal—and another is the great collection of papers that will appear at the STOC 2017 part of the Fest. But the main ‘pro’ is that Boaz and Mike plan on doing some special events to make the Fest more than just a usual conference on theory.
The main ‘con’ is that you need to register soon here, so do not forget to do that.
We humbly offer some suggestions to spice up the week:
A Bug-a-thon: Many conferences have hack-a-thons these days. A theory version could be a P=NP debugging contest. Prior to the Fest, anyone claiming to have solved P vs NP must submit a paper along with a $100 fee (Canadian). At the Fest, teams of “debuggers” would get the papers and have a fixed time—say three hours—to find a bug in as many papers as they can. The team that debugs the most claims wins the entrance fees.
Note that submissions can be “stealth”—you know your paper is wrong, but the bugs are very hard to find.
Present a Paper: People submit a deck for a ten-minute talk. Then each person is randomly assigned a deck and must give a talk based only on that deck. There will be an audience vote, and the best presenter will win a trophy.
Note there are two theory issues. The random assignment must be random but fixed-point free: no one can get their own deck. Also, since going last seems to give an unfair advantage, we suggest that each person get the deck only ten minutes before their talk. Thus all presenters would have the same time to prepare.
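For what it’s worth, a fixed-point-free random assignment (a random derangement) is easy to sample by rejection; this sketch is our own, not an official Fest procedure:

```python
from random import Random

def random_derangement(n, rng):
    """Sample a uniform fixed-point-free permutation of range(n) by rejection.

    A uniform random permutation is a derangement with probability about
    1/e, so the loop runs fewer than three times on average.
    """
    while True:
        perm = list(range(n))
        rng.shuffle(perm)
        if all(perm[i] != i for i in range(n)):
            return perm

rng = Random(42)
assignment = random_derangement(10, rng)   # assignment[i] = deck for presenter i
print(all(deck != i for i, deck in enumerate(assignment)))   # True
```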
Silent Auction For Co-authorship: We will set up a series of tables. On each table is a one page abstract of a paper. You get to bid as in a standard silent auction. The winner at each table becomes a co-author and pays their bid to STOC. The money could go to a student travel fund.
The A vs B Debate: Theory is divided into A and B, at least in many conferences. We will put together a blue-ribbon panel and have them discuss: Is A more important than B? We will ask that the panel be as snippy as possible—a great evening idea while all drink some free beer.
Betting: We will have a variety of topics from P=NP to quantum computation where various bets can be made.
Cantal Complexity: The Fest will mark the 40th anniversary of Donald Knuth’s famous paper, “The Complexity of Songs.” Evening sessions at a pub will provide unprecedented opportunity for applied research in this core area. Ken’s research, which he began with Dexter Kozen and others at the ICALP 1982 musicfest, eventually led to this.
Lemmas For Sale: In an eBay-like manner a lemma can be sold. We all have small insights that we will never publish, but they might be useful to others.
Zoo Excursion: This is not to the Montreal zoo—which is rather far—but to the Complexity Zoo which is housed elsewhere in Canada. Participants will take a virtual tour of all 535 classes. The prize for “collapsing” any two of them will be an instant STOC 2017 publication. In case of collapsing more than two, or actually finding a new separation of any pair of them, see under “Bug-a-thon” above.
Write It Up: This is a service-oriented activity. Many results have never been written up formally and submitted to journals. Often the reason is that the author(s) are busy with new research. This would be a list of such papers and an attempt to get students or others to write them up. This has actually happened many times already in an informal manner, so organizing it might be fun. We could use money to get people to sign up, or give a free registration to next year’s conference, for example.
GLL plans on gavel-to-gavel coverage of the Fest: we hope to have helpers who will allow us to make at least one post per day about the Fest. Anyone interested in being a helper should contact us here.
This will be especially appreciated because Ken will be traveling to a different conference in a voivodeship that abuts an oblast and two voblasts.
It takes a …
Sir Tim Berners-Lee is the latest winner of the ACM Turing Award. He was cited for “inventing the World Wide Web (WWW), the first web browser, and the fundamental protocols and algorithms allowing the web to scale.”
Today we congratulate Sir Tim on his award and review the work by which the Web flew out and floated wide.
Ken is the lead writer on this, and I (Dick) am just making a few small additions and changes: the phrase “flew out and floated wide” is due to Ken. Until a short while ago he was trapped in the real world where physical travel is still required. More exactly, he was trapped at JFK airport in New York, which many consider not the best airport to be stuck at. The WWW may be wonderful for work at a distance, but we sometimes have to get from here to there: in Ken’s case, from Buffalo, USA, to Madrid, Spain, for meetings on chess cheating.
While he was stuck at JFK he had the pleasure of using the free airport WiFi. Free is generally good but it yields messages on his browser window that say, “Waiting for response from…” Ken adds:
OK, that window has my Yahoo! fantasy baseball team—I’ll use Verizon 4G access on my cellphone if it doesn’t load. Happily I can write these words offline until Dick and I can resume collaborating on this post when I’m airborne.
Let’s wish Ken good travels out of JFK, and get back to Ken’s thoughts on this Turing Award.
Some have already noted that others besides Berners-Lee were integral to the early days of the Web: his partners at CERN including Robert Cailliau and also Marc Andreessen who wrote the Mosaic browser with Eric Bina and founded Netscape with Jim Clark. Usually we lean toward the “It Takes a Village” view of multiple credits. Last year we addressed whether Ralph Merkle should have been included in the Turing Award with Whitfield Diffie and Martin Hellman. There we signaled our feeling by including Merkle in the first line and photo.
Here, however, we note first that Berners-Lee not only conceived a flexible architecture for the Web, he originated a trifecta: the HTTP protocol, the HTML language, and the first browser design. The protocol included the specification for URLs—Uniform Resource Locators. Of course he had partners on these designs and tools, including counterparts involved in negotiating their adoption, but that brings us to our second point.
This is that Berners-Lee projected his will that the Web be open and free. Its core layers should be free of patent and copyright attachments. Service and access should as a first principle be equal everywhere. He convinced many others to share and implement these aspects of will.
Often an idea is invented multiple times. Often ideas remain just ideas because the technology is not ready to implement them. It’s curious to note that Berners-Lee, Steve Jobs, and Bill Gates share the same birth year: 1955. One wonders: did this help make things happen? Was there some confluence that made the ideas for the WWW all come together?
Ken notes that “As We May Think” is a 1945 essay by Vannevar Bush. Many of the ideas expressed by Bush are basic to the current WWW. Of course it was written ten years before Berners-Lee was even born, so it is not surprising that Bush did not invent the web.
Quoting Wikipedia:
“As We May Think” predicted (to some extent) many kinds of technology invented after its publication, including hypertext, personal computers, the Internet, the World Wide Web, speech recognition, and online encyclopedias such as Wikipedia: “Wholly new forms of encyclopedias will appear, ready-made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified.” Bush envisioned the ability to retrieve several articles or pictures on one screen, with the possibility of writing comments that could be stored and recalled together. He believed people would create links between related articles, thus mapping the thought process and path of each user and saving it for others to experience. Wikipedia is one example of how this vision has been realized, allowing users to link words to other related topics, while browser user history maps the trails of the various possible paths of interaction.
We applaud ACM for selecting this year’s winner—Sir Tim Berners-Lee. There are perhaps too many great researchers around for all to get the recognition they deserve. Oh well. In any event Dick and I thank Berners-Lee and everyone who made the WWW possible. Even as I (Ken) wait at JFK I can still help get this writing done, and can interact with Dick. Thanks to all who continue to make this work so well.
Could we go the way of telegraph operators?
Pixabay source |
Lofa Polir has sent us some new information that will have widespread ramifications for math and theory and science in general.
Today Ken and I wish to comment on this information.
Polir is sure that this information is correct. If he is right, the consequences for all of us will be immense.
His information is based on recent work of Sebastian Thrun—one of the world’s experts in machine learning. This week’s New Yorker has a feature article presenting Thrun’s work on replacing doctors who diagnose skin diseases. The article describes him thus:
Thrun, who grew up in Germany, is lean, with a shaved head and an air of comic exuberance; he looks like some fantastical fusion of Michel Foucault and Mr. Bean. Formerly a professor at Stanford, where he directed the Artificial Intelligence Lab, Thrun had gone off to start Google X, directing work on self-learning robots and driverless cars.
Thrun’s work is really interesting, and he has stated that medical schools should stop teaching doctors to read X-rays and other images, since robotic systems will soon be better at this. His system for skin images already beats expert doctors at detecting abnormal growths.
But this project along with his others is a smokescreen for his most important project, claims Polir. Thrun has put together a double-secret project that has been running for over five years. The project’s goal is: the automation of math and other sciences. Thrun predicts—well, let’s take a look at what he is doing first.
Thrun’s project is to use machine learning methods to build a system that can outperform us in doing science of all kinds. It requires huge amounts of data and he has access to that via the web. The strategy is an exact parallel of how Google DeepMind’s AlphaGo program was trained. Quoting our friends on Wikipedia regarding the latter:
The system’s neural networks were initially bootstrapped from human gameplay expertise. AlphaGo was initially trained to mimic human play by attempting to match the moves of expert players from recorded historical games, using a database of around 30 million moves. Once it had reached a certain degree of proficiency, it was trained further by being set to play large numbers of games against other instances of itself, using reinforcement learning to improve its play.
In place of reading and digesting master games of Go, Thrun’s system reads and digests scientific papers. The ability to have his algorithm “read” all papers in science is the secret:
Thrun points out that mathematicians in their lifetime may read and understand thousands of papers, but his system is capable of understanding millions of papers.
This ability is one of the reasons his algorithm will outperform us. Another is that it can use immense computational power 24/7. It never needs to sleep or rest. Polir claims that Google has made an entire secret data center of over a billion CPU cores available to this project. Under a closed agreement with the University of Wisconsin, the center is housed in the new IceCube neutrino observatory. Polir justifies revealing this on the grounds that it should be obvious—would they really hew out a cubic kilometer of ice in Antarctica just to observe neutrinos and ignore the cooling-cost benefits of placing a huge processing center in the cavity?
Old-time theorem provers used lots of axiom and proof rules. This kind of approach can only go yea-far. Homotopy type theory, which tries out a more topological approach, provided part of the inspiration to Thrun that he could find a better way. Another part was Roger Penrose’s argument that humans are less blocked by Kurt Gödel’s Incompleteness Theorems than logic-based systems are. So Thrun was spurred to start by making his machine learn from humans, much like AlphaGo.
In the New Yorker article—with extra information gleaned by Polir—Thrun describes the situation this way:
“Imagine an old-fashioned program to identify a dog,” he said. “A software engineer would write a thousand if-then-else statements: if it has ears, and a snout, and has hair, and is not a rat . . . and so forth, ad infinitum. But that’s not how a child learns to identify a dog, of course.” Logic-based proof systems work the same way, but that’s not really how we go about identifying a proof. Who checks modus ponens on every line? “The machine-learning algorithm, like the child, pulls information from a training set that has been classified. Here’s a dog, and here’s not a dog. It then extracts features from one set versus another.” Or like a grad student it learns: here’s a proof, and here’s not a proof. And, by testing itself against hundreds and thousands of theorems and proofs, it begins to create its own way to recognize a proof—again, the way a grad student does. It just knows how to do it.
Polir confirmed that Thrun’s machine first runs the papers through the kind of “Lint”-like module we posted about. This is not only a data-cleaning step but also primes the reinforcement learning module on the mathematical and scientific content.
Then comes a Monte Carlo phase in which the system randomly generates alternative proofs of lemmas in the papers and scores the proofs for economy and clarity. This completes the automated paper-rewriting level of their service, which is under negotiations with Springer-Verlag and Elsevier and other academic publishers for deals that may assure steady funding of the larger project. Finally, the results of these runs are input into the deep-learning stack, which infers the kinds of moves that are most likely to lead to correct proofs and profitable discoveries.
One of the predictions Thrun makes is that, as with doctors, we may soon need to rethink training students to get PhDs in math. He goes on to raise the idea that the machine will make such basic discoveries that it will win Nobel Prizes in the future.
The results of Thrun’s project are so far secret, and it is likely that he will deny that it is happening right now. But Polir found out one example of what has been accomplished already.
Particle physics of the Standard Model uses quite a few elementary particles. See this for a discussion.
These 31 elementary particles are the most fundamental constituents of the universe. They are not, as far as we know, made up of other particles. The proton, for example, is not an elementary particle, because it is made up of three quarks, whereas the electron is an elementary particle, because it seems to have no internal structure.
Although the Standard Model has worked impeccably in practice, it has higher complexity than physicists have expected from a bedrock theory of nature. The complexity comes from the large number of particles and the large number of constants that the model cannot predict.
A cluster of Thrun’s dedicated machines has already found a new model that reduces the number of elementary particles from 31 to 7. The code name for the cluster and its model, in homage to AlphaGo, is AlphaO. The AlphaO model is claimed to still make all the same predictions as the standard one, but the reduction in undetermined constants could be immensely important.
Is Polir fooling? He may be and not be at the same time. If you had told us a year-plus ago that AlphaGo would wipe out the world’s best Go players 60-0 in online semi-rapid games, we would have cried fool. The AlphaGo project is an example of a machine coming from nowhere to become the best in a game that people thought was beyond the ability of machines. Could it be soon the same with AlphaO? We will see.
Science meets bias and diversity
Deborah Belle is a psychology professor at Boston University (BU) who is interested in gender differences in social behavior. She has reported a shocking result about bias.
Today I thought I would discuss the issue of gender bias and also the related issue of the advantages of diversity.
Lately at Tech we have had a long email discussion on implicit bias and how we might do a better job of avoiding it in the future. My usual inclination is to think about such issues and see if there is some science behind our assumptions. One colleague stated:
The importance of diversity is beyond reasonable doubt, isn’t it?
I agree. But I am always looking for “proof.”
Do not get me wrong. I have always been for diversity. I helped hire the first female assistant professor to engineering at Princeton decades ago. And I have always felt that it is important to have more diversity in all aspects of computer science. But is there some science behind this belief? Or is it just axiomatic—something that we believe and needs no argument—that it is “beyond reasonable doubt?”
This is how I found Deborah Belle, while looking on the web for “proof.” I will just quote the BU Today article on her work:
Here’s an old riddle. If you haven’t heard it, give yourself time to answer before reading past this paragraph: a father and son are in a horrible car crash that kills the dad. The son is rushed to the hospital; just as he’s about to go under the knife, the surgeon says, “I can’t operate—that boy is my son!” Explain …
If you guessed that the surgeon is the boy’s gay, second father, you get a point for enlightenment… But did you also guess the surgeon could be the boy’s mother? If not, you’re part of a surprising majority.
In research conducted by Mikaela Wapman […] and Deborah Belle […], even young people and self-described feminists tended to overlook the possibility that the surgeon in the riddle was a she. The researchers ran the riddle by two groups: 197 BU psychology students and 103 children, ages 7 to 17, from Brookline summer camps.
In both groups, only a small minority of subjects—15 percent of the children and 14 percent of the BU students—came up with the mom’s-the-surgeon answer. Curiously, life experiences that might [prompt] the ‘mom’ answer “had no association with how one performed on the riddle,” Wapman says. For example, the BU student cohort, where women outnumbered men two-to-one, typically had mothers who were employed or were doctors—“and yet they had so much difficulty with this riddle,” says Belle. Self-described feminists did better, she says, but even so, 78 percent did not say the surgeon was the mother.
This shocked me. I have known this riddle forever, it seems, but I was surprised to see that it is still an issue. Ken recalls from his time in England in the 1980s that surgeons were elevated from being addressed as “Doctor X” to the title “Mister X.” No mention of any “Miss/Mrs/Ms” possibility then, but this is now. I think this demonstrates in a pretty stark manner how important it is to be aware of implicit bias. My word, things are worse than I ever thought.
I looked some more and discovered that there was, I believe, bias even in studies of bias. This may be even more shocking: top researchers into the importance of diversity have made implicit bias errors of their own. At least that is how I view their research.
Again I will quote an article, this time from Stanford:
In 2006 Margaret Neale of Stanford University, Gregory Northcraft of the University of Illinois at Urbana-Champaign and I set out to examine the impact of racial diversity on small decision-making groups in an experiment where sharing information was a requirement for success. Our subjects were undergraduate students taking business courses at the University of Illinois. We put together three-person groups—some consisting of all white members, others with two whites and one nonwhite member—and had them perform a murder mystery exercise. We made sure that all group members shared a common set of information, but we also gave each member important clues that only he or she knew. To find out who committed the murder, the group members would have to share all the information they collectively possessed during discussion. The groups with racial diversity significantly outperformed the groups with no racial diversity. Being with similar others leads us to think we all hold the same information and share the same perspective. This perspective, which stopped the all-white groups from effectively processing the information, is what hinders creativity and innovation.
Nice study. But why choose to study only all-white groups and groups of two whites and one nonwhite? What about the other two possibilities: all nonwhite, and two nonwhites and one white? Did this not even occur to the researchers? I could imagine that the all-nonwhite groups do the best, or that two nonwhites and one white do the worst. Who knows? The sin here seems to be not even considering all four combinations.
Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai have a recent paper in NIPS with the wonderful title, “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.”
Again we will simply quote the paper:
The blind application of machine learning runs the risk of amplifying biases present in data. Such a danger is facing us with word embedding, a popular framework to represent text data as vectors, which has been used in many machine learning and natural language processing tasks. We show that even word embeddings trained on Google News articles exhibit female/male gender stereotypes to a disturbing extent. This raises concerns because their widespread use, as we describe, often tends to amplify these biases. Geometrically, gender bias is first shown to be captured by a direction in the word embedding. Second, gender neutral words are shown to be linearly separable from gender definition words in the word embedding. Using these properties, we provide a methodology for modifying an embedding to remove gender stereotypes, such as the association between the words receptionist and female, while maintaining desired associations such as between the words queen and female. Using crowd-worker evaluation as well as standard benchmarks, we empirically demonstrate that our algorithms significantly reduce gender bias in embeddings while preserving its useful properties such as the ability to cluster related concepts and to solve analogy tasks. The resulting embeddings can be used in applications without amplifying gender bias.
Here is one of their examples. Suppose we want to fill X in the analogy, “he is to doctor as she is to X.” A typical embedding prior to their algorithm may return X = nurse. Their hard-debiasing algorithm finds X = physician. Yet it recognizes cases where gender distinctions should be preserved, e.g., given “she is to ovarian cancer as he is to Y,” it fills in Y = prostate cancer. Their results show that their hard-debiasing algorithm performs significantly better than a “soft-debiasing” approach and performs as well or nearly as well on benchmarks apart from gender bias.
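To make the mechanics concrete, here is a toy sketch of both the analogy query and the hard-debiasing idea. The three-dimensional vectors are invented for illustration (dimension 0 plays the role of the learned gender direction); a real embedding has hundreds of dimensions, and the paper's algorithm also equalizes word pairs, which this sketch omits.

```python
from math import sqrt

# Toy 3-d "embedding" -- vectors invented for illustration only.
# Dimension 0 loosely plays the role of the learned gender direction;
# dimensions 1-2 stand for all other semantic content.
E = {
    "he":        [ 1.0, 0.0, 0.0],
    "she":       [-1.0, 0.0, 0.0],
    "doctor":    [ 0.6, 0.9, 0.3],
    "nurse":     [-0.8, 0.7, 0.4],
    "physician": [ 0.1, 0.9, 0.3],
}

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(a * a for a in v)))

def analogy(a, b, c):
    """Answer 'a is to b as c is to X': nearest word to b - a + c."""
    target = [bb - aa + cc for aa, bb, cc in zip(E[a], E[b], E[c])]
    return max(set(E) - {a, b, c}, key=lambda w: cos(E[w], target))

def debias(neutral_words, g=(1.0, 0.0, 0.0)):
    """Hard-debias sketch: remove each neutral word's component along g."""
    norm2 = sum(x * x for x in g)
    for w in neutral_words:
        proj = sum(a * b for a, b in zip(E[w], g)) / norm2
        E[w] = [a - proj * b for a, b in zip(E[w], g)]

print(analogy("he", "doctor", "she"))   # biased embedding answers: nurse
debias(["doctor", "nurse", "physician"])
print(analogy("he", "doctor", "she"))   # after debiasing: physician
```

The point of the sketch is only the geometry: the biased query lands on nurse because nurse shares the she-ward gender component, and zeroing that component for the gender-neutral words flips the answer to physician.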
Overall, however, many have noted that machine learning algorithms are inhaling the bias that exists in lexical sources they data-mine. ProPublica has a whole series on this, including the article, “Breaking the Black Box: How Machines Learn to be Racist.” And sexist, we can add. The examples are not just linguistic—they include real policy decisions and actions that are biased.
Ken wonders whether aiming for parity in language will ever be effective in offsetting bias. Putting more weight in the center doesn’t achieve balance when all the other weight is on one side.
The e-mail thread among my colleagues centered on the recent magazine cover story in The Atlantic, “Why is Silicon Valley so Awful to Women?” The story includes this anecdote:
When [Tracy] Chou discovered a significant flaw in [her] company’s code and pointed it out, her engineering team dismissed her concerns, saying that they had been using the code for long enough that any problems would have been discovered. Chou persisted, saying she could demonstrate the conditions under which the bug was triggered. Finally, a male co-worker saw that she was right and raised the alarm, whereupon people in the office began to listen. Chou told her team that she knew how to fix the flaw; skeptical, they told her to have two other engineers review the changes and sign off on them, an unusual precaution.
One of my colleagues went on to ascribe the ‘horribleness’ of many computer systems in everyday use to the “brusque masculinism” of their creation. This leads me to wonder: can we find the “proof” I want by making a study of the possibility that “men are buggier”—or more solidly put, that gender diversity improves software development?
Recall Ken wrote a post on themes connected to his department’s Distinguished Speaker series for attracting women into computing. The series includes our own Ellen Zegura on April 22. The post includes Margaret Hamilton and her work for NASA’s Apollo missions, including the iconic photo of the stack of her code being taller than she. Arguments over the extent of Hamilton’s role can perhaps be resolved from sources listed here and here, but there is primary confirmation of her strong hand in code that had to be bug-free before deployment.
We recently posted our amazement at large-scale consequences of bugs of the kind covered in underclass college courses, such as overflowing a buffer. Perhaps one can do a study of gender and project bugs from college or business applications where large data could be made available. The closest large-scale study we’ve found analyzed acceptance rates of coding suggestions (“pull requests”) from over 1.4 million users of GitHub (news summary), but this is not the same thing as analyzing design thoroughness and bug rates. Nor is anything like this getting at the benefits of having men and women teamed together on projects, or at least in a mutual consulting capacity.
It is easy to find sources from a year ago hailing that study in terms like “Women are Better Coders Than Men…” Ordinarily that kind of “hype” repulses Ken and me, but Ken says maybe this lever has a rock to stand on. What if we ‘think different’ and embrace gender bias by positing that women approach software in significantly different ways, where having such differences on a team is demonstrably helpful?
What would constitute “proof” that gender diversity is concretely helpful?
Littlewood’s Law and Big Data
“Leprechaun-proofing” data source |
Neil L. is a leprechaun. He has visited Dick on St. Patrick’s Day or the evening before many times. Up until this night I had never seen him.
Today, Neil’s message is more important than ever.
With over a foot of snow in Buffalo this week and the wind still howling, I was not expecting anything green. Long after Debbie had gone to bed, I was enmeshed in the “big–data blues” that have haunted me since summer and before. I was so fixated it took me more than a few seconds to realize that wisps of green smoke floating between me and the computer screen were something I should investigate.
There on our kitchen-study divider sat Neil. He looked like the pictures Dick had posted of him, but frazzled. He cleaned his pipe into a big Notre Dame coffee mug I got as a gift. I’d had it out since Princeton went up against Notre Dame in “March Madness”—my Tigers missed a chance for a big upset in the closing five seconds. As if reading my mind, he remarked how the tournament always produces upsets in the first round:
“If there be no unusual results, ‘twould be most unusual.”
The Neil whom Dick described would have said this with wry mirth, but he sounded weary as if he had a dozen mouths to feed. I fired up the kettle and brought out the matching mug to offer tea or coffee, but he pointed to his hip flask and said “it’s better against the cold.”
That prompted me to ask, “Why didn’t you visit Dick? He and Kathryn have been enjoying sun at this week’s Bellairs workshop on Barbados.” I had been there two years ago when Neil had taken great pains to track Dick down. Neil puffed and replied, “Same reason I didn’t try finding you there back then—too far afield for a big family man.” The word “family” struck me as our dog Zoey, who had stayed sleeping in her computer-side bed at my feet, woke up to give Neil a barkless greeting. Of course, even leprechauns have relations…
Nodding to pictures of our children on the wall, I asked Neil how many he had. He took a long puff and replied:
“Several thousand. It’s too hard to keep count nowadays.”
Now Zoey barked, and this covered my gasp. Knowing that Neil was several centuries old, I did some mental arithmetic, but concluded he would still need a sizable harem. Reading my mind again, Neil cut in:
“Not as ye mortals do. What d’ye think we’re made of?”
I reached out to touch him, but Neil leaned away and vanished. A moment later he popped back and folded his arms, waiting for me to reply. I realized, ah, he is made of spirit matter. What can that be? Only one thing in this world it could be: information.
“Tá tú ceart” he whistled. “Right. And some o’ yer sages wit ye mortals have some o’ the same stuff. Max Tegmark, for one, wrote:
“… consciousness is the way information feels when being processed.”
And Roger Penrose has just founded a new institute on similar premises—up front he says chess holds a key to human consciousness so you of all people should know whereof I speak.”
Indeed, I had to nod. He continued, “And information has been growing faster than Moore’s Law. Hard to keep up…” The last words came with a puff of manly pride.
“Information is leprechauns??” I blurted out. The propagation of “fake news” and outright falsehoods in recent months has been hard enough to take, but this boiled me over. I wanted to challenge Neil—and I recalled the protocol followed by William Hamilton’s wife: glove and shamrock at his feet. Well, I don’t wear gloves even in zero-degree weather, and good luck my finding a shamrock under two feet of snow. So I asked in a level voice, “can you give me some examples?”
Neil puffed and replied, “Not that information be us, but it bears us. And more and more ye can get to know us by reading your information carefully. But alas, more and more ye are confusing us with aliens.”
“Aliens?” This was all too much, and the dog wanted out. But Neil was happy to flit alongside me as I opened the door to the yard for her. He explained in simple tones:
“Ye have been reading the sky for many decades listening for alien intelligence. Up to last year ye had maybe one possible instance in 1977—apart from Nikola Tesla, who knew us well. But now reports are coming fast and furious. Not just fleeting sequences but recurrent ‘fast radio bursts’ observed in papers and discussed even this week by scientists from Harvard. Why so many now?”
I was quick to answer: “Because we are reading so much more data now.” Neil clapped his hands—I expected something to materialize by magic but he was just affirming my reply. I hedged, “But surely we understand the natural variation?” Neil retorted:
“Such understanding didn’t prevent over 500 physics papers being written on a months-long fluctuation at the Large Hadron Collider before it dissolved last summer.”
Indeed, the so-called diphoton anomaly had seemed on its way to confirmation because two separate experiments at the LHC were seeing it. An earlier LHC anomaly about so-called “penguin decays” has persisted since 2013 with seemingly no conclusion.
As I let the dog back in and toweled snow off her, I reflected: what was wrong with those 500 physics papers? A particle beyond the Standard Model would be the pot of gold at the end of a rainbow not only for many researchers but human knowledge on the whole. Then I remembered whom I was speaking with. Once free of the towel, Zoey scooted away, and I regrouped. I turned to Neil and said, “There is huge work on anomaly detection and data cleansing to identify and remove spurious data. Surely we are scaling that up as needed…”
Neil took a long drag on his pipe and arched up:
“I be not talking o’ bad data points but whole data sets, me lad.”
I sank into an armchair and an electrical voltage drop dimmed the lights as Neil took over, perched again on the divider. “Ye know John Littlewood’s law of a miracle per month, indeed you wrote a post on it. If ye do a million things or observe a million things, one o’ them is bound to be a million-to-one shot.”
I nodded, already aware of his point.
“No different ’tis with data sets. One in a million be one-in-a-million bad. A thousand in a million be—begorra—one-in-a-thousand bad. Or too good. If ye ha’e 50,000 companies and agencies and research groups doing upwards of 20 data sets each, that’s wha’ ye have. Moreover—”
Neil leaned forward enough to fall off the counter but of course he didn’t fall.
“All the cleansing, all the cross-validation ye do, all the confirmation ye believe, is merely brought inside this reckoning. All that also changes the community standards, and by those standards ye’re still one-in-a-million, one-in-a-thousand off. Now ye may say, 999-in-a-thousand are good, a fair run o’ the mill. But think of the impacts. Runs o’ the mill have run o’ the mill effects, but the stark ones, hoo–ee.”
He whistled. “The impacts of the ones we choose to reside in scale a thousand-to-one stronger, a million-to-one… An’ that is how we keep up a constant level of influence in affairs o’ the world. All o’ the world—yer hard science as well as social data.”
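Neil's arithmetic is easy to check. A back-of-envelope sketch, using his own figures of 50,000 organizations and roughly 20 data sets each:

```python
import math

# Chance of seeing at least one million-to-one event in a million
# independent observations -- Littlewood's law at data-set scale.
p_event = 1e-6
n_obs = 10**6
p_at_least_one = 1 - (1 - p_event)**n_obs   # approaches 1 - 1/e ~ 0.632

# Neil's census: 50,000 organizations, ~20 data sets each.
n_sets = 50_000 * 20                 # 1,000,000 data sets in play
expected_off_by_1e6 = n_sets * 1e-6  # ~1 set that is one-in-a-million off
expected_off_by_1e3 = n_sets * 1e-3  # ~1,000 sets one-in-a-thousand off

print(round(p_at_least_one, 3))  # 0.632
```

So by his own probability bounds, about a thousand of those million data sets are one-in-a-thousand off, with no individually detectable cause.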
I thought of something important: “If you lot choose to commandeer one data set, does that give you free rein to infect another of the same kind?”
“Nae—ye know from Dick’s accounts, we must do our work within the bounds of probability. So if ye get a whiff of us or even espy us, ye can take double the data without fear of us. But—then ye be subject to the most subtle kind of sampling bias, which is the bias of deciding when to stop sampling.”
After the terrible anomaly I showed in December from four data points of chess players rated near 2200, 2250, 2300, and 2350 on the Elo scale, I had spent much of January filling in 2225, 2275, 2325, and 2375, which improved the picture quite a lot. Of course I ran all the quarter-century marks from Elo 1025 to Elo 2775, over three million more moves in all. But instead of feeling pride, after Neil’s last point I looked down at the floor.
His final words were gentle:
“Cheer up lad, it not only could be worse, it would be worse. Another o’ your sages, Nassim Taleb, has pointed out what he calls the ‘tragedy of big data’: spurious correlations and falsity grow faster than information. See that article’s graphic, which looks quadratic or at any rate convex. Then be ye thankful, for we Leprechauns are hard at work keeping the troubles down to linear. But this needs many more of us, lad, so I must be parting anon.”
And with a pop he was gone.
Is Neil right? What examples might you know of big data sets suspected of being anomalous not for any known systematic reason but just the “luck of the draw”?
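One way to see Taleb's point concretely: in a data set with n variables there are n(n-1)/2 pairwise correlations to test, so pure-noise "findings" scale quadratically while the variables themselves grow only linearly. A sketch, with the per-test false-positive rate of 0.01 an assumption for illustration:

```python
def pair_tests(n_vars):
    """Distinct pairwise correlations available among n_vars variables."""
    return n_vars * (n_vars - 1) // 2

ALPHA = 0.01  # assumed per-test false-positive rate, purely illustrative

for n in (100, 1_000, 10_000):
    m = pair_tests(n)
    # Even if every variable is pure noise, about ALPHA * m pairs will
    # look "significantly" correlated.
    print(n, "variables:", m, "pair tests, ~", round(ALPHA * m), "spurious hits")
```

Ten times the variables means a hundred times the spurious correlations, which is the convex growth Taleb's graphic depicts.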
Happy St. Patrick’s Day anyway.
Holly Dragoo, Yacin Nadji, Joel Odom, Chris Roberts, and Stone Tillotson are experts in computer security. They recently were featured in the GIT newsletter Cybersecurity Commentary.
Today, Ken and I consider how their comments raise a basic issue about cybersecurity. Simply put:
Is it possible?
In the column, they discuss various security breaks that recently happened to real systems. Here are some abstracts from their reports:
The last is an attempt to make attacks harder by using randomization to move around key pieces of systems data. It seems like a good idea, but Dan Boneh and his co-authors have shown that it can be broken. The group is Hovav Shacham, Eu-Jin Goh, Matthew Page, Nagendra Modadugu, and Ben Pfaff.
Here we talk about the first item at length, plus another item by Odom on the breaking of a famous hash function.
With all due respect to a famous song by Sonny Bono and Cherilyn Sarkisian, “The Beat Goes On”: I have changed it some, but I think it captures the current situation in cybersecurity.
The breaks go on, the breaks go on
Drums keep pounding
A rhythm to the brain
La de da de de, la de da de da
Laptops was once the rage, uh huh
History has turned the page, uh huh
The iPhone’s the current thing, uh huh
Android is our newborn king, uh huh
[Chorus]
A definition of insanity ascribed to Albert Einstein goes:
Insanity is doing the same thing over and over again and expecting different results.
I wonder lately whether we are all insane when it comes to security. Break-ins to systems continue; if anything they are increasing in frequency. Some of the attacks are so basic that their success is incredible. One example is an attack on a company whose business is supplying security to its customers. Some of the attacks use methods that have been known for decades.
Ken especially joined me in being shocked about one low-level detail in the recent “Cloudbleed” bug. The company affected, Cloudflare, posted an article tracing the breach ultimately to this line of code, which was auto-generated using a well-known parser-generator called Ragel:
if ( ++p == pe ) goto _test_eof;
The pointer p is in client hands, while pe is a system pointer marking the end of a buffer. It looks like p can only be incremented one memory unit at a time, so that it will eventually compare equal to pe and cause control to jump out of the region where the client can govern HTML being processed. Wrong. Other parts of the code make it possible to enter this test with p > pe, which allows undetected access to unprotected blocks of memory. The result was not just a leak of memory: private information could be exposed.
The bug was avoidable by rewriting the code-generator so that it would give:
if ( ++p >= pe ) goto _test_eof;
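The difference between the two tests can be simulated in miniature. The sketch below uses integer indices in place of pointers and is hypothetical, not Cloudflare's parser: once the index ever skips past the boundary, the equality test never fires again, while the >= test still stops the scan.

```python
def scan(buffer_len, steps, safe):
    """Advance index p toward boundary pe by the given steps; return how
    many cells past the end of the buffer get touched before the bounds
    check fires."""
    p, pe = 0, buffer_len
    overrun = 0
    for step in steps:
        p += step
        if (p >= pe) if safe else (p == pe):
            break                # bounds check fires, scanning stops
        if p > pe:
            overrun += 1         # unsafe variant keeps reading past the end
    return overrun

# One malformed token advances p by 2, hopping over the boundary.
steps = [1] * 9 + [2] + [1] * 4

print(scan(10, steps, safe=False))  # '==' never fires: prints 5
print(scan(10, steps, safe=True))   # '>=' stops at the boundary: prints 0
```

In C the overrun is not a counter but reads of whatever memory happens to follow the buffer, which is exactly how private data leaked.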
But we have a more basic question:
Why are such low-level bits of 1960s-vintage code carrying such high-level responsibility for security?
There are oodles of such lines in deployed applications. They are not even up to the level of the standard C++ library which gives only == and != tests for basic iterators but at least enforces that the iterator must either be within the bounds of the data structure or must be on the end. Sophisticated analyzers help to find many bugs, but can they keep pace with the sheer volume of code?
Note: this code was auto-generated, so we not only have to debug actual code but potential code as well. The Cloudflare article makes clear that the bug turned from latent to actual only after a combination of other changes in the surrounding code and usage patterns. It concludes with “Some Lessons”:
The engineers working on the new HTML parser had been so worried about bugs affecting our service that they had spent hours verifying that it did not contain security problems.
Unfortunately, it was the ancient piece of software that contained a latent security problem and that problem only showed up as we were in the process of migrating away from it. Our internal infosec team is now undertaking a project to fuzz older software looking for potential other security problems.
While admitting our lack of expertise in this area, we feel bound to query:
How do we know that today’s software won’t be tomorrow’s “older software” that will need to be “fuzzed” to look for potential security problems?
We are still writing in low-level code. That’s the “insanity” part.
My GIT colleagues also comment on Google’s recent announcement two weeks ago of feasible production of collisions in the SHA-1 hash function. Google fashioned two PDF files with identical hashes, meaning that once a system has accepted one, the other can be maliciously substituted. They say:
It is now practically possible to craft two colliding PDF files and obtain a SHA-1 digital signature on the first PDF file which can also be abused as a valid signature on the second PDF file… [so that e.g.] it is possible to trick someone to create a valid signature for a high-rent contract by having him or her sign a low-rent contract.
Now SHA-1 had been under clouds for a dozen years already, since the first demonstration that collisions can be found with expectation faster than brute force. It is, however, still being used. For instance, Microsoft’s sunset plan called for its phase 2-of-3 to be enacted in mid-2017. Google, Mozilla, and Apple have been doing similarly with their browser certificates. Perhaps the new exploit will force the sunsets into a total eclipse.
Besides SHA-2 there is SHA-3, which is the current gold standard. As with SHA-2 it comes in several digest sizes: 224, 256, 384, or 512 bits, whereas SHA-1 gives only 160 bits. Lengthening the digest does ramp up the time for the generic attacks that have been conceived exponentially: a birthday attack on an n-bit hash takes about 2^{n/2} work, so doubling n squares the cost. Still, the exploit shows what theoretical advances plus unprecedented power of computation can do. Odom shows the big picture in a foreboding chart.
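The digest sizes, and the birthday-bound arithmetic, can be checked directly with Python's standard hashlib module (the message below is arbitrary):

```python
import hashlib

msg = b"attack at dawn"

d1 = hashlib.sha1(msg).hexdigest()
d2 = hashlib.sha256(msg).hexdigest()
d3 = hashlib.sha3_256(msg).hexdigest()

# Hex digests carry 4 bits per character.
print("SHA-1:   ", len(d1) * 4, "bits")   # 160 bits
print("SHA-256: ", len(d2) * 4, "bits")   # 256 bits
print("SHA3-256:", len(d3) * 4, "bits")   # 256 bits

# A generic birthday attack on an n-bit hash needs about 2^(n/2) hash
# evaluations, so going from 160 to 256 bits raises the generic cost
# from ~2^80 (now within reach, as the SHA-1 collision shows) to ~2^128.
```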
Is security really possible? Or are we all insane?
Ken thinks there are two classes of parallel universes. In one class, the sentient beings originally developed programming languages in which variables were mutable by default and one needed an extra fussy and forgettable keyword like const to make them constant. In the other class, they first thought of languages in which identifiers denoted ideal Platonic objects and the keyword mutable had to be added to make them changeable.
The latter enjoyed the advantage that safer and sleeker code became the lazy coder’s default. The mutable strain was treated as a logical subclass in accord with the substitution principle. Logical relations like Square “Is-A” Rectangle held without entailing that Square.Mutable be a subclass of Rectangle.Mutable, and this further permitted retroactive abstraction via “superclassing.” They developed safe structures for security and dominated their light cones. The first class was doomed.
YouTube source |
Maurice Ashley is an American chess grandmaster. He played in the 2003 US Championship. He coached two youth teams from Harlem to national championships and played himself in one scene of the movie Brooklyn Castle. He created a TEDYouth video titled, “Working Backward to Solve Problems.”
Today we discuss retrograde analysis in chess and other problems, including one of my own.
Raymond Smullyan popularized retrograde chess puzzles in his 1979 book The Chess Mysteries of Sherlock Holmes and its 1981 sequel, The Chess Mysteries of the Arabian Knights. Here is the second example in the first book—except that I’ve added a white pawn on b4. What were the last three moves—two by White and one by Black—and can we tell the two moves before that?
Not only is Black checkmated, Black is outgunned on material. The puzzles do not try to be fair and many are “unnatural” as game positions. The point is that the positions are legal: they can occur in games from the starting position of chess, but only in certain ways that the solver must deduce.
White’s last move must have uncovered check from the white bishop on h1 because the bishop cannot have moved there. The uncovering could have come from White’s pawn on g2 capturing a black piece on h3 except that there is no way a bishop can get to h1 locked behind a white pawn on g2. So it must have been from the pawn on d6. If the last move were d5-d6 discovered check, then what was Black’s previous move? Black’s king cannot come from being adjacent to White’s king on b8 or b7, so it came from a7—but there it would have been in an impossible double check. The whole setup looks impossible, until we realize that White could have captured a black pawn en-passant after it moved from d7 to d5 to block the check. Play could have unfolded from this position:
The game could have gone 1. Bd6-c5+ Ka7-a8 2. e4-e5+ d7-d5 3. exd6 en-passant and checkmate. So the last three moves must have been 2. e4-e5+ d7-d5 3. exd6. But what about the first two? Suppose we were told that the checkmating move is the first time a white piece ever occupied the square d6. So the game didn’t go this way. Could it have gone another way that obeys this extra condition? The answer is at the end.
Ashley’s video raises the idea of retrograde analysis for planning and problem solving in life. If your goal is G, then working backward from G can tell you the subgoals needed to achieve it. The business executive Justin Bariso expanded on Ashley’s video even further in a neat post on his own blog. Bariso recommends to “plan your project backward,” opining that with the way things are usually done,
More time and money are scheduled for initial steps than are really needed.
Here is an example from Smullyan’s book which was also featured in a 2011 video by former world women’s champion Alexandra Kosteniuk:
How can this position come about—in particular, where was White’s queen captured? Focusing on the main events—what was captured on b3, e6, and h6, and in what order?—helps to plan the play. Incidentally, Kosteniuk was a hard-luck loser this past Friday in the semifinals of this year’s women’s world championship in Tehran.
What matters often in research, though, is finding the most propitious initial steps—and the time budget is open-ended. Often we build a new tool and set out to prove theorems whose statements we might not know in advance. Yet for a statement S such as “P = NP,” the question, “Is S a theorem?” is a classic retrograde problem: Proof steps are the moves. A legal move either instantiates an axiom or follows by a rule of deduction from one or more earlier steps. Undecidability for the underlying formal system means there is no procedure to tell whether any given position is legal.
How might retrograde analysis help in complexity theory? Dick and I once ventured a notion of “progressive” algorithms. Maybe it can be supplemented by analyzing some necessary “regressive” behavior of algorithms. Or more simply put, using retrograde analysis to show that a statement is impossible may be what’s needed to prove P ≠ NP.
I taught a hybrid algorithms-and-complexity short-course at the University of Calcutta last August. One of my points was to present breadth-first search (BFS) and depth-first search (DFS) as paradigms that correspond to the complexity classes NL and P in particular. I illustrated more complicated algorithms that run in deterministic or nondeterministic logspace and explained how in principle they could be reduced to one call to BFS.
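For reference, the two search paradigms in minimal form, on a made-up four-node graph:

```python
from collections import deque

def bfs(graph, start):
    """Breadth-first: visit nodes in nondecreasing distance from start."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in graph.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return order

def dfs(graph, start):
    """Depth-first: follow one branch as deep as possible, then backtrack."""
    seen, order, stack = set(), [], [start]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        order.append(u)
        for v in reversed(graph.get(u, [])):  # keep left-to-right order
            stack.append(v)
    return order

G = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
print(bfs(G, "s"))  # ['s', 'a', 'b', 'c']
print(dfs(G, "s"))  # ['s', 'a', 'c', 'b']
```

The only difference in code is the queue versus the stack, which is why the two serve so neatly as paradigms.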
I presented pebbling in the guise of Monty Python’s version of King Arthur and his fellow “riders.” They wish to conquer a network of towns connected by one-way roads. Each town t has a defensive strength d(t). To conquer t, the riders need to occupy at least d(t) other towns with roads coming into t; then any rider may occupy t. A rider can leave a town and go “in-country” at any time, but to re-enter a previously conquered town, that town must be re-conquered. Source nodes can be freely (re-)occupied, and multiple riders can occupy the same node. Given a graph with source(s) s and goal node f, and an integer k, the question is:
Can f be conquered by k riders starting on s, and if not, what is the minimum k?
The case of pebbling, strictly speaking, is when d(t) always equals the node’s in-degree, while the case d(t) = 1 everywhere is basically BFS. Various forms of pebbling have been studied and all were instrumental to various complexity results. Here is a simple example:
The answer is 3: Three riders can conquer in the order , then moves to and enters. This gives the strength needed for to conquer , but needs to be re-conquered. This is done by starting again from to , and finally falls.
Now picture the following graph in which every OR gate has strength 1 and every AND gate has strength 2. The gate labeled b is undetermined. If it is OR, then gate f can be conquered by three riders as follows: Two riders starting at the sources first conquer gate a, and then, using the free entry into gate b, have a conquer d. Then the rider on b rides back and helps the third rider conquer e in like manner. Finally the riders on d and e conquer f. That was easy—but what if b is an AND gate? Can three riders still do it? You may wish to ponder before looking below.
I gave this as part of a take-home final to over 30 students in the course, saying to argue that when b is an AND gate, three riders are not enough. Almost all tried various forms of forward reasoning in their proofs. Many such proofs were incomplete, for instance not considering that riders could rewind to the start.
Only a few found the neatest proof I know, which is retrograde: Before f is conquered, there must be riders on d and e. One of those must have been the last to arrive, say e. This means the immediately previous step had riders on d, b, and c. The only move before that must have been from a conquering d, so we have proved the necessity of the configuration (a,b,c). And this configuration has no predecessor. So three riders cannot conquer f.
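Small instances of the riders question can also be settled by brute-force search over placements of the (interchangeable) riders. The sketch below encodes my own reading of the rules above, and runs on a made-up diamond graph rather than the circuit in the figure:

```python
from collections import deque

def min_riders(graph, strength, sources, goal, k_max=6):
    """Search the riders game by BFS over placements of interchangeable
    riders.  graph: node -> list of successors; strength: node -> d(t);
    sources: freely occupiable nodes.  Returns the least k <= k_max whose
    riders can occupy goal, else None."""
    preds = {v: [] for v in graph}
    for u, vs in graph.items():
        for v in vs:
            preds[v].append(u)

    def conquerable(t, occ):
        # One reading of the rules: the strength check happens before any
        # rider moves, and a rider standing on an in-neighbor counts.
        if t in sources:
            return True
        return sum(1 for p in preds[t] if p in occ) >= strength[t]

    def successors(state, k):
        occ = list(state)
        out = set()
        for t in graph:
            if conquerable(t, occ):
                if len(occ) < k:                  # a rider enters from off-board
                    out.add(tuple(sorted(occ + [t])))
                for i, u in enumerate(occ):       # a rider rides over from u
                    if u != t:
                        out.add(tuple(sorted(occ[:i] + occ[i+1:] + [t])))
        for i in range(len(occ)):                 # a rider goes "in-country"
            out.add(tuple(sorted(occ[:i] + occ[i+1:])))
        return out

    for k in range(1, k_max + 1):
        seen, queue = {()}, deque([()])
        while queue:
            state = queue.popleft()
            if goal in state:
                return k
            for nxt in successors(state, k):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return None

# Hypothetical diamond instance: f needs both a and b occupied at once,
# so one rider can never do it, but two riders can.
G = {"s": ["a", "b"], "a": ["f"], "b": ["f"], "f": []}
d = {"a": 1, "b": 1, "f": 2}
print(min_riders(G, d, sources={"s"}, goal="f"))  # prints 2
```

Such exhaustive search is hopeless at scale, which is exactly why a general retrograde lower-bound argument would be valuable.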
Can a more-general, more-powerful lower bound technique be built from this kind of retrograde reasoning?
Chess answers: In the first puzzle, if no White piece had previously occupied d6, there is still a way the game could go. Look at the second diagram and picture White’s bishop on c5 with a knight on b6. White can play knight to the corner discovering check and Black’s king can take it, giving the overall moves 1. Nb6-a8+ Ka7xa8 2. e4-e5+ d7-d5 3. exd6 en-passant and checkmate.
In the second chess puzzle, the only piece White could have captured on b3 was Black’s queenside bishop. In order for it to leave its initial square c8, however, Black needed to capture on e6. The only White piece able to give itself up there was the missing knight, because White’s queen could not escape until the move a2xb3 happened. So Black’s capture on h6 was of White’s queen. I can find a legal game reaching this position (with White to play) in 18 moves; is that the minimum?
Retrograde chess puzzles become far more intricate than the examples in this column suggest. Besides Smullyan’s books, the great trove for them is maintained by Angela and Otto Janko here. Joe Kisenwether has some examples from games other than chess.