MultipleCredit Tests
Can chess statistics help design multiplechoice exams?
UT Dallas source 
Bruce Pandolfini is one of very few living people who have been played by an Oscarwinning actor. He was the reallife chess teacher of junior chess player Joshua Waitzkin, who went on to become the world champion—in T’ai Chi. Their story is told in the movie “Searching For Bobby Fischer” with Sir Ben Kingsley, which is still iconic after 20plus years. Pandolfini is still doing what he loves as an active chess teacher in New York City. For much of this time he has also written a popular feature called “Solitaire Chess” for Chess Life magazine, which is published by the United States Chess Federation.
Today Dick and I wish to compare styles of multiplechoice exams, with reference to “Solitaire Chess,” and have some fun as well.
Most multiplechoice questions are designed to have a unique correct answer, with all other answers receiving 0 points or even a minus. This is like a chess problem of “find the winning move” type. Matein2, matein3, and endgame problems generally have unique answers—a “dual” solution is an esthetic blemish. There are several popular websites devoted to this kind of chess puzzle, which is great for honing one’s tactical ability.
“Solitaire Chess” is different, with more emphasis on strategy. The reader takes the winning side of a notable game Bruce has prepared, and chooses his/her move before revealing the answer and the opponent’s next move. It simulates the feeling of playing a master game.
Incidentally Bruce recently attended the wedding of another master player represented in the movie, the reallife Asa Hoffman to the former Virginia LoPresto, who also remember me from New York tournaments in the 1970’s. Among children Bruce has coached in preteen years is the world’s current 5thranked player, Fabiano Caruana of Brooklyn and Italy.
Solitaire Chess
The difference we emphasize is that the game positions often give partial or even full credit for alternative choices. For example, here is the position at move 22 in the March 2014 Chess Life column, of a game that was played in 1934 by Fred Reinfeld, an earlier master teacher who wrote many great books until the 1960’s. White is to play:
The top score of 5 points goes to the capture move 22.axb4, but the alternatives 22.Nc6 and 22.Ne6+ are deemed almost as good, worth 4 points each, while the noncapture move 22.a4 still gets 2 points. Several other game turns have 3point partial credits. At the end is a chart connecting your total score over all the moves to the standard chess rating scale devised by Arpad Elo. For instance, 81–94 points is deemed the range of a 2200–2399 master player such as myself, while 36–50 is for a “good club player” with 1600–1799 rating.
Pandolfini’s expert judgment goes into setting both the partial credits and the overall assessment scale. Although chess positions often have 30–50 legal moves or even more, there are typically at most 3–5 moves worth considering, so this is like a standard multiplechoice test in that way. The partial credits, however, are more typical of ranking applications such as judging the value of searchengine hits, where there are 10, 20, 30, or hundreds or thousands of choices to consider. Our topic is about having the best of both kinds of application, and how to do the assessment scientifically.
But Let’s Have Fun
Well we guess you didn’t come to a blog to take an exam, so we’ll try to make at least the first part fun, before we introduce more “strategic” questions with partial credits. You are on your honor not to Google the answers—we can tell of course; we won’t tell you how we know but our heartbleeds for you.
 PCP stands for which of the following?
 Post Correspondence Problem
 Provably Checkable Proofs
 A recreational drug
 Probabilistically Checkable Proofs
 All the above
 Some of the above
 None of the above.
 The first Turing Award Winner was—?
 Alan Turing, posthumously for Turing Machines
 John Backus for Fortran
 Alan Perlis for Algol
 None of the above.
 By the terms of the Clay Millennium Prize, it would be awarded for—?
 A proof that .
 A proof that .
 A proof that is independent of set theory.
 All the above
 Some of the above
 None of the above.
 stands for:
 Not Polynomial (time).
 Nick Pippenger.
 Nondeterministic Polynomial (time).
 Nick’s Problems.
 times .
 Breaking RSA is equivalent to factoring integers:
 Yes
 No
 Yes on Riemann Hypothesis
 No on Riemann Hypothesis
 Open question.
 If and then it follows that which is true?
 .
 or .
 All the above
 Some of the above
 None of the above.
MultiChoice
OK, more serious now. Start your engines. Actually in chess, “start your engines” refers to computer chess programs, and would mean either you are cheating, or you are playing in the InfinityChess Freestyle tournament, which finishes today.

The main significance of being shown complete in 197071 is:
 complete sets weren’t known before.
 is a natural problem, and this led to many other natural problems in being shown to be complete.
 Logic is a universal language.
 does not have a polynomialtime algorithm.
 Formulas that are satisfiable can be verified after guessing a good assignment in polynomial time.
 We know that nondeterministic linear bounded automata (NLBAs) accept a different class of languages from because:
 allows using more than linear space.
 by the nondeterministic time hierarchy theorem.
 is closed downward under polynomialtime manyone reductions but the class of NLBA languages is not.
 NLBAs can accept complete languages and .
 NLBA languages are closed under complement, while isn’t.
 The easiest languages currently known not to be in nonuniform belong to:
 .
 Kurt Gödel’s Second Incompleteness Theorem states that:
 No formal system can prove its own consistency.
 No formal system can prove its own consistency, unless it is inconsistent.
 No formal system that extends basic arithmetic can prove its own consistency.
 No formal system that extends basic arithmetic can prove its own consistency, unless it is inconsistent.
 No computably axiomatizable formal system that extends basic arithmetic can prove its own consistency, unless it is inconsistent.
 The language of undirected simple graphs that have a unique 4edge coloring up to isomprphisms belongs to —indeed, it is just disjoint unions of star graphs. In consequence:
 The “Four Edge Coloring” () problem cannot be complete.
 There does not exist a parsimonious polynomialtime reduction to the language unless factoring is in .
 and cannot be polynomialtime isomorphic, unless factoring is in .
 is solvable in quantum polynomial time.
 There does not exist a polynomialtime reduction from to that is parsimonious on the 4edgecoloring relation, unless factoring is in .
 Structural evidence for is:
 and several other classic problems are now known to be in .
 Polynomial identity testing can be derandomized.
 It follows if there are languages in that do not have sized circuits.
 No complete problems have been found.
 A random oracle makes .
Getting Help With Judgment Calls
We think each of our latter six “multichoice” questions has a clear best answer, but our judgment comes from perspectives in our field. For instance, “structural” complexity came with a specific meaning apart from algorithmic and practical considerations. Even granting that meaning, arguments can be made for several answers to the last question—all except the one that is false on current knowledge. For example, randomoracle results used to be considered stronger evidence than is commonly ascribed to them now.
We could have made catchall “some of the above” answers as in our first set. However, this would miss our feeling of there being a pecking order even among the nonoptimal answers. Again with reference to the last question, random oracles and complete languages are “structural” while the history of classifying problems is not, and between the first two, lacking a completeness level is not generally evidence of being tractable. Hence we see the possibility of better assessment by giving different partial credits to these answers.
An even more quantitative option is to ask the testtaker to rate each statement on (say) a 0–5 scale. This would be just like asking the takers to estimate the partial credits themselves. We could then score according to distributional similarity to our own assignment, weighting closeness on the best answers the most. Of course this style of grading is most appropriate to judging search engines, based on an expert reference assessment of the importance of the various ‘hits’ returned. And it is also like simulating the creation of “Solitaire Chess” itself—more than just looking for the best move which is what we do when we actually play chess. Thus the teacher has a harder task than the player.
The most ambitious goal is to turn the process around by making backwards inferences about the values of questions from the aggregated selection of many wellinformed takers. In chess this would be like judging the value of a move based on the proportion of strong players who choose it. Nowadays this is regarded as overruled by the judgments of strong computer programs, notwithstanding the issue that players’ “book knowledge” of past games makes their choices less independent than among test takers.
However, the ability in chess to correlate players’ judgments with computer values of moves, and map the distributions, may help us make inferences about “objective value” from the distributions of the testtakers. The computer values are scientifically objective partial credits, and “Solitaire Chess” could be scored as an app that way. This all plays into quantifying the wisdom of crowds along lines discussed toward the end of the Distinguished Speaker lecture given by Lance Fortnow on his visit to Buffalo last week. At least this is our motive for making tests more like strategic chess.
Open Problems
What partialcredit values would you assign to our complexity questions?
Should multiplechoice tests be more like “Solitaire Chess”? Does one obtain deeper and better assessment that way? Is the difference important enough to massive online courses?
§
Here are the answers to our April Fool’s anagram quiz, besides “Pearl Gates” = Peter Sagal and “Slack Laser” = Carl Kasell:
 “Hydra Dye Frog” is Godfrey Hardy; the item is true.
 “Gail Kali” is Gil Kalai, and the bio is false—although the description of his eponymous conjecture is fine.
 “RichCal Fried Sugars” is Carl Friedrich Gauss, and the Ceres story is fit to go on a box of frosted cornflakes.
 “OneAle Hurdler” is Leonhard Euler, and the Königsberg Bridge Problem is indeed hailed as founding Graph Theory.
 “OneCat Tree” is Terence Tao, and our extrapolation of his new result on NavierStokes is a flight of fancy.
 “Wes Gromit” is Tim Gowers, and we twisted past all the great work by him and Tao and “Gene Bern” = Ben Green on arithmetic progressions with a false statement—one falsified by the older theorem of Gustav Dirichlet that every infinite arithmetic progression without a common factor has infinitely many primes.
 “Annual Trig” is Alan Turing, true item.
 “Town Falconer” is Lance Fortnow, and we stated things backwards: he and “Semi Spiker” = Mike Sipser gave an oracle for , while “Amish Raid” = Adi Shamir finished the oracleexorcising result .
 “Joint Who Tolled” is John Littlewood, true description.
 “Glacial Warmish” is William Gasarch, but as with “Falconer” the theorem statement is backwards.
 “Glib Tales” is a wellknown anagram of Bill Gates, and yes he was mated in 9 moves.
 “Sonata Consort” is Scott Aaronson, and the opposite of our first statement is proclaimed on the masthead of his blog. Indeed all six “fools” are bloggers like us…
If you choose from these responses uniformly at random, what is the probability you will be correct?
a) 1/4
b) 1/4
c) 1/2
d) 0
(Extra credit – who first made up this little delight? I’d like to give them credit, but I don’t remember…)
Dick and Ken, thank you for these fun (and wonderful) questions.
Here is a lighthearted question (yet open and seemingly difficult) in the same class as your Question 3:
As the MathOverflow question “For which Millennium Problems does undecidable > true?“, this question has been wellrated yet unanswered since September 2012, and therefore (inspired by GLL) a bounty now is offered for it.
Have fun … and please consider posting serious answers to MathOverflow.
As a followup to the above question — which is attracting some highpower answers — Allyn Jackson’s Notices of the AMS article “Interview with Martin Davis“ (2008) is commended (to students especially) as a lucid, warmhearted discussion of mathematical decidability in general. It’s fun to learn the reasons why Martin Davis’ name can be added to Gödel’s Lost Letter’s illustrious roster of PvsNP mathematicianagnostics.
Note that John has omitted the Riemann Hypothesis, since as covered in a relevant early post by Dick, it is equivalent to a universal formula (that is, “Pi_One”). Hence as with Fermat’s Last Theorem, undecidability is equivalent to truth for it. Incidentally, (Jeff) Lagarias who is credited there is the first name I thought of for Dick’s “Glacial Warmish” anagram.
We also considered “inflating” the idea given here into the fullblown April Fool’s post, to be semiserious like last year’s meteorite post. The doubling detail in this sprightly NYT column by Max Tegmark makes me wonder if it’s a bit less fanciful…
Ken is right. It’s worth noting too that the “GLL MultipleCredit Problems” and the “Clay Mathematics Institute (CMI) Millennium Prize Problems” are alike in that their rules are sufficiently elastic as to encompass “burning arrow” modifications of the questionsasked. In particular, the CMI’s Prize Rules explicitly accommodate “burning arrow” problemmodifications via the following openended language:
Conclusion The CMI’s openended prizerules are outstandingly wellconsidered (as it seems to me).
Students, light your arrows! 🙂
I wonder if there is a chess position such that:
There is a mate in 1 for the player to move.
If that move is not taken, there is a mate in 1 for the other player.
If that move is not taken, there is a mate in 1 for the first player.
…
And so on for some number of moves.
The position and moves should be “reasonable” and not completely artificial.
As for me and chess, I was the kind of player who would think of a move, decide that it was poor, consider and reject a number of other possible moves, then make my originally rejected poor move because I forgot why I rejected it.