Can chess statistics help design multiple-choice exams?

 UT Dallas source

Bruce Pandolfini is one of very few living people who have been played by an Oscar-winning actor. He was the real-life chess teacher of junior chess player Joshua Waitzkin, who went on to become the world champion—in T’ai Chi. Their story is told in the movie “Searching For Bobby Fischer” with Sir Ben Kingsley, which is still iconic after 20-plus years. Pandolfini is still doing what he loves as an active chess teacher in New York City. For much of this time he has also written a popular feature called “Solitaire Chess” for Chess Life magazine, which is published by the United States Chess Federation.

Today Dick and I wish to compare styles of multiple-choice exams, with reference to “Solitaire Chess,” and have some fun as well.

Most multiple-choice questions are designed to have a unique correct answer, with all other answers receiving 0 points or even a minus. This is like a chess problem of “find the winning move” type. Mate-in-2, mate-in-3, and endgame problems generally have unique answers—a “dual” solution is an esthetic blemish. There are several popular websites devoted to this kind of chess puzzle, which is great for honing one’s tactical ability.

“Solitaire Chess” is different, with more emphasis on strategy. The reader takes the winning side of a notable game Bruce has prepared, and chooses his/her move before revealing the answer and the opponent’s next move. It simulates the feeling of playing a master game.

Incidentally Bruce recently attended the wedding of another master player represented in the movie, the real-life Asa Hoffman to the former Virginia LoPresto, who also remember me from New York tournaments in the 1970’s. Among children Bruce has coached in preteen years is the world’s current 5th-ranked player, Fabiano Caruana of Brooklyn and Italy.

Solitaire Chess

The difference we emphasize is that the game positions often give partial or even full credit for alternative choices. For example, here is the position at move 22 in the March 2014 Chess Life column, of a game that was played in 1934 by Fred Reinfeld, an earlier master teacher who wrote many great books until the 1960’s. White is to play:

The top score of 5 points goes to the capture move 22.axb4, but the alternatives 22.Nc6 and 22.Ne6+ are deemed almost as good, worth 4 points each, while the non-capture move 22.a4 still gets 2 points. Several other game turns have 3-point partial credits. At the end is a chart connecting your total score over all the moves to the standard chess rating scale devised by Arpad Elo. For instance, 81–94 points is deemed the range of a 2200–2399 master player such as myself, while 36–50 is for a “good club player” with 1600–1799 rating.

Pandolfini’s expert judgment goes into setting both the partial credits and the overall assessment scale. Although chess positions often have 30–50 legal moves or even more, there are typically at most 3–5 moves worth considering, so this is like a standard multiple-choice test in that way. The partial credits, however, are more typical of ranking applications such as judging the value of search-engine hits, where there are 10, 20, 30, or hundreds or thousands of choices to consider. Our topic is about having the best of both kinds of application, and how to do the assessment scientifically.

But Let’s Have Fun

Well we guess you didn’t come to a blog to take an exam, so we’ll try to make at least the first part fun, before we introduce more “strategic” questions with partial credits. You are on your honor not to Google the answers—we can tell of course; we won’t tell you how we know but our heartbleeds for you.

1. PCP stands for which of the following?
1. Post Correspondence Problem

2. Provably Checkable Proofs

3. A recreational drug
4. Probabilistically Checkable Proofs
5. All the above
6. Some of the above
7. None of the above.
2. The first Turing Award Winner was—?
1. Alan Turing, posthumously for Turing Machines
2. John Backus for Fortran
3. Alan Perlis for Algol
4. None of the above.
3. By the terms of the Clay Millennium Prize, it would be awarded for—?
1. A proof that ${\mathsf{P=NP}}$.
2. A proof that ${\mathsf{P \neq NP}}$.
3. A proof that ${\mathsf{P=NP}}$ is independent of set theory.
4. All the above
5. Some of the above
6. None of the above.
4. ${\mathsf{NP}}$ stands for:
1. Not Polynomial (time).
2. Nick Pippenger.
3. Nondeterministic Polynomial (time).
4. Nick’s Problems.
5. ${N}$ times ${P}$.
5. Breaking RSA is equivalent to factoring integers:
1. Yes
2. No
3. Yes on Riemann Hypothesis
4. No on Riemann Hypothesis
5. Open question.
6. If ${\mathsf{BQP=BPP}}$ and ${\mathsf{P=NC}}$ then it follows that which is true?
1. ${\mathsf{P=NP}}$.
2. ${\mathsf{P \neq NP}}$
3. ${\mathsf{NL=P}}$ or ${\mathsf{NL=NC}}$.
4. All the above
5. Some of the above
6. None of the above.

Multi-Choice

OK, more serious now. Start your engines. Actually in chess, “start your engines” refers to computer chess programs, and would mean either you are cheating, or you are playing in the InfinityChess Freestyle tournament, which finishes today.

1. The main significance of ${\mathsf{SAT}}$ being shown ${\mathsf{NP}}$-complete in 1970-71 is:

1. ${\mathsf{NP}}$-complete sets weren’t known before.

2. ${\mathsf{SAT}}$ is a natural problem, and this led to many other natural problems in ${\mathsf{NP}}$ being shown to be complete.

3. Logic is a universal language.

4. ${\mathsf{SAT}}$ does not have a polynomial-time algorithm.

5. Formulas that are satisfiable can be verified after guessing a good assignment in polynomial time.
2. We know that nondeterministic linear bounded automata (NLBAs) accept a different class of languages from ${\mathsf{NP}}$ because:
1. ${\mathsf{NP}}$ allows using more than linear space.

2. ${\mathsf{NP} \neq \mathsf{NTIME}[O(n)]}$ by the nondeterministic time hierarchy theorem.

3. ${\mathsf{NP}}$ is closed downward under polynomial-time many-one reductions but the class of NLBA languages is not.

4. NLBAs can accept ${\mathsf{PSPACE}}$-complete languages and ${\mathsf{PSPACE \supset NP}}$.

5. NLBA languages are closed under complement, while ${\mathsf{NP}}$ isn’t.
3. The easiest languages currently known not to be in nonuniform ${\mathsf{ACC}}$ belong to:
1. ${\mathsf{NEXP}}$

2. ${\mathsf{EXP}}$

3. ${\mathsf{NP}}$

4. ${\mathsf{P}}$

5. ${\mathsf{NC^{1}}}$.
4. Kurt Gödel’s Second Incompleteness Theorem states that:
1. No formal system can prove its own consistency.

2. No formal system can prove its own consistency, unless it is inconsistent.

3. No formal system that extends basic arithmetic can prove its own consistency.

4. No formal system that extends basic arithmetic can prove its own consistency, unless it is inconsistent.

5. No computably axiomatizable formal system that extends basic arithmetic can prove its own consistency, unless it is inconsistent.
5. The language of undirected simple graphs that have a unique 4-edge coloring up to isomprphisms belongs to ${\mathsf{P}}$—indeed, it is just disjoint unions of star graphs. In consequence:
1. The “Four Edge Coloring” (${\mathsf{4EC}}$) problem cannot be ${\mathsf{NP}}$-complete.

2. There does not exist a parsimonious polynomial-time reduction to the ${\mathsf{4EC}}$ language unless factoring is in ${\mathsf{P}}$.

3. ${\mathsf{SAT}}$ and ${\mathsf{4EC}}$ cannot be polynomial-time isomorphic, unless factoring is in ${\mathsf{P}}$.

4. ${\mathsf{4EC}}$ is solvable in quantum polynomial time.

5. There does not exist a polynomial-time reduction from ${\mathsf{SAT}}$ to ${\mathsf{4EC}}$ that is parsimonious on the 4-edge-coloring relation, unless factoring is in ${\mathsf{P}}$.
6. Structural evidence for ${\mathsf{BPP = P}}$ is:
1. ${\mathsf{Primes}}$ and several other classic ${\mathsf{BPP}}$ problems are now known to be in ${\mathsf{P}}$.

2. Polynomial identity testing can be de-randomized.

3. It follows if there are languages in ${\mathsf{EXP}}$ that do not have ${2^{o(n)}}$-sized circuits.

4. No ${\mathsf{BPP}}$-complete problems have been found.

5. A random oracle ${A}$ makes ${\mathsf{BPP}^A = \mathsf{P}^A}$.

Getting Help With Judgment Calls

We think each of our latter six “multi-choice” questions has a clear best answer, but our judgment comes from perspectives in our field. For instance, “structural” complexity came with a specific meaning apart from algorithmic and practical considerations. Even granting that meaning, arguments can be made for several answers to the last question—all except the one that is false on current knowledge. For example, random-oracle results used to be considered stronger evidence than is commonly ascribed to them now.

We could have made catch-all “some of the above” answers as in our first set. However, this would miss our feeling of there being a pecking order even among the non-optimal answers. Again with reference to the last question, random oracles and complete languages are “structural” while the history of classifying problems is not, and between the first two, lacking a completeness level is not generally evidence of being tractable. Hence we see the possibility of better assessment by giving different partial credits to these answers.

An even more quantitative option is to ask the test-taker to rate each statement on (say) a 0–5 scale. This would be just like asking the takers to estimate the partial credits themselves. We could then score according to distributional similarity to our own assignment, weighting closeness on the best answers the most. Of course this style of grading is most appropriate to judging search engines, based on an expert reference assessment of the importance of the various ‘hits’ returned. And it is also like simulating the creation of “Solitaire Chess” itself—more than just looking for the best move which is what we do when we actually play chess. Thus the teacher has a harder task than the player.

The most ambitious goal is to turn the process around by making backwards inferences about the values of questions from the aggregated selection of many well-informed takers. In chess this would be like judging the value of a move based on the proportion of strong players who choose it. Nowadays this is regarded as overruled by the judgments of strong computer programs, notwithstanding the issue that players’ “book knowledge” of past games makes their choices less independent than among test takers.

However, the ability in chess to correlate players’ judgments with computer values of moves, and map the distributions, may help us make inferences about “objective value” from the distributions of the test-takers. The computer values are scientifically objective partial credits, and “Solitaire Chess” could be scored as an app that way. This all plays into quantifying the wisdom of crowds along lines discussed toward the end of the Distinguished Speaker lecture given by Lance Fortnow on his visit to Buffalo last week. At least this is our motive for making tests more like strategic chess.

Open Problems

What partial-credit values would you assign to our complexity questions?

Should multiple-choice tests be more like “Solitaire Chess”? Does one obtain deeper and better assessment that way? Is the difference important enough to massive online courses?

§

Here are the answers to our April Fool’s anagram quiz, besides “Pearl Gates” = Peter Sagal and “Slack Laser” = Carl Kasell:

• “Hydra Dye Frog” is Godfrey Hardy; the item is true.

• “Gail Kali” is Gil Kalai, and the bio is false—although the description of his eponymous conjecture is fine.

• “Rich-Cal Fried Sugars” is Carl Friedrich Gauss, and the Ceres story is fit to go on a box of frosted cornflakes.

• “One-Ale Hurdler” is Leonhard Euler, and the Königsberg Bridge Problem is indeed hailed as founding Graph Theory.

• “One-Cat Tree” is Terence Tao, and our extrapolation of his new result on Navier-Stokes is a flight of fancy.

• “Wes Gromit” is Tim Gowers, and we twisted past all the great work by him and Tao and “Gene Bern” = Ben Green on arithmetic progressions with a false statement—one falsified by the older theorem of Gustav Dirichlet that every infinite arithmetic progression without a common factor has infinitely many primes.

• “Annual Trig” is Alan Turing, true item.

• “Town Falconer” is Lance Fortnow, and we stated things backwards: he and “Semi Spiker” = Mike Sipser gave an oracle ${A}$ for ${\mathsf{IP}^A \not\subseteq \mathsf{co}\text{-}\mathsf{NP}^A}$, while “Amish Raid” = Adi Shamir finished the oracle-exorcising result ${\mathsf{IP = PSPACE}}$.

• “Joint Who Tolled” is John Littlewood, true description.

• “Glacial Warmish” is William Gasarch, but as with “Falconer” the theorem statement is backwards.

• “Glib Tales” is a well-known anagram of Bill Gates, and yes he was mated in 9 moves.

• “Sonata Consort” is Scott Aaronson, and the opposite of our first statement is proclaimed on the masthead of his blog. Indeed all six “fools” are bloggers like us…

1. April 10, 2014 8:37 pm

If you choose from these responses uniformly at random, what is the probability you will be correct?

a) 1/4
b) 1/4
c) 1/2
d) 0

(Extra credit – who first made up this little delight? I’d like to give them credit, but I don’t remember…)

2. April 11, 2014 7:46 am

Dick and Ken, thank you for these fun (and wonderful) questions.

Here is a light-hearted question (yet open and seemingly difficult) in the same class as your Question 3:

Question Which of following five Clay Institute Millenium Prize Problems have the attribute undecidable → true?

i. Yang–Mills and Mass Gap
ii. P vs NP Problem
iii. Navier–Stokes Equation
iv. Hodge Conjecture
V. Birch and Swinnerton-Dyer Conjecture

Followup Question Do any of the five have the attribute undecidable → false?

As the MathOverflow question For which Millennium Problems does undecidable -> true?, this question has been well-rated yet unanswered since September 2012, and therefore (inspired by GLL) a bounty now is offered for it.

April 11, 2014 9:23 pm

As a followup to the above question — which is attracting some high-power answers — Allyn Jackson’s Notices of the AMS article Interview with Martin Davis (2008) is commended (to students especially) as a lucid, warm-hearted discussion of mathematical decidability in general. It’s fun to learn the reasons why Martin Davis’ name can be added to Gödel’s Lost Letter’s illustrious roster of PvsNP mathematician-agnostics.

3. April 12, 2014 8:46 am

Note that John has omitted the Riemann Hypothesis, since as covered in a relevant early post by Dick, it is equivalent to a universal formula (that is, “Pi_One”). Hence as with Fermat’s Last Theorem, undecidability is equivalent to truth for it. Incidentally, (Jeff) Lagarias who is credited there is the first name I thought of for Dick’s “Glacial Warmish” anagram.

We also considered “inflating” the idea given here into the full-blown April Fool’s post, to be semi-serious like last year’s meteorite post. The doubling detail in this sprightly NYT column by Max Tegmark makes me wonder if it’s a bit less fanciful…

• April 14, 2014 4:37 am

Ken is right. It’s worth noting too that the “GLL Multiple-Credit Problems” and the “Clay Mathematics Institute (CMI) Millennium Prize Problems” are alike in that their rules are sufficiently elastic as to encompass “burning arrow” modifications of the questions-asked. In particular, the CMI’s Prize Rules explicitly accommodate “burning arrow” problem-modifications via the following open-ended language:

“The publication of a plausible solution to one of them [the Prize Problems] will not go unnoticed, although it may take some years for it to receive general acceptance. The CMI’s Scientific Advisory Board (SAB) alone will decide when that point has been reached.” … “The ‘general acceptance’ condition is a positive one. It is not met by default by a paper that has generated little or no reaction since publication. A solution that has been generally accepted will have attracted a large number of citations by independent researchers in the field, will have been the subject of discussion at international meetings, will have received detailed scrutiny in papers by people other than the original authors, and will very likely have been recognised by other awards.” … [And in conclusion] “the SAB [Scientific Advisory Board] may consider recommending the award of the prize to an individual who has published work that, in the judgement of the SAB, fully resolves the questions raised by one of the Millennium Prize Problems even if it does not exactly meet the wording in the official problem description.”

Conclusion  The CMI’s open-ended prize-rules are outstandingly well-considered (as it seems to me).

4. April 24, 2014 10:57 pm

I wonder if there is a chess position such that:

There is a mate in 1 for the player to move.

If that move is not taken, there is a mate in 1 for the other player.

If that move is not taken, there is a mate in 1 for the first player.

And so on for some number of moves.

The position and moves should be “reasonable” and not completely artificial.

As for me and chess, I was the kind of player who would think of a move, decide that it was poor, consider and reject a number of other possible moves, then make my originally rejected poor move because I forgot why I rejected it.