Can we help avoid parallel repetition of mistakes?

Irit Dinur has recently again shown a wonderful skill at re-conceptualizing an area that had seemingly been well worked out. A notable previous instance was her re-casting the proof of the PCP Theorem as a progressive amplification. Now she and David Steurer have posted a new paper whose title, “Analytical Approach to Parallel Repetition,” introduces a new framework. The subject of parallel repetition, which they call “a basic product operation for one-round two-player games,” is distinctive in having its genesis in a mistake made in a paper—a trap of automatic thinking. Lance Fortnow and Mike Sipser thought that executing multiple instances of a randomized protocol in parallel would have the same power-law reduction in the probability of overall failure as executing them independently in sequence, overlooking how the strategies in the parallel cases could be dependent and exploit this.

Today we talk about similar traps that people can fall into even at advanced graduate level. How might they be avoided?

Truth-in-blogging note: this post is really about a different case of products and independence, and most of it was written months ago. It was lacking an intro section with someone to feature according to our “blog invariant,” and we also wanted a few short examples of graduate-student traps in computational theory and mathematics before progressing to the main one. The parallel repetition example came not only first to mind, but also second and third and fourth… as Dick and I struggled to think of more good ones. Lance’s haute faute has already been mentioned a few times on this blog, and I thought it would make a tiresome and repetitious parallel to my own “trap.” It didn’t help that the neat example I saw online years ago which furnished the phrase “graduate-student trap”—but which I didn’t preserve and forgot—has evidently vanished into unsearchability.

I was trying to decide between leading with the late-medieval mathematician-theologian Nicholas of Cusa for his advice on not pretending to have completed knowledge, or alternately a colleague in Buffalo who has compiled a good graduate-student advice list. Lance has similar advice, and when looking for it I spotted the mention of Dinur in his Twitter feed—actually Lance’s re-tweet of one by my former student D. Sivakumar. Voilà—Lance’s example it was. Thanks, Irit.

## Traps

What is a trap? Pitfalls and paradoxes abound in mathematics and the sciences, and surmounting them is just part of acquiring the literature. Sometimes it is a confounding of preconceived expectations, but it is hard to find a way of defining such expectations that works for everybody or even most people. What makes a trap, in my opinion, is there being concrete indications in the concepts, in their contextual use, in their notation, or in the literature itself that run counter to the truth. Here are what strike Dick and me as a few simple but significant examples:

${\bullet}$ Square root is not a function. It is written like a function, but isn’t. Here is an example of what you can “do” by conveniently forgetting this: ${-1/1 = 1/-1}$, so take square roots of both sides. You get

$\displaystyle \frac{\sqrt{-1}}{\sqrt{1}} = \frac{\sqrt{1}}{\sqrt{-1}}\quad\text{so}\quad \frac{i}{1} = \frac{1}{i}\quad\text{so}\quad i^2 = 1.$

This contradicts the definition ${i^2 = -1}$.
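The fallacy can be checked directly. Here is a minimal Python sketch (my own illustration) using the standard-library `cmath` module, whose `sqrt` picks the principal branch by convention; the failure of ${\sqrt{a}\sqrt{b} = \sqrt{ab}}$ for negative ${a, b}$ is exactly the trap:

```python
import cmath

# cmath.sqrt returns the principal square root: the branch with
# nonnegative real part, so cmath.sqrt(-1) is +1j by convention.
lhs = cmath.sqrt(-1) * cmath.sqrt(-1)   # i * i = -1
rhs = cmath.sqrt((-1) * (-1))           # sqrt(1) = 1
print(lhs, rhs)  # (-1+0j) (1+0j): sqrt(a)*sqrt(b) != sqrt(a*b) here
```

Any single-valued choice of branch breaks the product rule somewhere; that is the multi-valuedness lurking under the familiar notation.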

${\bullet}$ Not all matrices are diagonalizable. Since even many singular matrices are diagonalizable, it is easy to forget this is not true in general. If it were true, then there would be a really quick proof that a matrix ${A}$ always satisfies its characteristic polynomial ${p}$. Namely, let ${A = P D P^{-1}}$ with ${D}$ the diagonal matrix of eigenvalues. Then on substituting the right-hand side into the formula ${p(A)}$, all the ${P}$‘s and ${P^{-1}}$‘s cancel except for one ${P}$ in front and one ${P^{-1}}$ in back. The rest is the component-wise evaluation ${p(\lambda)}$ on each eigenvalue ${\lambda}$ in ${D}$, which identically vanishes, leaving the all-zero matrix.

Well this is often not a bad error to make. Every matrix ${A}$ has arbitrarily close perturbed forms ${A'}$ that are diagonalizable, indeed have distinct eigenvalues. The above proof gives ${p'(A') = 0}$ where the characteristic polynomial ${p'}$ is coefficient-wise close to ${p}$. Continuity then implies ${p(A) = 0}$.
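Both halves of the story can be seen concretely. Here is a small pure-Python check (my own illustration, not from the post): a ${2 \times 2}$ Jordan block is not diagonalizable, yet the Cayley-Hamilton conclusion ${p(A) = 0}$ still holds for it:

```python
# A 2x2 Jordan block: repeated eigenvalue lam with only one independent
# eigenvector, hence not diagonalizable. Cayley-Hamilton still holds:
# its characteristic polynomial is p(x) = x^2 - 2*lam*x + lam^2.
lam = 3.0
A = [[lam, 1.0], [0.0, lam]]
I = [[1.0, 0.0], [0.0, 1.0]]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def lincomb(c1, X, c2, Y):
    # entrywise c1*X + c2*Y
    return [[c1 * X[i][j] + c2 * Y[i][j] for j in range(2)] for i in range(2)]

# p(A) = A^2 - 2*lam*A + lam^2*I
pA = lincomb(1.0, matmul(A, A), 1.0, lincomb(-2 * lam, A, lam * lam, I))
print(pA)  # [[0.0, 0.0], [0.0, 0.0]]
```

This is the perturbation argument's conclusion verified at the limit point itself.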

${\bullet}$ ${\mathbb{Z}_{p^k}}$ is not the same as the field ${GF(p^k)}$. The former is not a field, as it has zero divisors. The multiplicative subgroup formed by the elements that are not multiples of ${p}$ is not a field either. But this is again not always a bad error to make, even in crypto. A lot of properties and problems are similar between the structures.
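A tiny Python illustration (my own, taking ${p = 3, k = 2}$) of the difference: ${\mathbb{Z}_9}$ has zero divisors and only six units, whereas in the field ${GF(9)}$ all eight nonzero elements would be invertible:

```python
p, k = 3, 2
n = p ** k  # Z_9, i.e., the integers mod 9

# Zero divisors: 3 * 3 = 9 = 0 (mod 9), so Z_9 is not a field.
assert (3 * 3) % n == 0

# The units (elements with a multiplicative inverse) are exactly the
# residues coprime to p -- six of them, versus eight nonzero invertible
# elements in GF(9).
units = [x for x in range(1, n)
         if any((x * y) % n == 1 for y in range(1, n))]
print(units)  # [1, 2, 4, 5, 7, 8]
```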

These are really at high school or undergraduate level, before the research stage. What kind of traps matter at research level?

## My Trap

My own strongest feeling of falling into a “graduate student trap” came in October 2006, as I began my work on statistical claims of cheating with computers at chess that had arisen during the world championship match that month and before. I modeled a human player and a computer as distributions ${P}$ and ${Q}$ over the choices of available moves in game positions. Cheating would depend on how close the ensemble of played moves was to ${P}$ vis-à-vis ${Q}$, so I wanted a suitable distance measure ${d(P,Q)}$ between distributions. Modeling the computer as a distribution not only allows for different chess-engine program versions and parameter settings, but also for a steady amount of small variation caused by hash collisions—as I described first here and mainly here.

I decided to postulate that for two different (sets of) positions ${S_1}$ and ${S_2}$, the player’s distributions ${P_1,P_2}$ would be independent, and similarly ${Q_1,Q_2}$ for the computer. This makes the joint distributions ${P}$ and ${Q}$ on ${S_1 \cup S_2}$ satisfy ${P = P_1 \times P_2}$ and ${Q = Q_1 \times Q_2}$. So that I could group game turns as I wished, I wanted the distance measure to be additive, namely

$\displaystyle d(P,Q) = d(P_1 \times P_2, Q_1 \times Q_2) = d(P_1,Q_1) + d(P_2,Q_2).$

The first distance measure I considered, called Kullback-Leibler (K-L) divergence, is defined (on discrete domains ${X}$) by

$\displaystyle \kappa(P || Q) = \sum_{x \in X}P(x)\ln\frac{P(x)}{Q(x)}.$

Its Wikipedia page says straight out that ${\kappa}$ is additive. Great, I thought.
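The additivity of ${\kappa}$ is easy to confirm numerically. A short Python sketch, with the distributions chosen arbitrarily for illustration:

```python
import math

def kl(P, Q):
    """Kullback-Leibler divergence (natural log) on a finite domain."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

def product(P1, P2):
    """Joint distribution of independent draws from P1 and P2."""
    return [a * b for a in P1 for b in P2]

# Distributions chosen arbitrarily for illustration.
P1, Q1 = [0.9, 0.1], [0.1, 0.9]
P2, Q2 = [0.7, 0.3], [0.4, 0.6]

joint = kl(product(P1, P2), product(Q1, Q2))
split = kl(P1, Q1) + kl(P2, Q2)
assert abs(joint - split) < 1e-12  # K-L really is additive
```

The additivity falls out of ${\ln}$ turning the product ${P_1(x)P_2(y)/Q_1(x)Q_2(y)}$ into a sum.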

Unfortunately, K-L is not symmetric, and more of concern to me, ${\kappa}$ approaches ${+\infty}$ whenever there are events ${x}$ for which ${Q(x)}$ is tiny but ${P(x)}$ is not. In chess, such events would be moves the computer recognizes as bad but that players still fall into, or are tempted by. This was a concern because chess positions can have many bad moves, so that the “tail” of the move distribution could distort the value of ${\kappa}$. I could switch around ${P}$ and ${Q}$ to avoid this, but then reasonable moves shunned by players would cause other distortion.

## Is Jensen-Shannon Divergence Additive?

Applications employing distributional divergence measures were new to me, but it so happened that my department’s Distinguished Alumni Speaker that month knew something about them. After hearing my issues, he—I won’t name the “guilty party,” though I already did—suggested using Jensen-Shannon (J-S) divergence instead. J-S reduces this distortion by employing the interpolated distribution ${R = \frac{1}{2}P + \frac{1}{2}Q}$. Then it is defined from two invocations of K-L by

$\displaystyle \eta(P,Q) = \frac{1}{2}\kappa(P || R) + \frac{1}{2}\kappa(Q || R).$

This always gives finite values, and is symmetric—hence the use of comma not ${||}$. Analogous to how the sum of squared differences, which is obviously additive on product vectors, is the square of the Euclidean metric, ${\eta}$ is also the square of a metric. All this, plus the absence of contrary information, plus the frequent words “J-S is a symmetrized and smoothed version of K-L,” naturally made me assume that ${\eta}$ was additive. Grateful for the tip, I happily started on the machinery to apply it for chess.

A week later I started drafting a paper describing my concept and model, and decided it would be good to include a proof that J-S divergence is additive. Then, only then, is when I discovered with an electric shock:

It isn’t.

I’ll leave the task of actually constructing counterexamples to the reader, but here’s an intuition. It uses a generalizing trick that reminds me of one by Bob Vaughan that we covered last July. For any ${\lambda}$ put ${R' = \lambda P + (1 - \lambda)Q}$, and then define

$\displaystyle \eta'_{\lambda}(P,Q) = \lambda \kappa(P||R') + (1 - \lambda)\kappa(Q||R').$

A little reflection may convince you that this cannot be additive for all ${\lambda}$. Hence its being additive for ${\lambda = \frac{1}{2}}$, which yields ${\eta}$, would be an “accident.” Finally thinking how ${P}$ and ${Q}$ themselves can give-and-go with ${\lambda}$ gives the inkling that the accident doesn’t happen.
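For readers who want a quick numerical spoiler, here is one concrete counterexample (distributions of my own choosing, not from the post) showing that ${\eta}$ is not additive:

```python
import math

def kl(P, Q):
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

def js(P, Q):
    # eta(P,Q) = (1/2) kl(P||R) + (1/2) kl(Q||R) with R the midpoint
    R = [(p + q) / 2 for p, q in zip(P, Q)]
    return 0.5 * kl(P, R) + 0.5 * kl(Q, R)

def product(P1, P2):
    return [a * b for a in P1 for b in P2]

# Two copies of the same skewed pair -- my choice for illustration.
P1, Q1 = [0.9, 0.1], [0.1, 0.9]

joint = js(product(P1, P1), product(Q1, Q1))
split = js(P1, Q1) + js(P1, Q1)
print(joint, split)  # about 0.514 vs 0.736 -- additivity fails
```

Intuitively, the midpoint ${R}$ of the product distributions is not itself a product distribution, which is where additivity breaks.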

## How Clear in the Literature?

I can put this in the form of a “blog-beg” (sometimes called a bleg):

Can you find an easily-accessed source that says clearly that the basic Jensen-Shannon divergence is not additive?

As of this writing, the Wikipedia page on J-S still does not have such a statement. Adding one yourself would be cheating. In 2006 I did not find one elsewhere, even in a couple of texts. My one-hour trial by Google when I first drafted this post last summer found one paper from 2008 titled “Nonextensive Generalizations of the Jensen-Shannon Divergence.” This clued me in that nonextensive is the antonym of additive. So the authors’ generalizations are not additive, but what about the original ${\eta}$?

Another paper I found had the promising title “Properties of Classical and Quantum Jensen-Shannon Divergence,” and even more, its first author and I have Harry Buhrman as a common coauthor. It defines J-S with a bang in the opening sentence, states some generalizations of J-S, and (still on page 1) says the magic word:

Shannon entropy is additive in the sense that the entropy of independent random variables, defined as the entropy of their joint distribution, is the sum of their individual entropies. (emphasis in original)

But the next sentence brings up the different topic of Rényi entropy, and after a mention of “non-extensive (i.e. nonadditive) generalizations” of J-S it goes into quantum.

Returning to the paper titled “Nonextensive Generalizations of the Jensen-Shannon Divergence”: the point of the first two words must be that the original J-S is additive, yes? It’s the generalizations that are non-additive. Right? The paper’s abstract says it builds something called the Jensen-Tsallis ${q}$-difference, “which is a nonextensive generalization of the JSD.” So the original J-S is extensive, then? After defining additivity, it brings up the generalizations in which

… the additivity property is abandoned, yielding the so-called nonextensive entropies.

The next paragraph introduces K-L and J-S, but doesn’t tell me whether the “abandoned” property was ever there. It seems that this simple knowledge is presumed, but how might a bright young graduate student—or an old one—find it in the first place?

## Open Problems

Can you give some more examples of “graduate-student traps”? Ones that are helpful to know?

Is Jensen-Shannon divergence close to being additive, in some useful sense? This actually strikes me as a non-trappy, research-worthy question.

## Comments (28)
May 16, 2013 8:27 am

$Tr[AB]=Tr[BA]$
$Tr[ABC] = Tr[CAB] \neq Tr[CBA]$

May 16, 2013 8:33 am

How is JS divergence useful in computational complexity?

3. Joshua Grochow permalink
May 16, 2013 9:30 am

Here is a closely related question on cstheory.SE, with more examples of potential “traps”: http://cstheory.stackexchange.com/questions/3616/common-false-beliefs-in-theoretical-computer-science

(In fact, Ken, you could add the examples from your post as answers.)

May 16, 2013 9:31 am

blogger trap: not putting http before arxiv URLs ;-)

• May 16, 2013 10:16 am

Thanks—I fixed the bad one. To answer about JS divergence in complexity, what I know generally offhand is that notions of distributional closeness are central to many hardness-versus-randomness type applications. Google finds me this paper which gets more specific about JS and complexity.

May 16, 2013 12:47 pm

Very interesting. Thank you.

5. May 16, 2013 12:12 pm

I don’t get your point about square root! This is a function. What you prove is that it is not compatible with multiplication (it is not an endomorphism of the monoid of complex numbers under multiplication). Still, I agree that it can be a trap, but the title is misleading…

• May 16, 2013 1:37 pm

I think it’s a matter of interpretation. Maybe most accurate is to say there is no functional refinement of the square-root multi-function that is endomorphic on C, unlike the positive branch of sqrt over the nonnegative reals. As with logarithm, I think it’s having multiple branches that is fundamental to the failure of what we might expect from the reals.

May 16, 2013 1:38 pm

sqrt *is* a function. The problem is not with sqrt. The problem is running with i * i = -1 as if it is an axiom consistent with the rest of the axioms of arithmetic, which apparently leads to the inconsistent result that i * i also equals 1.

Also, I’ve never heard of “X is not a function” before. Then, what is it? Are we defining a new logic with new types of elements? What are the exact rules concerning these new types of elements? What is known about this logic (completeness, etc.)?

May 20, 2013 8:16 pm

From what I can find out, this is caused by the definition of sqrt. For reals, sqrt(x) -> y such that y >= 0 and y*y = x. However, letting i*i = -1 doesn’t quite allow us to claim that sqrt(-1) = i, because both -i and +i are candidates and neither of them can be compared to 0. +i is chosen by convention only, from what I can see.

However, any of the choices invalidates the identity sqrt(a) * sqrt(b) = sqrt(a*b) because sqrt(-1) * sqrt(-1) is not equal to sqrt(-1 * -1) = 1.

The only way out that I can see is declaring sqrt(-1) undefined.

6. May 16, 2013 2:10 pm

The most common trap I run across has to do with Triadic Relation Irreducibility (TRI), as noted and treated by the polymathematician C.S. Peirce.

This trap lies in the mistaken belief that every 3-place (triadic or ternary) relation can be analyzed purely in terms of 2-place (dyadic or binary) relations — “purely” here meaning without resorting to any 3-place relations in the process.

A notable thinker who not only fell but led many others into this trap is none other than René Descartes, whose problematic maxim I noted in the following post.

• May 17, 2013 6:40 am

Here is a budget of links to further exposition —

• U. S. permalink
May 17, 2013 11:32 pm

That’s very interesting! I had never considered this as a common fallacy, though I frequently encounter this issue.

Incidentally, my very first research project was exactly about measuring the difference between a triadic relation and its subsumed dyadic relations.

• May 18, 2013 8:30 pm

As mathematical traps go, this one is hydra-headed.

I don’t know how to put a prior restraint on the varieties of reducibility that might be considered, but usually we are talking about either one of two types of reducibility.

Compositional Reducibility. All triadic relations are irreducible under relational composition, since the composition of two dyadic relations is a dyadic relation, by the definition of relational composition.

Projective Reducibility. We take projections of a triadic relation $L \subseteq X \times Y \times Z$ on the coordinate planes $X \times Y,$ $X \times Z,$ $Y \times Z$ and ask whether these dyadic relations uniquely determine $L.$ If so, we say $L$ is projectively reducible, otherwise it is projectively irreducible.

7. don't pin them on us! permalink
May 16, 2013 3:22 pm

As a recent student, I (mildly) resent the condescension in the term “graduate student trap.” Especially when it’s used to describe mistakes made by professors.

• May 16, 2013 6:10 pm

Intendedly the opposite: higher than a high-school or college trap. I debated using the phrase “traps that could catch a Heffalump”, but that would have meant another paragraph explaining a well-worn reference. In grad school we teach the field to “youse guys” as peers. Anyway, that was the phrase I recall reading sometime prior to Oct. 2006, and which stuck in my mind at the time.

8. May 17, 2013 5:12 am

I totally agree with you guys.

9. May 19, 2013 2:43 pm

The induction trap: Let’s assume the statement for n and prove it for n+1 (after checking the base case). Now the statement for n already looks strong enough! Can’t we just be happy with the statement for n? If we assume the statement for n, can’t we just substitute n+1 for n? While proving things by induction you sometimes feel much closer to your target than you really are.

• Javaid Aslam permalink
May 21, 2013 7:59 pm

And sometimes, even going from 1 to 2 seems problematic.

• June 11, 2013 9:14 pm

And then there is the infinite inductive loop trap: you start proving some hypothesis by induction, you fall barely short of deriving the n+1 case from n, so you go back, strengthen your hypothesis a bit, and declare victory. But you should have checked whether the inductive proof of the stronger hypothesis goes through.

May 20, 2013 3:55 pm

I fell into the following trap:

A sequence of randomly generated integers is almost always monotonically increasing. Proof: Let the first number be n1. Then there are n1 numbers below it and infinitely many above it. Therefore a randomly selected n2 will almost surely satisfy n2 > n1, etc.

May 21, 2013 2:26 pm

The trap here being that there’s no uniform probability distribution on the integers.

May 21, 2013 6:07 pm

Indeed, one should define what it means for an arbitrary integer to be “randomly generated”. Whenever an algorithm chooses one randomly, it’s always among a finite set.

11. May 29, 2013 4:22 pm

Hey, my favorite proof of Cayley-Hamilton’s theorem (because I can remember it). ;-)