Skip to content

Graduate Student Traps

May 15, 2013


Can we help avoid parallel repetition of mistakes?

irit

Irit Dinur has recently again shown a wonderful skill at re-conceptualizing an area that had seemingly been well worked out. A notable previous instance was her re-casting the proof of the PCP Theorem as a progressive amplification. Now she and David Steurer have posted a new paper whose title, “Analytical Approach to Parallel Repetition,” introduces a new framework. The subject of parallel repetition, which they call “a basic product operation for one-round two-player games,” is distinctive in having its genesis in a mistake made in a paper—a trap of automatic thinking. Lance Fortnow and Mike Sipser thought that executing multiple instances of a randomized protocol in parallel would have the same power-law reduction in the probability of overall failure as executing them independently in sequence, overlooking how the strategies in the parallel cases could be dependent and exploit this.

Today we talk about similar traps that people can fall into even at advanced graduate level. How might they be avoided?

Truth-in-blogging note: this post is really about a different case of products and independence, and most of it was written months ago. It was lacking an intro section with someone to feature according to our “blog invariant,” and we also wanted a few short examples of graduate-student traps in computational theory and mathematics before progressing to the main one. The parallel repetition example came not only first to mind, but also second and third and fourth… as Dick and I struggled to think of more good ones. Lance’s haute faute has already been mentioned a few times on this blog, and I thought it would make a tiresome and repetitious parallel to my own “trap.” It didn’t help that the neat example I saw online years ago which furnished the phrase “graduate-student trap”—but which I didn’t preserve and forgot—has evidently vanished into unsearchability.

I was trying to decide between leading with the late-medieval mathematician-theologian Nicholas de Cusa for his advice on not pretending to have completed knowledge, or alternately a colleague in Buffalo who has compiled a good graduate-student advice list. Lance has similar advice, and when looking for it I spotted the mention of Dinur in his Twitter feed—actually Lance’s re-tweet of one by my former student D. Sivakumar. Voilà—Lance’s example it was. Thanks, Irit.

Traps

What is a trap? Pitfalls and paradoxes abound in mathematics and the sciences, and surmounting them is just part of acquiring the literature. Sometimes it is a confounding of preconceived expectations, but it is hard to find a way of defining such expectations that works for everybody or even most people. What makes a trap, in my opinion, is there being concrete indications in the concepts, in their contextual use, in their notation, or in the literature itself that run counter to the truth. Here are what strike Dick and me as a few simple but significant examples:

{\bullet} Square root is not a function. It is written like a function, but isn’t. Here is an example of what you can “do” by conveniently forgetting this: {-1/1 = 1/-1}, so take square roots of both sides. You get

\displaystyle \frac{\sqrt{-1}}{\sqrt{1}} = \frac{\sqrt{1}}{\sqrt{-1}}\quad\text{so}\quad \frac{i}{1} = \frac{1}{i}\quad\text{so}\quad i^2 = 1.

This contradicts the definition {i^2 = -1}.

{\bullet} Not all matrices are diagonalizable. Since even many singular matrices are diagonalizable, it is easy to forget this is not true in general. If it were true, then there would be a really quick proof that a matrix {A} always satisfies its characteristic polynomial {p}. Namely, let {A = P D P^{-1}} with {D} the diagonal matrix of eigenvalues. Then on substituting the right-hand side into the formula {p(A)}, all the {P}‘s and {P^{-1}}‘s cancel except for one {P} in front and one {P^{-1}} in back. The rest is the component-wise evaluation {p(\lambda)} on each eigenvalue {\lambda} in {D}, which identically vanishes, leaving the all-zero matrix.

Well this is often not a bad error to make. Every matrix {A} has arbitrarily close perturbed forms {A'} that are diagonalizable, indeed have distinct eigenvalues. The above proof gives {p'(A') = 0} where the characteristic polynomial {p'} is coefficient-wise close to {p}. Continuity then implies {p(A) = 0}.

{\bullet} {\mathbb{Z}_{p^k}} is not the same as the field {GF(p^k)}. The former is not a field, as it has zero divisors. The multiplicative subgroup formed by the elements that are not multiples of {p} is not a field either. But this is again not always a bad error to make, even in crypto. A lot of properties and problems are similar between the structures.

These are really at high school or undergraduate level, before the research stage. What kind of traps matter at research level?

My Trap

My own strongest feeling of falling into a “graduate student trap” came in October 2006, as I began my work on statistical claims of cheating with computers at chess that had arisen during the world championship match that month and before. I modeled a human player and a computer as distributions {P} and {Q} over the choices of available moves in game positions. Cheating would depend on how close the ensemble of played moves was to {P} vis-à-vis {Q}, so I wanted a suitable distance measure {d(P,Q)} between distributions. Modeling the computer as a distribution not only allows for different chess-engine program versions and parameter settings, but also for a steady amount of small variation caused by hash collisions—as I described first here and mainly here.

I decided to postulate that for two different (sets of) positions {S_1} and {S_2}, the player’s distributions {P_1,P_2} would be independent, and similarly {Q_1,Q_2} for the computer. This makes the joint distributions {P} and {Q} on {S_1 \cup S_2} satisfy {P = P_1 \times P_2} and {Q = Q_1 \times Q_2}. So that I could group game turns as I wished, I wanted the distance measure to be additive, namely

\displaystyle  d(P,Q) = d(P_1 \times P_2, Q_1 \times Q_2) = d(P_1,Q_1) + d(P_2,Q_2).

The first distance measure I considered, called Kullback-Leibler (K-L) divergence, is defined (on discrete domains {X}) by

\displaystyle  \kappa(P || Q) = \sum_{x \in X}P(x)\ln\frac{P(x)}{Q(x)}.

Its Wikipedia page says straight out that {\kappa} is additive. Great, I thought.

Unfortunately, K-L is not symmetric, and more of concern to me, {\kappa} approaches {+\infty} whenever there are events {x} for which {Q(x)} is tiny but {P(x)} is not. In chess, such events would be moves the computer recognizes as bad but that players still fall into, or are tempted by. This was a concern because chess positions can have many bad moves, so that the “tail” of the move distribution could distort the value of {\kappa}. I could switch around {P} and {Q} to avoid this, but then reasonable moves shunned by players would cause other distortion.

Is Jensen-Shannon Divergence Additive?

Applications employing distributional divergence measures were new to me, but it so happened that my department’s Distinguished Alumni Speaker that month knew something about them. After hearing my issues, he—I won’t name the “guilty party,” though I already did—suggested using Jensen-Shannon (J-S) divergence instead. J-S reduces this distortion by employing the interpolated distribution {R = \frac{1}{2}P + \frac{1}{2}Q}. Then it is defined from two invocations of K-L by

\displaystyle  \eta(P,Q) = \frac{1}{2}\kappa(P || R) + \frac{1}{2}\kappa(Q || R).

This always gives finite values, and is symmetric—hence the use of comma not {||}. Analogous to how the sum of squared differences, which is obviously additive on product vectors, is the square of the Euclidean metric, {\eta} is also the square of a metric. All this, plus the absence of contrary information, plus the frequent words “J-S is a symmetrized and smoothed version of K-L,” naturally made me assume that {\eta} was additive. Grateful for the tip, I happily started on the machinery to apply it for chess.

A week later I started drafting a paper describing my concept and model, and decided it would be good to include a proof that J-S divergence is additive. Then, only then, is when I discovered with an electric shock:

It isn’t.

I’ll leave the task of actually constructing counterexamples to the reader, but here’s an intuition. It uses a generalizing trick that reminds me of one by Bob Vaughan that we covered last July. For any {\lambda} put {R' = \lambda P + (1 - \lambda)Q}, and then define

\displaystyle  \eta'_{\lambda}(P,Q) = \lambda \kappa(P||R') + (1 - \lambda)\kappa(Q||R').

A little reflection may convince you that this cannot be additive for all {\lambda}. Hence its being additive for {\lambda = \frac{1}{2}}, which yields {\eta}, would be an “accident.” Finally thinking how {P} and {Q} themselves can give-and-go with {\lambda} gives the inkling that the accident doesn’t happen.

How Clear in the Literature?

I can put this in the form of a “blog-beg” (sometimes called a bleg):

Can you find an easily-accessed source that says clearly that the basic Jensen-Shannon divergence is not additive?

As of this writing, the Wikipedia page on J-S still does have such a statement. Adding one yourself would be cheating. In 2006 I did not find one elsewhere, even in a couple of texts. My one-hour trial by Google when I first drafted this post last summer found one paper in 2008 titled “Nonextensive Generalizations of the Jensen-Shannon Divergence.” This clued me in that nonextensive is the antonym of additive. So the authors’ generalizations are not additive, but what about the original {\eta}?

Another paper I found had the promising title “Properties of Classical and Quantum Jensen-Shannon Divergence,” and even more, its first author and I have Harry Burhman as a common coauthor. It defines J-S with a bang in the opening sentence, states some generalizations of J-S, and (still on page 1) says the magic word:

Shannon entropy is additive in the sense that the entropy of independent random variables, defined as the entropy of their joint distribution, is the sum of their individual entropies. (emphasis in original)

But the next sentence brings up the different topic of Rényi entropy, and after a mention of “non-extensive (i.e. nonadditive) generalizations” of J-S it goes into quantum.

Another paper picks up the thread in its title, “Nonextensive Generalizations of the Jensen-Shannon Divergence.” The point of the first two words must be that the original J-S is additive, yes? It’s the generalizations that are non-additive. Right? The paper’s abstract says it builds something called the Jensen-Tsallis {q}-difference, “which is a nonextensive generalization of the JSD.” So the original J-S is extensive, then? After defining additivity, it brings up the generalizations in which

… the additivity property is abandoned, yielding the so-called nonextensive entropies.

The next paragraph introduces K-L and J-S, but doesn’t tell me whether the “abandoned” property was ever there. It seems that this simple knowledge is presumed, but how might a bright young graduate student—or an old one—find it in the first place?

Open Problems

Can you give some more examples of “graduate-student traps”? Ones that are helpful to know?

Is Jensen-Shannon divergence close to being additive, in some useful sense? This actually strikes me as a non-trappy, research-worthy question.

Advances on Group Isomorphism

May 11, 2013


Finally progress on this annoying problem

DavidRosenbaum

David Rosenbaum is right now the world expert on one of my favorite problems, group isomorphism. He is a third-year PhD student at the University of Washington in Seattle under Paul Beame, and has been visiting MIT this year to work with his other advisor, our familiar friend Aram Harrow. He presented a paper on this at SODA 2013, and recently posted a successor paper to the ArXiv. He grew up in the Portland area where he won prizes in scholastic chess tournaments.

Today I want to talk about his work, which not only advances our understanding of this problem, but also makes progress on other ones.

Read more…

A Most Perplexing Mystery

May 6, 2013


The discrete log and the factoring problem

Unknown

Antoine Joux is a crypto expert at Versailles Saint-Quentin-en-Yvelines University. He is also one of the crypto experts at CryptoExperts, having joined this startup company last November. His work is featured in all three of the company’s current top news items, though the asymptotic breakthrough on the exponent of finding discrete logs in small-characteristic fields which we covered last month is not among them. In its place are concrete results on two fields of medium characteristic (between a million and a billion) whose elements have bit-size 1,175 and 1,425. The news release on this concludes (emphasis in original):

[We] recommend to all cryptographic users to stop using medium prime fields.

Today I want to talk about a mystery, which I find the most puzzling problem in all of complexity theory, but which Ken thinks is “only” a sign of youth of the field.
Read more…

A Trillion Dollar Math Trick

May 2, 2013


How linear algebra can make databases go really fast

Unknown

Mike Stonebraker is one of the world’s leading expert on database technology. He started in academe at Berkeley, is now again in academe at MIT, and has launched a raft of successful companies. He is currently the co-founder and chief technology officer of at least three startups in the database area. One is called “Data Tamer” and is a joint venture with researchers from QCRI—Qatar Computing Research Institute—see this release.

Today I would like to talk about his presentation at the TTI Vanguard meeting on “Ginormous Systems” in DC. His talk was on “Seven Big Data Megatrends.”

By the way the word “Ginormous” is a real word–see here for the formal definition. I initially thought the Vanguard organizers had made up the word, but it is real. It should be obvious that Ginormous means large, actually really Large. This Vanguard meeting was dedicated to Ginormous systems of all kinds: from huge data centers, to city-wide systems, to supercomputers, and much more.

In Mike’s wonderful talk he made seven points about the past, present, and the future of database technology. He has a great track record, so likely he is mostly right on his guesses. One of his predictions was about a way of re-organizing databases that has several remarkable properties:

  • It speeds up database operations 50x. That is to say, on typical queries—ones that companies actually do—it is fifty times faster than classical database implementations. As a theorist we like speedups, especially asymptotic ones. But 50x is pretty cool. That is enough to change a query from an hour to a minute.
  • It is not a new idea. But the time is finally right, and Mike predicts that future databases will use this method.
  • It is an idea that no one seems to know who invented it. I asked Mike, I asked other experts at the conference, and all shrugged and said effectively: “I have no idea.” Curious.

Let’s look quickly at the way databases work, and then consider the trick.

Some Background

Modern databases store records—lots of them, usually on disks. A record is a fixed-size vector of information. The vector is divided into fields, where a field stores a type of information. An example is:

\displaystyle  [name, address, home\text{-}number, cell\text{-}number, age, \dots ].

Thus the first field is the person’s name, the next the address, and so on.

In a sense the data is really stored in a table—or an array if you wish to be mathematical—call it {D} for data. The rows contain each record, and the columns store the fields.

The issue is how the array is stored on the disk. Each record is stored one after the other on the disk. The records are stored as

\displaystyle  R_{1},R_{2},\dots,R_{N}.

Here each {R_{i}} is the {i^{th}} row.

This is a reasonable method, it puts each record together, and allows fast access of all of the records. Thus, a query can scan over all the records by reading the disk one track at a time. This is not a bad way to use a disk-like device.

Mike points out that all the classic database systems—well at least most—store records in this manner. Their code, which also is huge (if not ginormous) is tuned to handle data that is stored in this manner. Let’s call it the “record ordered method” (ROM). As a mathematical idea it is just storing the array in row-major order. Not only is this a perfectly fine way to organize the data, and to store the array, it respects principles that go back to COBOL in the 1950′s: Each data object should be conceptually and physically together.

But there is a better way.

The Trick

The trick to the 50x speedup is based on the deep, advanced, complex operation that we in math call the transpose of a matrix. Just kidding. It is based on the simple idea that instead of storing the matrix {D} we store the matrix {D^{\intercal}}. Recall {D^{\intercal}} is just the matrix defined by

\displaystyle  D^{\intercal}(j,i) = D(i,j).

Let’s call this the column ordered method: COM. Now the data on the disk contains

\displaystyle  C_{1},c_{2},\dots,C_{M}.

Here each {C_{j}} is the {j^{th}} column.

So why is this method so much faster than the ROM? The answer is how the data is accessed by the queries. The data is read much more than it is written, so the key is to speed up the ability to read the data. But the critical insight is this:

A query is likely to use only a few columns.

For example, suppose the query is:

Select all the records with age in the range [21,31] and cell phones with area code 404.

Then the query needs only to look at two columns. All the other fields are completely un-needed.

Now suppose the records have a hundred fields. Since the query only looks at two fields there is a huge speedup. Then the speedup is {2:100} roughly. In the COM the database algorithm only reads the data that it needs to use to answer the query. In the ROM method it reads all the data and that tremendously slows down the query. Note, things can be even worse, since the size of fields can vary widely. So the true speedup depends on the ratio of

\displaystyle  \frac{ \text{number bits used in query}}{\text{ number of bits in the whole record}}.

Clearly if a record has even one large field that is not used in the query, the speedup could be very large.

How did people not realize this simple idea: replace the table {D} by its transpose {D^{\intercal}}? Well they did not actually miss it, but its power was not realized until relatively recently.

Whose Trick?

As I stated earlier no one seems to be able to say who exactly discovered the COM. Maybe as a default we could call it the Gauss Database Method, since most things are named for him. I did track down a system called TAXIR that was essentially a COM storage system with focus on information-retrieval in biology in 1969. The paper describing it is by George Estabrook and Robert Brill. Maybe they invented it. Perhaps their focus on biology made it hard for those in databases to notice their work? Especially years ago before powerful on-line search engines. Perhaps.

Ken adds that in a textbook used years ago for Buffalo’s course on programming-language concepts, the COM idea was called “parallel arrays” and was frowned upon. The main reason given was that this structure was hard to maintain, as a single off-by-one indexing error in one array could damage the entire set of records. However, a high-level system can maintain the data in-sync, while modern machine architectures increase the reward for keeping just the data you need in caches.

Open Problems

Okay, maybe the trick is not worth a trillion dollars. But the total amount invested yearly in data systems suggests that the column idea could over the next few years be worth quite a few dollars.

A simple thought: Is the column method the best way to store records? Can we in theory prove it is best in some sense, or is there an even better method? So forget the million-dollar Clay prizes and go after the real money.

Happy Birthday, Kurt Gödel

April 28, 2013


Wang on Gödel, and Gödel on Wang

WangCover
source, with more Wang quotes

Hao Wang was a logician who made many important contributions to
mathematics and especially logic. His creation of the now famous tiling problem was of great importance. He did seminal work on mechanical mathematics, getting in 1983 the first Milestone Prize for Automated Theorem-Proving. Perhaps one of his best “results” was being the PhD advisor to Stephen Cook, Shimon Even, Stål Aanderaa, Joyce Friedman, and a recent invention-prize co-winner whom we mention below.

Today Ken and I wish to point out that it is Kurt Gödel’s birthday. Read more…

Sex, Lies, And Quantum Computers

April 27, 2013


Okay, no sex, but a discussion about quantum computers.

images

Steven Soderbergh directed the famous movie: Sex, Lies, and Videotape. This 1989 movie won the Palme d’Or at the 1989 Cannes Film Festival, and is now in the United States Library of Congress’ National Film Registry for being “culturally, historically, or aesthetically significant.”

Today I want to explore claims about quantum computation.
Read more…

Can We Prove Better Independence Theorems?

April 24, 2013


An approach to independence with more complexity dependence

Pelupessy

Florian Pelupessy recently defended his PhD thesis at the University of Ghent in Belgium. In joint work with Harvey Friedman, he found new short proofs for two independence results from Peano Arithmetic. One result is the famous “natural” Ramsey-theoretic independence result proved by Jeff Paris and Leo Harrington in 1977, while the other is a different Ramsey-type result obtained in 2010 by Friedman. Pelupessy also maintains a page with links on “phase transitions” in proof theory—meaning cases where a slight change in values of parameters makes a statement go from being provable to independent.

Today I want to talk about whether we can prove that some of our open problems are independent of Peano Arithmetic or other theories. Read more…

Follow

Get every new post delivered to your Inbox.

Join 561 other followers