Tamal Biswas has been my graduate partner on my research model of decision-making at chess for over four years. He has, I believe, solved the puzzle of how best to incorporate *depth* into the model. This connects to ideas of the inherent *difficulty* of decisions, *levels* of thinking, and *stopping rules* by which to convert thought into action.

Today I describe his work on the model and the surprise of being able to distinguish skill solely on cases where people make mistakes. This is shown in two neat animations, one using data from the Stockfish 6 chess program, the other with the Komodo 9 program, whose elements are explained below.

Tamal presented this work on the Doctoral Consortium day of the 2015 Algorithmic Decision Theory conference in Lexington, Kentucky. The conference was organized and hosted by Judy Goldsmith, whom I have known since our undergraduate days at Princeton. Our full paper, “Measuring Level-K Reasoning, Satisficing, and Human Error in Game-Play Data,” will be presented at the 2015 IEEE Conference on Machine Learning and its Applications in Miami. Tamal presented our predecessor paper, “Quantifying Depth and Complexity of Thinking and Knowledge,” at the 2015 International Conference on Agents and AI in Lisbon last January.

Although we have blogged about the chess research several times before, I haven’t yet described details of my model here. After doing so we’ll see why depth has been tricky to incorporate and what our new discoveries mean.

The only connection to chess is that chess has alternating turns involving decisions whose options can be given numerical utility values that are objective and reliable but difficult for the players to discern. The numbers come from powerful chess programs (commonly called *engines*) whose rated playing skill long surpasses that of human players and is arguably measurably close to perfection. The Elo rating system is scaled so that a difference of 100 rating points derives from and predicts a 64%-to-36% points margin for the stronger player. The World Chess Federation lists over 20,000 human players above the 2200 threshold for “master” but only 45 players above 2700 and just four above 2800 including world champion Magnus Carlsen at 2850—while the best engines are rated above 3200 and my model currently suggests a ceiling below 3500.
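The 64%-to-36% margin at a 100-point gap follows from the standard logistic form of the Elo expectation; a quick check using the standard formula (not specific to our model):

```python
def elo_expected_score(rating_diff):
    """Expected score for the stronger player at a given rating advantage,
    under the standard logistic (base-10, divisor-400) Elo formula."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

# A 100-point advantage predicts roughly a 64%-to-36% points margin.
print(round(elo_expected_score(100), 2))  # -> 0.64
```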

The move values are updated by the program in rounds of increasing search depth, by the highest of which they have most often stabilized. The highest-valued option—or the first listed move in case of tied values—is the engine’s *best move* in the search, and its final value is the overall position value.

The numbers come from chess factors—beginning with basic values for the pieces such as pawn = 1, knight = 3, and so on—but they are governed by a powerful regularity. Position values are centered on 0.00 meaning “even chances,” positive for advantage to the side to move, and negative for disadvantage. The regularity is that when the average score (counting 1 per win and 0.5 per draw) achieved by players from positions of value $v$ is plotted against $v$, the result is almost perfectly a logistic curve.
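A sketch of such a logistic relationship between evaluation and expected score; the scale constant here is illustrative, not the fitted value from the model:

```python
import math

def expected_score(value, scale=1.0):
    """Logistic map from engine evaluation (in pawns, for the side to move)
    to expected game score. `scale` is an illustrative fitting constant."""
    return 1.0 / (1.0 + math.exp(-value / scale))

# An even position gives even chances; a clear advantage pushes toward 1.
print(expected_score(0.0))  # -> 0.5
print(expected_score(2.0) > 0.85)  # -> True
```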

The scaling factor offsets the scale of the engine’s evaluation function—for a simple instance, whether it counts a queen as 9 or 10. Curiously the curve flattens only a little for two players of lower Elo rating. When there is a substantial *difference* in rating between two players, however, the curve makes a strong horizontal shift:

A 2012 paper by Amir Ban not only observes this relationship (also separating out wins and draws) but argues that generating evaluations within the search that follow it optimizes the strength of the engine. We hence argue that the utility values have organic import beyond chess, and that the problem addressed by our model is a general one of “converting utilities into probabilities.”

By the way: Ban is famous for being a co-creator of both the Deep Junior chess program and the USB flash drive. That is right: the flash drive that we all use, every day, to store and transfer information—probably a bit more impactful than anything to do with chess, but amazing that he did both.

The model uses the move values together with parameters denoting skills of the player to generate probabilities for each legal move. Up to now I’ve used two parameters called $s$ for *sensitivity* and $c$ for *consistency*. They govern a term

where $\delta$ is a scaling adjustment on the raw difference in value between the move and the engine’s best move. Lower $s$ and higher $c$ both reduce the term and ultimately the probability of an inferior move.
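A minimal sketch of how a term of this kind could convert value differences into move probabilities; the exponential shape and the normalization here are illustrative assumptions, not the model's exact sharing rule:

```python
import math

def move_probabilities(deltas, s, c):
    """deltas[i] = scaled value drop of move i below the best move (>= 0).
    Lower sensitivity s and higher consistency c both shrink the weight
    of inferior moves; weights are then normalized (illustrative choice)."""
    weights = [math.exp(-(d / s) ** c) for d in deltas]
    total = sum(weights)
    return [w / total for w in weights]

# Best move (delta 0), a slight inaccuracy, and a blunder.
probs = move_probabilities([0.0, 0.5, 2.0], s=0.1, c=0.5)
print(probs[0] > probs[1] > probs[2])  # -> True
```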

Chess has a broad divide between “positional” and “tactical” skill. I’ve regarded $s$ as measuring positional skill via the ability to discern small differences in value between moves, and $c$ as avoidance of large errors, which crop up most often in tactical games. Neither uses any chess-specific features. Without introducing any dependence on chess, I’ve desired to enrich the model with other player parameters. Chief among them has been modeling *depth of thinking*, but it is tricky.

My first idea was to model depth as a random variable with a player-specific distribution. As a player I’ve had the experience of sometimes not seeing two moves ahead and other times seeing “everything.” A discrete distribution specified by a mean for a player’s “habitual depth” and a shape parameter seemed to fit the bill. The basic idea was to use the engine’s values for each move at all depths up to the limit depth of the search in an expression of the form

A related idea was to compute probabilities from the values at each depth and finally take their weighted average according to the depth distribution.
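The averaging idea can be sketched as follows; the depth weighting here is a hypothetical stand-in for the player's depth distribution described above:

```python
def mixture_probability(probs_by_depth, depth_weights):
    """probs_by_depth[d][i]: probability of move i computed from the
    engine's values at depth d. depth_weights: a (hypothetical) player
    distribution over depths. Returns the blended move probabilities."""
    n_moves = len(probs_by_depth[0])
    return [sum(w * pd[i] for w, pd in zip(depth_weights, probs_by_depth))
            for i in range(n_moves)]

# Two depths: the shallow search favors move 1, the deep search move 0.
blended = mixture_probability([[0.3, 0.7], [0.8, 0.2]], [0.4, 0.6])
print([round(b, 3) for b in blended])  # -> [0.6, 0.4]
```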

Doing so changed the regression from one over $s$ and $c$ to one over $s$, $c$, and the mean depth, and possibly the shape parameter. Whereas the former almost always has one clear minimum, the latter proved to be pockmarked with local minima having no global drift we could discern. Indeed, for many initial points, the fitted depth would drift toward the lowest depths with wild $s$ and $c$ values regardless of the skill of the games being analyzed. Perhaps this was caused by noisy values at the very lowest depths giving a good random match. I stayed with the two-parameter model in my official use and followed other priorities.

Looking back, I think my idea was wrong because *depth* is not primarily a human property. We don’t first (randomly) choose each time whether to be deep or shallow—to be a Fischer or a “fish” for a move. Instead, aspects of the position influence how deep we go before being *satisfied* enough to *stop*. One aspect which I had conceived within my scheme involves moves whose values *swing* as the depth increases.

A move involving a sacrifice might look poor until quite far into the search, and then “swing up” to become the best move when its latent value emerges. A “trap” set by the opponent might look tempting enough to be valued best at first even by the engines, but they will eventually “swing down” the value as the point of the trap is plumbed. The sublime trap that did much to win the 2008 world championship match for Viswanathan Anand over Vladimir Kramnik shows both phenomena:

Left: Position before Kramnik’s 29.Nxd4. Right: Anand’s winning 34…Nxe3! riposte. Below: Stockfish 6 values x100 at depths 1–19.

Kramnik’s 29.Nxd4 capture might be dismissed by a beginner since Black’s queen can just take the knight, but a seasoned player will see the followup 30.Rd1 skewering Black’s queen and knight on d7. Evidently this looked good to Kramnik through 30…Nf6 31.Rxd4 Nxg4 32.Rd7+ Kf6 33.Rxb7 Rc1+ 34.Bf1 coming out a pawn ahead, and Stockfish turns from negative to positive throughout depths 9–14. But then it sees the shocking 34…Ne3!!, when after 35.fxe3 fxe3 Black’s passed pawn is unstoppable. Kramnik never saw these moves coming and resigned then and there.

In my old view, Kramnik pulled a number between 9 and 14 out of his thinking cap and paid the piper to Anand who had rolled 16 or higher at his previous turn. Based on the good fits to my two-parameter model, the reality that most positions are clear-cut enough even for beginners to find the best move over 40% of the time, and reasoning about randomness, I pegged the phenomenon as secondary and likely offset by cases where the played move “swung up.”

In Tamal’s view the whole plateau of 9–14 put friction on Kramnik’s search. Once Kramnik was satisfied the advantage was holding, he stopped and played the fateful capture. Of various ways to quantify “swing,” Tamal chose one that emphasizes plateaus. Then he showed larger effects than I ever expected. Between the 0.0–1.0 and 4.0–5.0 ranges of “swing up” for the best move in his measure, the frequency of 2700-level players finding it plummets from 70% to 30%. This cannot be ignored.

This result presaged the effectiveness of several other measures formulated in our ICAART 2015 paper to represent the *complexity* and *difficulty* of a decision. We introduced some ideas relating to test-taking in an earlier post but now we’ve settled on metrics. The most distinctive of Tamal’s ideas blends our investigation of *depth* with Herbert Simon’s notion of *satisficing* going back to the years after World War II.

Simon coined *satisficing* from “satisfy” and “suffice” and contrasted it to *optimizing* during a search. It is often treated as a decision policy of meeting needs and constraints, but Simon also had in mind the kind of limitation from *bounded rationality* (another term he coined) that arises in searches when information is hard to find and values are hard to discern. In a zero-sum setting like chess, except for cases where one is so far ahead that safety trumps maximizing, any return short of optimum indicates an objective *mistake*. We address *why* players stop thinking when their move is (or is not) a mistake. We use the search depth $d$ to indicate both time and effectiveness, gauging the latter by comparing the values given by the engine at depth $d$ and at the supreme depth $D$ of the search.

Not all inferior moves have negative swing—their inferiority might lessen with depth—and not all moves of negative swing were valued best at some lower depths. For those that were valued best we can identify the depth at which the move is exposed as an error by the engine, as happens in our example above. We judge that the player failed to look deeper than this depth and call it the *depth of satisficing* for that move.

We can extend the notion to include the many negative-swing cases where the played move’s value does not actually cross that of the ultimate best move. For each depth $d$, plot the average over selected positions of the engine’s value for the played move and for the ultimate best move (the two coincide at the supreme depth $D$ by definition, but differ at depth $d$ if some other move is better there). Over all positions the curves tend to be relatively close at low depths, but over positions with played move of positive swing the curve for the played move stays significantly apart from that for the best move, roughly paralleling it. This explains our persistent observation at all skill levels that over the negative-swing moves the curves *do cross* at some depth. Indeed, figures in our paper show that the played move (green line below) is often significantly inferior to the best move (red) at low depths. The crossover depth of the averages becomes the *depth of satisficing* for the aggregate.
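A per-move version of the crossover can be sketched as follows, assuming we have the engine's value-by-depth arrays for the played move and the eventual best move; the function name and tie-handling are our own illustrative choices:

```python
def satisficing_depth(played_vals, best_vals):
    """played_vals[d], best_vals[d]: engine values at depth d (0-indexed)
    for the played move and the ultimate best move. Returns the last depth
    at which the played move still looked at least as good as the best
    move, or None if it never did."""
    last = None
    for d, (pv, bv) in enumerate(zip(played_vals, best_vals)):
        if pv >= bv:
            last = d
    return last

# The played move looks best through depth 4, then is exposed as an error.
print(satisficing_depth([0, 1, 5, 5, 5, -2, -3],
                        [0, 0, 2, 3, 4,  6,  7]))  # -> 4
```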

Using UB’s Center for Computational Research (CCR) to run the engines Stockfish 6 and Komodo 9, we have analyzed every published game played in 2010–2014 in which both players were within 10 Elo points of a century or half-century mark (widened to 15 or 20 for Elo 1300 to 1650, 2750, and 2800 to keep up the sample size). Combining Elo 1300 and 1350, 1400 and 1450, etc., shows that the crossover depth grows steadily with rating in both the GIF for Stockfish (left) and the GIF for Komodo (right).

There we have it: a strong correlate of skill derived **only** from moves where players **erred**. The inference is that better players make mistakes whose flaws require more time and depth to expose, as measured objectively by the engines. A further surprise is that the satisficing depths go clear down near zero for novice players—that we got anything regular from values at depths below 4 (which many engines omit) is noteworthy. That the best players’ figures barely poke above depth 10—which computers reach in milliseconds—is also sobering.

Chess is played with a time budget, much as one has when taking a standardized test. The first budget is almost always for turns 1–40, and although exhausting it before move 40 loses the game immediately, players often use the bulk of it by move 30 or even 25. The following moves are then played under “time pressure.” Segregating turns 9–25 from 26–40 (the opening turns 1–8 are skipped in the analysis) shows a clear effect of free time enhancing the depth and time pressure lowering it:

All of these games—this data set has over 1.08 million analyzed moves from over 17,500 games—come from recent real competitions, no simulations or subject waivers needed, in which skill levels are reliably known via the Elo system. Data this large, in a model that promotes theoretical formulations for practical phenomena, should provide a good standard of comparison for other applications. Tamal and I have many further ideas to pursue.

What parallels do you see between the thought processes revealed in chess and decision behaviors in other walks of life?


* The math of “The Curious Incident of the Dog in the Night-Time” *

Mark Haddon wrote the book *The Curious Incident of the Dog in the Night-Time*, which was published in 2003. It is about an autistic 15-year-old boy, a math savant, who solves a mystery in spite of his limitations in relating to people.

Today I want to comment on a minor historical inversion at the end of both the book and the current play that is based on Haddon’s book.

I had the great pleasure to see the play recently and found it an amazing experience. The story is told solely from the point of view of an autistic boy, named Christopher Boone. Amazon says:

Christopher John Francis Boone knows all the countries of the world and their capitals and every prime number up to 7,057. He relates well to animals but has no understanding of human emotions. He cannot stand to be touched. And he detests the color yellow.

I re-read the book days before seeing the play, and was unable to even imagine how the play could capture the feel of the book. But they did it. A New York Times review says:

Such a state of being is conjured with dazzling effectiveness in “The Curious Incident of the Dog in the Night-Time,” which opened on Sunday night at the Ethel Barrymore Theater. Adapted by Simon Stephens from Mark Haddon’s best-selling 2003 novel about an autistic boy’s coming-of-age, this is one of the most fully immersive works ever to wallop Broadway.

It was definitely a wallop. Both the book and the play end with a nice geometric problem. In both the answer, which is a proof, is left out of the main part. It is detailed in the book in an appendix; in the play it is delivered by Christopher after all the curtain calls. An “appendix” to a play—what a clever idea.

So let’s start by stating the geometric problem from both the book and play.

Prove the following: A triangle with sides that can be written in the form $n^2 + 1$, $n^2 - 1$, and $2n$, where $n > 1$, is a right triangle.

The proof starts by showing that $n^2 + 1$ is the longest side; this uses $n > 1$. Then it proves that

$$(n^2 - 1)^2 + (2n)^2 = (n^2 + 1)^2.$$

It then states that by the Pythagorean Theorem the triangle is a right one.
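For completeness, the algebra behind that step is a two-line expansion:

```latex
(n^2 - 1)^2 + (2n)^2 = n^4 - 2n^2 + 1 + 4n^2
                     = n^4 + 2n^2 + 1
                     = (n^2 + 1)^2.
```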

But this is inverted.

The famous Pythagorean theorem states:

Theorem: Let the sides of a right triangle be $a, b, c$ with $c$ the largest. Then $a^2 + b^2 = c^2$.

The proof of the problem from the book uses **not** this theorem—this is the inversion. Rather it uses the converse: For any triangle with sides $a, b, c$, if $a^2 + b^2 = c^2$, then it is a right triangle.

Happily this converse of the Pythagorean theorem is also a theorem. Indeed Euclid already had proved it. I must admit that I was not sure it was a theorem.

At the play I heard the problem for the first time, since when I read the book I skipped the appendix. As Christopher proved the theorem for the audience, I was almost ready to raise my hand—as if it were a seminar talk—and ask: isn’t there a potential issue with the proof, since it relies on the converse not the actual Pythagorean Theorem? Then I realized this wasn’t a lecture hall, and left the theater quietly.

The proof of the converse is not hard, but it is definitely a different theorem. What’s curious, however, is that its proof uses the original Pythagorean theorem. Here is Euclid’s proof as relayed by Wikipedia from this source:

Let $ABC$ be a triangle with side lengths $a$, $b$, and $c$, with $a^2 + b^2 = c^2$. Construct a second triangle with sides of length $a$ and $b$ containing a right angle. By the Pythagorean theorem, it follows that the hypotenuse of this triangle has length $\sqrt{a^2 + b^2} = c$, the same as the hypotenuse of the first triangle. Since both triangles’ sides are the same lengths $a$, $b$, and $c$, the triangles are congruent and must have the same angles. Therefore, the angle between the sides of lengths $a$ and $b$ in the original triangle is a right angle.

So here we have a proof of the ($\Leftarrow$) direction of an equivalence whose proof uses the ($\Rightarrow$) direction. How common is that?

Did you know that the Pythagorean Theorem was an “if and only if theorem?” I did not. Are there other notable cases of equivalences where the proof from the “Book” of the converse direction uses the forward direction?


*A breakthrough result shows the power of “almost”*

Cropped from *Quanta Magazine* source

Terry Tao has done it again. In two beautiful papers with modest titles, he has evidently proved the famous Discrepancy Conjecture (DC) of Paul Erdős. This emerged from discussion of his two earlier posts this month on his blog. They and his 9/18 announcement post re-create much of the content of the papers.

Today we wish to present just the statement of his new result in a vivid manner and some meta-observations on how he arrived at it.

Erdős originally offered a $500 prize for a solution. If we date that offer to an article he wrote in 1957, then it would be about $4,250 in today’s money, still a lot from one person. In 2013 we mused on whether a solution could be applied to win higher prizes. Last year there was an exhaustive proof of a partial case, but it was by computer brute force.

What worked now were the twin insights that DC follows from another conjecture and that proving an “almost” case of that conjecture could be enough to prove DC. We wonder whether finding new “almost” cases of our big conjectures in complexity theory will help win big prizes—or at least “almost” prizes.

A frog named Gorf sits on the shore of a large pond. Before him in the water is a stretch of lily pads. On or above each pad is a beetle or a fly. Gorf is hungry.

Gorf doesn’t worry about reaching the lily pads. He is a powerful jumper—he can jump any number of pads at once. Gorf’s problem is that once he starts jumping he can’t stop: if his first leap is to pad $d$, he will keep jumping to every $d$-th pad.

Gorf needs a *balanced* diet. If he keeps eating more beetles than flies he will get a tummyache; if he eats too many flies, well those wings—you know what happens if you eat too much fiber. He needs there to be a constant $b$ so that as he eats and eats, the absolute difference between the numbers of beetles and flies consumed never exceeds $b$.

Finally, Gorf is a worrywart. He is afraid the wrong choice of jumping distance could cause indigestion. So unless the sequence is such that *every* initial jump will lead to a $b$-balanced diet, he will stay on the shore and starve. Stated as a *problem*, what Erdős asked was this:

Does there exist an infinite sequence of beetles and flies so that Gorf won’t starve?

Erdős conjectured that there isn’t. Well OK, Erdős didn’t talk about frogs and bugs and flies. But he could have. Anyway he was right. Here is how he thought of it.

Each lily pad contains a number. Each fly is $+1$ and each beetle is $-1$. Or it could be the opposite; the problem is symmetric. Let $x_j$ be the number on lily pad $j$. Given the sequence $(x_j)$, what we care about is whether there is a $b$ such that for all $d$ and $n$,

$$\Big|\sum_{i=1}^{n} x_{id}\Big| \;\le\; b.$$

So as the frog jumps to every $d$-th pad and eats, he is adding $x_{id}$ to his running total $S_n = \sum_{i=1}^{n} x_{id}$. The question is: Can he jump from lily pad to lily pad and keep the number $S_n$ bounded—that is, keep it always between $-b$ and $b$, for some $b$ and all $n$? If he can, then say the sequence is *good*.
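Gorf's requirement is easy to check on a finite prefix of a sequence; a small sketch:

```python
def max_discrepancy(x, d):
    """Maximum absolute running sum along the progression d, 2d, 3d, ...
    x is conceptually 1-indexed: x[0] is lily pad 1."""
    total, worst = 0, 0
    for pad in range(d, len(x) + 1, d):
        total += x[pad - 1]
        worst = max(worst, abs(total))
    return worst

# Alternating fly, beetle: perfectly balanced for jump 1, terrible for jump 2.
seq = [1, -1] * 50
print(max_discrepancy(seq, 1))  # -> 1
print(max_discrepancy(seq, 2))  # -> 50
```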

Which sequences are good? Of course the answer depends on the sequence. If the pads are labeled

$$+1, +1, +1, +1, \dots$$

then clearly he will always get a larger and larger number. Note, this will happen no matter what jump he does.

Next, consider the sequence of lily pads

$$+1, -1, +1, -1, \dots$$

In this case Gorf can just jump one pad at a time ($d = 1$) and keep his count at $1$ or $0$. However, if he jumps 2, he only eats beetles. Since Gorf is worried that he would overhop the first lily pad and never be able to stop jumping two-by-two, he won’t budge. So this is still not good for **all** $d$, which is what Gorf needs.

Gorf reads a book on the Probabilistic Method. He wonders if he can trust that Nature is random—that the sequence of bugs and flies he encounters will conform to a normal distribution as $n$ grows, no matter what $d$ he chooses. So he should just take a random initial leap of faith and eat and be merry. But he realizes that the *discrepancy*—the difference between $S_n$ and zero—can be expected to vary as $\sqrt{n}$, not stay constant. This is the wrong way around for the method to imply the *existence* of a good sequence. Just the thought of this gives Gorf indigestion.

Hope comes from the fact that we can construct sequences that give vastly lower discrepancy than random ones. Given $n$, divide out all factors of 3. The resulting number will either be congruent to 1 mod 3, in which case $x_n = +1$, or congruent to 2, whereupon $x_n = -1$. Then the absolute partial sums stay within $O(\log n)$. But this is not bounded by a constant, so not good enough for Gorf.
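The mod-3 construction is easy to code, and one can watch its partial sums grow only slowly; a sketch:

```python
def x3(n):
    """Completely multiplicative ±1 sequence: strip all factors of 3,
    then +1 if the remainder is 1 mod 3, -1 if it is 2 mod 3."""
    while n % 3 == 0:
        n //= 3
    return 1 if n % 3 == 1 else -1

# Track the worst absolute partial sum up to N = 3^7 = 2187.
N = 3 ** 7
total, worst = 0, 0
for n in range(1, N + 1):
    total += x3(n)
    worst = max(worst, abs(total))
print(worst)  # small (logarithmic in N), but not constant
```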

This last sequence is also multiplicative: $x_{mn} = x_m x_n$. Until Tao’s result, no one had even ruled out the existence of a good sequence that is multiplicative. Multiplicative sequences can also be generalized by allowing values on—or at least within—the complex unit circle. For instance, let $\omega$ be on the circle and let $f$ be a function giving $x_n = \omega^{f(n)}$, such as $f(n) =$ the number of prime factors of $n$ including multiplicity (and $f(1) = 0$). Then the sequence is multiplicative and stays on the circle. Notions related to discrepancy can be extended to these sequences, perhaps summing just the real parts.

There is an easy way to make a sequence for Gorf that any diet doctor would approve: Leave some of the lily pads empty. That is, allow $0$ as a sequence value.

In fact, just repeat “beetle-fly-empty” over and over again: $-1, +1, 0, -1, +1, 0, \dots$ (regarding an empty pad as $0$). If the initial jump is $1$ or $4$ or $7$ etc., Gorf eats beetle-fly-beetle-fly…, as balanced as can be. If it is $2$ or $5$ or $8$, Gorf is equally happy: it’s fly-beetle-fly-beetle…

The problem for Gorf is that if $d$ is a multiple of 3 then he never eats. The problem for us is that the balancing is trivial. If you try filling in the zeros, you just have DC back again for cases where $d$ is a multiple of 3; if you do this balancing recursively, you get the sequence in the last section which doesn’t quite work.

Still, this showed that if the DC is true, it is true *because* Gorf is being “force-fed” at every pad. Exactly why that detail should matter may still be unclear.

DC is also true in a uniform sense: for every $b$ there exists $n$ such that for *every* sequence of length $n$, some jump makes Gorf’s absolute partial sums exceed $b$ before reaching lily pad $n$. This follows from a famous lemma of Dénes Kőnig: if every branch of a subtree of the infinite binary tree is finite, then the whole subtree is finite. This does not, however, prevent the existence of an infinite sequence such that for every $d$ there exists $b_d$ making $|\sum_{i=1}^{n} x_{id}| \le b_d$ for all $n$. Such a sequence is discussed in Remark 1.13 of Tao’s paper, where it is defined recursively.

Another delicate point is that if Gorf were allowed to start jumping at any initial pad $a$—say with $a$ relatively prime to $d$—then it was known that no sequence can meet the requirement of being good for all $d$ **and** all $a$. That is, any sequence has arithmetical progressions of unbounded discrepancy. Starting at zero—on the shore—is important to the problem.

Tao listed basically the same two mod-3 examples in his first post just before reporting the first bombshell of his breakthrough, which is that DC is implied by another conjecture.

Quanta Magazine src1, Wired src2

The other conjecture only needs the idea of a Dirichlet character to state. This is a completely multiplicative function $\chi$ that has some integer period $q$, and is nonzero exactly on $n$ relatively prime to $q$. The jumping-off point was a conjecture of Sarvadaman Chowla that for the $\pm 1$ sequence $x_n$ above and all fixed distinct shifts $h_1 < \dots < h_k$, the magnitude of

$$\sum_{n \le N} x_{n+h_1} x_{n+h_2} \cdots x_{n+h_k}$$

is asymptotically $o(N)$. Peter Elliott generalized it to include Dirichlet characters but used an asymptotic condition that was initially too strong. Building on work earlier this year with Kaisa Matomäki and Maksim Radziwiłł, Tao first repaired Elliott’s conjecture and then made it concrete. We’ll call it TEC for Tao’s Elliott conjecture; we have changed his variable names to try to connect it more to the uniform version of DC above:

Conjecture (TEC). For all there is such that for all there is such that for all , and all multiplicative sequences into the unit complex disk: if all Dirichlet characters of period at most give that the real part of has absolute value for all with ,

then for all ,

There are differences in the connection, most notably that the conclusion bounds a sum rather than showing it to be unbounded. However, what shakes out in the analysis is that keeping bounded this kind of sum of products over all possible $d$-jumps between lily pads (which can have complex life forms, not just beetles and flies) enables you to avoid losing a handle on the imbalances from the jumping that Gorf actually does. The two kinds of jumping are set up to be related by a Fourier transform, which characteristically relates constancy in one signal to waviness and periodicity in the other. The connection with products of this form is analyzed by an old trick which Tao once discussed here. There is finally an important use of Andrew Granville’s notion of *pretending*, which we once briefly covered in connection with EDC here, and which became a focal point in much public PolyMath work credited liberally by Tao.

Well we think this is what happens when one looks at the details—we always say read the papers for the details, though this time we haven’t had time to absorb them ourselves. But we will say one final thing: Tao does **not** prove TEC. No. What he does instead is come up with a way to say it holds “on average” with respect to a separate quantity that ranges over an unbounded set, is put in the denominator of the summand, and relaxes the right-hand side of the bound in the limit. It takes a lifetime of experience with the relevant tools of analysis to recognize when to apply this kind of transformation. But its success might stimulate us to think of more creative ways to use averaging arguments and average-case analysis in our field.

There is still much to do and digest after this breakthrough. Is any nice constructive bound on $n$ in terms of $b$ implied by the analysis? Recall that $b = 2$ gave $n = 1161$ last year. Gil Kalai and Tao have exchanged some further ideas beginning here.

One thing for sure, the *problem* of Erdős discrepancy is still a problem. Yogi Berra, who passed away yesterday, said “it ain’t over ’til it’s over,” and it isn’t over.



Composite of src1, src2

Gregory Valiant and Paul Valiant are top researchers who are not unrelated to each other. Families like the Valiants and Blums could be a subject for another post—or how to distinguish them from researchers who are unrelated.

Today Ken and I wish to talk about a wonderful paper of theirs, “An Automatic Inequality Prover and Instance Optimal Identity Testing.”

I really like this paper, which I saw at FOCS 2014. I am less keen on the title—it suggests to me that they are working on some type of AI area. The phrase “Automatic Inequality Prover” sounds like an algorithm that takes in an inequality and outputs a proof that it is correct. Of course it does this only when the inequality is true. This is not what they do.

Ken chimes in: The paper has a neat mix of other ideas. There are natural cases of extremely succinct descriptions of strings that would take double exponential time to write out. It is notable that the formal system underlying their derivation of inequalities does *not* yield exponential complexity (let alone undecidability) but stays within linear programming—do completeness results lurk? The instance-optimality notion is a hybrid of uniform and nonuniform lower bounds in a hybrid white-box/black-box setting.

They study the problem of telling one distribution from another. Suppose that $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ are two discrete distributions on the same domain.

The question is how many independent samples from $q$ are needed to distinguish the two cases: (i) $q = p$ from (ii) $q \ne p$.

The answer is too many, since $p$ and $q$ could be very close. So they replace (ii) by the weaker condition that $p$ and $q$ are far apart in the $\ell_1$ metric,

$$\|p - q\|_1 = \sum_i |p_i - q_i| \;\ge\; \epsilon.$$
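The $\ell_1$ distance is a one-liner; for concreteness:

```python
def l1_distance(p, q):
    """ℓ1 distance between two discrete distributions given as
    same-length probability vectors."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

print(l1_distance([0.5, 0.5], [1.0, 0.0]))  # -> 1.0
```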
Now they are able to give essentially optimal answers to the question. There is a twist on the notion of optimal. Given the distribution $p$ explicitly as an $n$-tuple, their algorithm is allowed to run within factors that depend on $p$. Their optimal bounds do not even separate out as a product of a function of $p$ and a function of $\epsilon$, though they give close upper bounds that do. In fact, they technically have a form in which an adjustment to the norm of $p$ governs the upper bound. The lower bound is expressed by the statement:

Theorem 1 There is a global constant $c > 0$ such that for all distributions $p$, all sufficiently small $\epsilon$ (possibly depending on $p$), and every tester $T$ given white-box knowledge of $p$ but limited to small samples from the unknown distribution $q$, there exists a distribution $q$ that either is $p$ or has $\|p - q\|_1 \ge \epsilon$, such that the probability over samples of $T$ giving the wrong answer is at least $c$.

We can “almost” jump ahead of the choice of $q$ in the quantifier order. Trivially we can’t: if $T$ is the “tester” that always says “distinct” then we need to choose $q = p$ to break it. But if $T$ avoids false positives when $q = p$, then what they build is an ensemble of distributions such that a random member of it impersonates $p$ well enough on small samples to make $T$ give a false negative. What’s neat is that no uniformity or complexity bound is needed on $T$—it’s a combinatorial argument directly on $p$ and the small sample size. The white-box nature of $p$ separates their bounds from the higher samples needed by a tester to *learn* $p$, if it too were unknown.

See their paper for the details, discussion of instance-optimality, and delicate issues such as not being able to define away the constant by adjusting $\epsilon$. (In the paper there are two constants rather than one.) Their relaxed upper bound is indeed simple:

$$O\!\left(\frac{\|p\|_{2/3}}{\epsilon^{2}}\right) \text{ samples.}$$
This improved a previous upper bound that had $\epsilon^{4}$ in the denominator and polylog$(n)$ factors in the numerator. That is quite a jump, achieved by a beautifully tight piece of analysis.

What strikes us even more is the “Automatic Inequality Prover” part. In order to analyze their algorithm for deciding the above question on distributions, they are naturally led to consider some “hairy” inequalities—their term. I suspect that Greg and Paul are quite facile at handling almost any inequalities, so for them to say the inequalities were hairy means they were indeed complicated inequalities.

This leads to the neatest part of their paper, in our opinion. They give a complete characterization of a general class of inequalities that extend Cauchy-Schwarz, Hölder’s inequality, and the monotonicity of norms. Here $p$—the traditional letter, not the same as the distribution $p$ above—means that the $\ell_p$ norm takes the sum of absolute $p$-th powers and outputs the positive branch of the $(1/p)$-th power. As they implicitly do, let’s switch to other letters for the norm power in what follows.

For example, their expression uses the norm of the distribution . The choice may seem strange, but for the uniform distribution ,

so it relates to long-known bounds involving samples for uniform distribution. The inequalities in their analysis, however, involve other values of the norm power . For , the following inequality for non-negative numbers is equivalent to the norm being monotone decreasing in :

If we raise both sides to the power we see what happens for :

so with , that is , we get the equivalent form

Thus the original form (1) for captures all the needed information, while the form for is a kind of *dual*. This also shows the idea of substituting powers for .
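Although the displayed formulas were images in the original post, the monotonicity fact itself is classical and easy to sanity-check numerically. The sketch below (ours, not from the paper) verifies that for a fixed non-negative vector the quantity $(\sum_i a_i^p)^{1/p}$ is nonincreasing as the power $p$ increases:

```python
# Sanity check (a sketch, not from the paper): for a fixed non-negative
# vector, the p-norm (sum of p-th powers, then the 1/p-th root) is
# monotone nonincreasing as the power p grows.

def p_norm(xs, p):
    return sum(x ** p for x in xs) ** (1.0 / p)

xs = [0.3, 1.7, 2.0, 0.05, 4.1]
powers = [0.5, 1, 2, 3, 10]
norms = [p_norm(xs, p) for p in powers]

# each successive norm is no larger than the previous one
assert all(norms[i] >= norms[i + 1] - 1e-12 for i in range(len(norms) - 1))
```

Equality across different powers occurs only when at most one entry of the vector is nonzero.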

When we have a second non-negative sequence we can give Otto Hölder’s generalization for of Cauchy-Schwarz, which is the case :

We can obtain dual forms here too, and it helps to consider substitutions of the form and similarly for . Doing so, and dividing out the right-hand sides, we obtain the basic forms for and Hölder with :
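Hölder's inequality, with Cauchy-Schwarz as the case $p=q=2$, can be spot-checked the same way; in this sketch (ours) the conjugate exponent $q = p/(p-1)$ enforces $1/p + 1/q = 1$:

```python
import random

# Spot-check of Hölder's inequality on random non-negative data (a sketch):
# sum(a_i * b_i) <= ||a||_p * ||b||_q whenever 1/p + 1/q = 1.

def p_norm(xs, p):
    return sum(x ** p for x in xs) ** (1.0 / p)

random.seed(7)
for _ in range(1000):
    n = random.randint(1, 8)
    a = [5 * random.random() for _ in range(n)]
    b = [5 * random.random() for _ in range(n)]
    p = random.uniform(1.01, 10.0)
    q = p / (p - 1)  # conjugate exponent
    lhs = sum(x * y for x, y in zip(a, b))
    assert lhs <= p_norm(a, p) * p_norm(b, q) + 1e-9
```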

Let’s just state their main auxiliary result. Specifically, they determine for all length- sequences , , and of rational numbers whether the inequality

holds for all and all non-negative real -vectors and . For example, taking , , , and gives us the Hölder inequality with , that is, Cauchy-Schwarz.

Theorem 2. The inequality (3) holds if and only if the left-hand side can be expressed as a finite product of positive powers of instances of the left-hand sides of the basic and Hölder inequalities.

Moreover, there is an algorithm that in time polynomial in and the maximum bits in any , , or will output the terms of such a product if the inequality is true, or a description of a counterexample if it is false.

The proof uses variables standing for the logarithms of sums of the form and takes logs of the inequalities to make linear constraints in these variables with non-negative values. The key idea is that if a derived objective function has a nonzero optimum then its argument (and a neighborhood of it) gives counterexamples, whereas if the optimum is then the *dual* program comes into play to yield the desired product. The bound on is not explicitly represented, but it comes out in counterexamples.

The process is so nicely effective that they are able to represent it by a human-friendly game on grid-points in the plane that is somewhat reminiscent of (and easier than) the solitaire peg-jumping game—well worth a look. They say:

Our characterization is of a non-traditional nature in that it uses linear programming to compute a derivation that may otherwise have to be sought through trial and error, by hand. We do not believe such a characterization has appeared in the literature, and hope its computational nature will be useful to others, and facilitate analyses like the one here.

In particular, it helped them with inequalities involving that “hairy” number which they used to prove their main results about sampling.

The short answer to why their algorithm runs in polynomial time is that it involves linear programming. There isn’t a tight analogy between their game moves and basic pivot moves of the simplex algorithm, however, because there are cases where the latter takes exponential time. It is unclear to us whether they are using a subcase of linear programming that stops short of being P-complete, or whether this will lead to new P-complete problems involving inequalities. They do have an even more general form with ()-many -sequences and summands for to for any , in place of the elements and for .

Most notable is that the minimum dimension of a counterexample can be *doubly* exponential in the bit-size of the arguments, and yet the algorithm can still describe it. This happens even for as they note: Consider the inequalities

The middle factor is legal since . This is true for but false for any with growing exponentially in . Yet counterexamples have succinct descriptions—as they note, taking makes it false with

There are -many trailing s, and they make large enough for the middle term to overpower the rest. Put another way, the domain formed by their constraints is regular enough that extrema and specific points in their neighborhoods have short descriptions.

We close with a speculation about duality and dimension. The inequalities become equalities only when . The Hölder inequalities become equalities only when is proportional to —that is, they are the same projective point. We wonder if their algorithm can point the way to a useful tradeoff between homogeneity and dimension, which might help tame cases where the dimension is large.

Can you find further applications for their result on inequalities? Can the class of inequalities that can be given an “automatic” treatment be extended further? For one instance we mention the Chebyshev sum inequality

which depends critically on the monotone orderings and .


*Correcting an erratum in our quantum algorithms textbook*

Cropped from source

Paul Bachmann was the first person to use -notation. This was on page 401 of volume 2 of his mammoth four-part text *Analytic Number Theory*, which was published in Germany in 1894. We are unsure, however, whether he defined it correctly.

Today we admit that we got something wrong about -notation in an exercise in our recent textbook, and we ask: what is the best way to fix it?

Bachmann was simply plugging in Leonhard Euler’s estimate of the harmonic sum in

where . He stated, “we find that

if we use the expression to represent a quantity whose order in regard to does not overstep the order of ; whether it really has components of order inside it is left undetermined in the above derivation.”

The big clearly stands for *order*—*Ordnung* in German—but where does he define *Ordnung*? According to the search at the University of Michigan Historical Math Collection page for Bachmann (click the third item, which is mis-labeled), the word *Ordnung* appears on eleven previous pages of volume 2 in several senses. However, the closest I find to a definition is a place on pages 355–356 where he discusses a formula by Adrien-Marie Legendre:

If one says that a quantity is of order when for unboundedly growing the ratio of to for is unboundedly large, whereas for it is unboundedly small, then one can show that Legendre’s expression is not precise enough up to quantities of the order inclusive, rather that among all functions of the form this is only the function .

I will not claim that my translation has enhanced the clarity but I don’t think it has diminished it much either—I think it is -of the actual clarity. Edmund Landau, who also introduced little- and did much to rigorize and popularize both notations, only stated that he found the notation in Bachmann’s book, not the definition.

My dictionary defines *erratum* as *an error in writing or printing*, but Wikipedia defines it as a *correction* of such an error. The latter is clearly meant when one publishes an erratum, while an *errata* page is meant to indicate both the errors and the fixes. My subtitle “correcting an erratum…” may seem the former—a tony way of saying “fixing an error”—but it’s not. We need to fix the first *erratum* on our own errata page.

The original error appears in the exercises to Chapter 2 of our recent textbook, *Quantum Algorithms via Linear Algebra*. In the book the exercise reads,

Show that a function is bounded by a polynomial in , written , if and only if there is a constant such that for all sufficiently large , .

When I wrote the exercise I had in mind this proof of the direction: Pretend is a real function with that property and consider the integral of from to . We can rewrite this as the integral from to of . The assumption bounds the latter by times the integral of from to . Iterate this times in all until is less than the fixed and finite value underlying the phrase “sufficiently large .” This gives the following bound on the integral:

where is the maximum value of on . Jumping back to being originally defined on integers avoids any potential nastiness about , and the bound on the integral clearly applies to . I figured the direction was immediate since with .

I had intended that time functions are monotone, that is , as holds automatically if means the worst-case running time on inputs of length *at most* . I received one e-mail showing this false when could decrease, and thought adding the word “monotone” fixed the problem. However it does not.

A clever counterexample was sent to us last weekend by Marcelo Arenas and Pedro Bahamondes Walters of the Pontifical Catholic University of Chile in Santiago. They defined a function that stays bounded by but alternates between behaving linearly and suddenly jumping up to meet the parabola again:

We can convey in words the essence of the rigorous proof they provided, technically defining a slightly different function : Consider any point on the parabola. Make a line go northeast at 45° through the points for up to some . The line ends at the point ; put . Now continue the line on a steeper slope back up to meet the parabola at the point

The slope of the segment from to is the ratio , and it is

This shows that the ratio is nowhere near staying bounded by a constant , either as or is made as large as desired, even though . So the direction is wrong.
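The mechanism of their counterexample is easy to simulate. The sketch below is a variant construction of our own, with milestones $x_{k+1} = x_k^2$ in place of their exact 45-degree line: the function is monotone and stays below the parabola, yet the ratio $f(2n)/f(n)$ blows up just below each milestone.

```python
import bisect

# Milestones x_{k+1} = x_k^2.  On [x_k, x_{k+1}) the function climbs with
# slope 1 from the parabola value x_k^2, then jumps back up to the parabola.
milestones = [2]
while milestones[-1] < 10**10:
    milestones.append(milestones[-1] ** 2)

def f(n):
    k = bisect.bisect_right(milestones, n) - 1
    x = milestones[k]
    return x * x + (n - x)

# f is monotone and polynomially bounded (f(n) <= n^2)...
assert all(f(n) <= n * n for n in range(2, 3000))
assert all(f(n) <= f(n + 1) for n in range(2, 3000))

# ...yet f(2n)/f(n) is unbounded: sample just below successive milestones
ratios = [f(2 * (x - 1)) / f(x - 1) for x in milestones[1:4]]
assert ratios[0] < ratios[1] < ratios[2]   # the ratios keep growing
```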

The motivation that we want to convey is that a polynomial bound is really a *linear* bound as the size of the data doubles. Dick recalls Bob Sedgewick emphasizing this point in his teaching. An old post discussed Bob’s idea of determining running times of real polynomial algorithms empirically by sampling cases where and fitting .

We could try to fix things by using in place of . The bad function above is excluded because it is not for any . Now the forward direction holds because:

However, now we have lost the converse direction, because still allows to wobble enough that it is not Theta-of any one polynomial. We could try to demand that be asymptotic to for some fixed constant (that is, so that as ), but then we lose the forward direction again.

Besides, once we try to be more specific about the asymptotics, we lose the simple motivation a-la Sedgewick. What is the simplest way to preserve it?

Dick recalled a post by Terence Tao on Mikhail Gromov’s theorem characterizing finitely-generated groups of polynomial growth as those having nilpotent subgroups of finite index. The growth is of neighborhoods of radius in the Cayley graphs of the groups with generators . Tao describes a new proof of Gromov’s theorem by Bruce Kleiner, and for simplicity uses the stronger condition for some constant and all . He writes (emphasis his):

In general, polynomial growth does not obviously imply bounded doubling at all scales, but there is a simple pigeonhole argument that gives bounded doubling on most scales, and this turns out to be enough to run the argument below. But in order not to deal with the (minor) technicalities arising from exceptional scales in which bounded doubling fails, I will assume bounded doubling at all scales.

Unfortunately this insight is specific to the geometry of the Cayley graphs and does not carry over simply to growth rates—functions of the form above can be made to violate bounded doubling for many long intervals. We wonder instead about restricting to some “natural” class of time functions. Donald Knuth’s seminal 1976 paper on asymptotic notation reminds us of this theorem by Godfrey Hardy:

Theorem 1. If and are any functions built up recursively from the ordinary arithmetic operations and the exp and log functions, we have exactly one of the three relations , for some , or

The equivalence in our exercise seems to hold for Hardy’s class of functions—at least the mechanism of the counterexample is excluded. Is the forward direction now easy to prove?

Even so, Hardy’s class might be felt too restrictive to stand for running times of all natural algorithms. For one reason, it excludes the function, which figures in the analysis of algorithms for numerous problems mentioned here. For another, we can readily imagine that natural algorithms could exhibit the kind of stepwise behavior of the counterexample above, when exceptional inputs requiring extra attention are sparsely distributed.

Is there a simple and natural extra condition that makes bounded doubling equivalent to polynomial growth?

More generally, how does one express something that is “morally true” but technically false?

[missing constant C before Tao quote, earlier f “could decrease”, Marcelo Arena -> Arenas]


Broad Institute source

Nick Patterson is one of the smartest people I have ever known.

Today I would like to talk about something he once said to me and how it relates to solving open problems.

Nick now works at the Broad Institute on genomic problems—especially large data sets. For decades, he worked as a cryptographer for the British, and then the U.S. He also spent many years with Renaissance Technologies, Jim Simons’s investment hedge fund.

Years ago I consulted regularly at IDA—a think tank for the NSA that was based in Princeton. I cannot tell you what we did, nor what we did not. But I can say we worked on interesting problems in the area of communication. One of the top scientists there was Nick. He was a quite strong chess player, and at tea would often play the other top player at speed chess. I once asked Nick how he would fare against the then world champion Anatoly Karpov. He answered that it would be:

Sack, sack, mate.

I never really believed this. I always thought that he was strong enough to hold out a bit, but perhaps Nick was right. Perhaps it would be over quickly. That Karpov would find a simple, short, demonstration that would wipe Nick out.

But given how smart Nick was I often wondered if he was right when we translate his statement to mathematics:

Can open problems have simple solutions?

Could some open problems fall to: “We know […] and by induction […] we are done”?

Leonhard Euler thought long and hard about possible generalizations of Fermat’s Last Theorem. He was the first to fashion a nearly-correct proof that no cube could be a nontrivial sum of fewer than three cubes. He conjectured that no 4th-power could be a nontrivial sum of fewer than four 4th-powers, no 5th-power a sum of fewer than five 5th-powers, and so on. These cases were open for nearly two centuries, but in 1966, Leon Lander and Thomas Parkin used a computer to find

It still took 20 more years for the other shoe to drop on the case:

This was the smallest member of an infinite family of solutions found by Noam Elkies—himself a master chess player—though not the smallest overall, which is

.
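These identities are easy to confirm in exact integer arithmetic. The checks below cover Lander and Parkin’s fifth-power solution, the first member of Elkies’s fourth-power family, and the overall smallest fourth-power solution, which Roger Frye located by computer search:

```python
# Lander and Parkin (1966): Euler's conjecture fails for fifth powers.
assert 27**5 + 84**5 + 110**5 + 133**5 == 144**5

# Elkies (1988): the first member of his infinite family for fourth powers.
assert 2682440**4 + 15365639**4 + 18796760**4 == 20615673**4

# Frye: the smallest fourth-power counterexample, found by computer search.
assert 95800**4 + 217519**4 + 414560**4 == 422481**4
```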

The shortest solution to a problem in complexity theory that was important and open for several years was:

Well, this needs a little context. If either or is the identity, then this *group commutator* is the identity. Else in a large enough permutation group like , you can rig it so that is never the identity. Thus if you have a piece of code such that if some property holds and otherwise, and likewise gives iff holds, then the code computes the AND gate .

Since it is easy to code by the code , and since AND and NOT form a complete basis, the upshot is that all -depth Boolean formulas can be computed by codes of size . Permutations of 5 elements allow codes in the form of *width-5 branching programs*, so what tumbles out is the celebrated theorem of David Barrington that the complexity class has constant-width branching programs. GLL’s old post on it played up the simplicity, but like many mate-in-*n* chess problems it was not simple to see in advance.
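The commutator gadget can be seen concretely. The sketch below (ours, not Barrington’s exact construction) searches the 5-cycles of $S_5$ for a pair whose commutator is again a 5-cycle; substituting the identity for either input collapses the commutator to the identity, which is exactly the AND behavior described above.

```python
from itertools import permutations

def compose(p, q):
    # (p . q)(i) = p[q[i]]
    return tuple(p[q[i]] for i in range(5))

def inverse(p):
    inv = [0] * 5
    for i, v in enumerate(p):
        inv[v] = i
    return tuple(inv)

def is_5_cycle(p):
    # a permutation of {0,...,4} is a 5-cycle iff the orbit of 0 has size 5
    seen, x = set(), 0
    while x not in seen:
        seen.add(x)
        x = p[x]
    return len(seen) == 5

def commutator(a, b):
    return compose(compose(a, b), compose(inverse(a), inverse(b)))

identity = (0, 1, 2, 3, 4)
cycles = [p for p in permutations(range(5)) if is_5_cycle(p)]

# find two 5-cycles whose commutator is again a 5-cycle
sigma, tau = next((s, t) for s in cycles for t in cycles
                  if is_5_cycle(commutator(s, t)))

assert is_5_cycle(commutator(sigma, tau))       # both inputs "true"
assert commutator(sigma, identity) == identity  # one input "false"
assert commutator(identity, tau) == identity
```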

More recently, a seemingly complicated conjecture by Richard Stanley and Herbert Wilf about permutations was proved in an 8-page paper by Adam Marcus and Gábor Tardos. For any sequence of distinct integers define its *pattern* to be the unique permutation of that has the same relative ordering. For instance, the pattern of is . Given any length- pattern and define to be the set of permutations of for which *no* length- subsequence (not necessarily consecutive) has pattern . Whereas grows as , Stanley and Wilf conjectured that the maximum growth for is , for any fixed . Marcus and Tardos polished off a stronger conjecture by Zoltán Füredi and Péter Hajnal about matrices in basically 1.5 pages whose short lemmas read like sack, sack, mate.
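The definitions here translate directly into code. As a sketch, the snippet below computes patterns by relative order and verifies a classical data point consistent with the Stanley-Wilf growth bound: the permutations avoiding the length-3 pattern 123 are counted by the Catalan numbers.

```python
from itertools import combinations, permutations

def pattern(seq):
    # the relative-order pattern of a sequence of distinct numbers
    order = sorted(seq)
    return tuple(order.index(x) + 1 for x in seq)

def avoids(perm, pat):
    # combinations() yields subsequences in their original order
    return all(pattern(sub) != pat for sub in combinations(perm, len(pat)))

def count_avoiders(n, pat):
    return sum(1 for p in permutations(range(1, n + 1)) if avoids(p, pat))

# permutations with no increasing subsequence of length 3 are counted by
# the Catalan numbers 1, 2, 5, 14, 42, ...
assert [count_avoiders(n, (1, 2, 3)) for n in range(1, 6)] == [1, 2, 5, 14, 42]
```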

We found the last example highlighted in a StackExchange thread on open problems solved with short proofs. Another we have seen mentioned elsewhere and covered in a post is Roger Apéry’s proof that $\zeta(3)$ is irrational, which centered on deriving and then analyzing the new identity

The Euler examples could be ascribed mostly to “computer-guided good fortune” but the other three clearly involve a new idea attached to a crisp new object: an equation or group or matrix construction. We wonder whether and when the possibility of such objects can be “sniffed out.”

I sometimes wonder if we have missed a simple proof—no, short is a better term—a short proof that would resolve one of our favorite open problems. Here are a couple that Ken and I suggest may fall into this category.

Péter Frankl’s conjecture. Surely it just needs an equation or formula for the right kind of “potential function”? We noted connections to other forms in a followup post.

Freeman Dyson’s conjecture. It stands out even among number-theory statements that are “overwhelmingly probable” in random models, and there ought to be some principle for boxing away potential counterexamples along lines of how certain numbers with fast approximations can be proved to be transcendental. (Then again hasn’t yet been proved transcendental.)

Problems involving deterministic finite automata (DFAs) are often easy to solve or show to be decidable. But here is one from Jeff Shallit that is not: Given a DFA, does it accept a string that is the binary representation of a prime number—is this decidable?

Even simpler is the question, given binary strings and of length , of finding the smallest DFA that accepts but not (or vice-versa). Can we put tight bounds on the size of in terms of ?

Can a Boolean circuit of size computing a function be converted into a circuit of size with gates each computing the XOR of with ? This would be a Boolean analogue of the famous Derivative Lemma of Walter Baur and Volker Strassen, which we have discussed here and here.

By contrast, the 3n+1 conjecture of Lothar Collatz does not strike us as a candidate. John Conway proved that related forms are undecidable, and we suspect that it will need deep new knowledge of how intuitively multiplicative properties of integers duck-and-weave under the additive structure.

Regarding the Skolem problem we are on the fence: in a sense it involves DFAs which should be easy, but even for small the connections to deep approximation problems in number theory become clear.

Here are four others—what do you think?

*Can polynomials modulo compute the majority function?* Of course we mean polynomials of low degree and we allow some flex on the meaning of “compute” provided it relates to polynomial-sized constant-depth circuits.

*Can we at least prove that SAT cannot be computed in linear time on a Turing Machine?* Note that this does not follow from the proof of as we discussed here. For all those of you who are sure that not only does , but also that ETH is true, how can we not prove this “trivial lower bound”?

*Can whether an -node graph has a triangle be decided in time?* Or at least better than matrix-multiplication time? Ken knew a visitor to Buffalo who admitted spending a year on this with little to show for it.

*Can we solve the lonely runner problem?* Or at least improve the number of runners that it is known to be true for?

How can we possibly forecast how open an open problem is?


*An insight into the computation of financial information*

Columbia memorial source

Joseph Traub passed away just a week ago, on August 24th. He is best known for his computer science leadership positions at CMU, Columbia, CSTB, the *Journal of Complexity*—they all start with “C.” CSTB is the Computer Science and Telecommunications Board of the National Academies of Science, Engineering, and Medicine. At each of these he was the head and for all except Carnegie-Mellon he was the first head—the founder.

Today Ken and I wish to highlight one technical result by Traub and his co-workers that you may not know about.

I knew Joe for many years, and he will be missed. He had a sharp mind and corresponding sharp wit. There was a slight twinkle in his eye that would have called to mind a leprechaun had he been Irish rather than German—from a family that left after the Nazis seized his father’s bank in 1938 when he was six. He was a delight to interact with on almost any topic.

Joe created an entire area of theory: information-based complexity theory. To quote its website:

Information-based complexity (IBC) is the branch of computational complexity that studies problems for which the information is

partial, contaminated, and priced. [Functions over the unit cube] must be replaced by … finite sets (by, for example, evaluating the functions at a finite number of points). Therefore, we have only partial information about the functions. Furthermore, the function values may be contaminated by round-off error. Finally, evaluating the functions can be expensive, and so computing these values has a price.

Traub was helped greatly by a brilliant colleague, Henryk Woźniakowski, who developed key ideas alongside him in the 1970s that were distilled in their 1980 monograph, *A General Theory of Optimal Algorithms*. A shorter source is Joe’s seminal paper, “An Introduction to Information-Based Complexity.” A statement on Woźniakowski’s Columbia webpage shows his sustaining interest in IBC:

I am mostly interested in computational complexity of continuous problems. For most continuous problems, we have only partial information. For problems defined on spaces of functions, this partial information is usually given by a finite number of function values at some points which we can choose. One of the central issues is to determine how many pieces of information, or function values, are needed to solve the computational problem to within a prescribed accuracy.

I must admit that years ago I opened my *Bulletin of the AMS*, in 1992 to be exact, and was shocked to see an article titled: “Some Basic Information On Information-Based Complexity Theory” written by the eminent numerical analyst Beresford Parlett. The abstract got right to the point:

…Why then do most numerical analysts turn a cold shoulder to IBCT? Close analysis of two representative papers reveals a mixture of nice new observations, error bounds repackaged in new language, misdirected examples, and misleading theorems.

What is interesting about this paper is that the attack on IBCT=IBC is on the idiosyncrasies of its framework. The claim is not that there are errors—technical errors—in the IBC papers, but rather that the model, the framework, is uninteresting and misleading. You should take a look at the very readable paper of Parlett to decide for yourself.

Traub of course disagreed and had a chance to fire back an answer. I believe that Parlett’s attack actually had little effect on IBC, except to make it better known in the mathematical community.

What was the tiff about? We can try to boil it down even more than Parlett’s article does.

Integrals are fundamental to most of mathematics. And in many practical cases the integrals cannot be exactly computed. This leads to a huge area of research on how to approximate an integral within some error. Even simple ones like

can be challenging for nasty functions . Worse still are multi-dimensional integrals

where is a -dimensional region. Speaking very roughly, let’s say and what matters to is whether components are “low” (toward ) or “high” (toward ) in each dimension. If the gradient of in each dimension behaves differently for each setting of other components being “low” or “high,” then we might need samples just to get a basic outline of the function’s behavior. The IBC website fingers this exponential issue as the “curse of dimensionality.”

Now as ordinary complexity theorists, our first instinct would be to define properties intrinsic to the **function ** and try to prove they cause high complexity for **any** algorithm. Making a continuous analogy to concepts in discrete Boolean complexity, drawing on papers like this by Noam Nisan and Mario Szegedy, we would try to tailor an effective measure of “sensitivity.” We would talk about functions that resemble the -ary parity function in respect of sensitivity but don’t have a simple known integral. Notions of being “isotropic” could cut both ways—they could make the sensitivity pervasive but could enable a good global estimate of the integral.

IBC, however, focuses on properties of **algorithms** and restrictions on the kind of **inputs** they are given. Parlett’s general objection is that doing so begs the question of a proper complexity theory and reverts to the standard—and hallowed enough—domain of ordinary numerical analysis of algorithms.

Ken and I find ourselves in the middle. Pure complexity theory as described by Parlett has run into barriers even more since his article. We wrote a pair of posts on bounds against algorithms that are “progressive.” The IBC website talks about **settings** for algorithms. This seems to us a fair compromise notion but it could use more definition.

One classic setting for approximating these high-dimensional integrals is the Monte-Carlo method. The common feature of this setting is that one samples points from and takes the average of the values of . This method yields a fair approximation, but the convergence is slow. For example, the dimension is 360: the number of months in a 30-year **collateralized mortgage obligation** (CMO). To get error requires evaluations of the function—clearly a completely hopeless task. Financial firms on Wall Street that needed to estimate fair market values for CMOs were using Monte Carlo methods as best they could.

Enter Traub with his Columbia IBC research group in the early 1990s. They—in particular his student Spassimir Paskov in late 1993 and early 1994—discovered and verified that a non-random method would beat the Monte-Carlo by many orders of magnitude. This was based on an old idea that uses low-discrepancy sequences. These are sequences of points inside a region that have good coverage of the space. For the unit cube as above, one may specify a class of natural measurable subsets of ; then low discrepancy means that for every the proportion of points in that belong to is close to the measure of —and moreover this happens for every long enough consecutive subsequence of .

In the worst case the resulting *Quasi-Monte Carlo* (QMC) methods would perform much more poorly than Monte-Carlo methods. However, the surprise was that for actual financial calculations they worked well—really well.
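To get the flavor of the comparison, here is a toy sketch of our own (a separable test integrand, not a CMO valuation): a hand-rolled Halton low-discrepancy sequence against plain Monte Carlo on the unit cube.

```python
import random

def van_der_corput(i, base):
    # radical-inverse of i in the given base: the classic low-discrepancy tool
    f, r = 1.0, 0.0
    while i > 0:
        f /= base
        r += f * (i % base)
        i //= base
    return r

def halton_point(i, bases):
    return [van_der_corput(i, b) for b in bases]

def integrand(x):
    # product of coordinates; exact integral over the unit cube is 2^(-d)
    prod = 1.0
    for xi in x:
        prod *= xi
    return prod

d, n = 5, 4096
bases = [2, 3, 5, 7, 11]      # one small prime per dimension
exact = 2.0 ** (-d)

qmc = sum(integrand(halton_point(i, bases)) for i in range(1, n + 1)) / n

random.seed(1)
mc = sum(integrand([random.random() for _ in range(d)]) for _ in range(n)) / n

print("QMC error:", abs(qmc - exact), " MC error:", abs(mc - exact))
```

On runs like this the Halton estimate is typically an order of magnitude or more closer to the exact value, though which method wins on any particular random seed is not guaranteed.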

As with any change in basic methodology, the work and claims of Traub’s group were met with great initial skepticism by the financial firms. However, the performance was and remains so good that many financial calculations still use QMC. This raised a further beautiful problem that loops back to the initial mission of IBC:

Why do the discrepancy-based methods perform so well?

That is, what properties of the particular high-dimensional integrals that occurred with CMOs and other applications made non-random methods work so well?

There are some ideas of why this is true in the actual financial calculations, but the basic question remains open. Wikipedia’s IBC article picks up the story in a way that connects to Parlett’s point though neither he nor his article is mentioned:

These results are empirical; where does computational complexity come in? QMC is not a panacea for all high-dimensional integrals. What is special about financial derivatives?

It continues:

Here’s a possible explanation. The 360 dimensions in the CMO represent monthly future times. Due to the discounted value of money, variables representing times far in the future are less important than the variables representing nearby times. Thus the integrals are non-isotropic.

Working with Ian Sloan, Woźniakowski introduced a notion of *weighted spaces* that leverages the above observation. Their 1998 paper laid down conditions on the integrals and showed that they are rendered tractable in the QMC setting, often in worst case where Monte Carlo gives average case at best. How far this positive result can be extended—for which other classes of integrals does QMC beat MC—is a concrete form of the wonderful problem.

Our condolences to the Traub family and to our colleagues at Columbia and Santa Fe. The Santa Fe Institute’s memorial includes a nice statement by our friend Joshua Grochow relating to workshops led by Joe and a joint paper on limits to knowledge in science.

[modified 1990s dates on QMC for CMOs following this; see also this comment.]


Cricketing source

Andrew Granville is a number theorist who has written—besides his own terrific research—some beautiful expository papers, especially on analytic number theory.

Today Ken and I wish to talk about his survey paper earlier this year on the size of gaps between consecutive primes.

The paper in question is here. It is a discussion of the brilliant breakthrough work of Yitang Zhang, which almost solves the famous twin-prime conjecture. You probably know that Zhang proved that gaps between consecutive primes are infinitely often bounded by an absolute constant. His constant initially was huge, but using his and additional insights it is now at most and plans are known that might cut it to .

As Andrew says in his paper’s introduction:

To Yitang Zhang, for showing that one can, no matter what

We believe we need the same attitude to make progress on some of our “untouchable” problems. Perhaps there is some budding complexity theorist who is making ends meet at a Subway™ sandwich shop, and who—while solving packing problems between rolls in real time—is finding the insights that can unlock the door to some of our open problems. Could one of these be ready to fall?

- Is NP closed under complement?
- What is the power of polynomials?
- Can we at least prove SAT is not computable in linear time?
- And so on …

Who knows.

Throughout mathematics—especially number theory—computing the number of objects is of great importance. Sometimes we can count the exact number of objects. For example, it is long known that there are exactly $n^{n-2}$ labeled trees on $n$ vertices, thanks to Arthur Cayley.

The number of labeled planar graphs is another story—no exact formula is known. The current best estimate is an asymptotic one for the number of such graphs on vertices:

where and .

Thanks to a beautiful result of the late Larry Stockmeyer we know that estimating the number of objects from a large set may be hard, but is not too hard, in the sense of complexity theory. He showed that

Theorem 1. For any one can design an -time randomized algorithm that takes an NP-oracle , a predicate decidable in -time (where ), and an input and outputs a number such that with probability at least , the true number of such that holds is between and

The predicate is treated as a black box, though it needs to be evaluated in polynomial time in order for the algorithm to run in polynomial time. The algorithm has a non-random version that runs within the third level of the polynomial hierarchy.

We won’t state the following formally but Larry’s method can be extended to compute a sum

to within a multiplicative error without an NP-oracle, provided the map is computable in polynomial time, each , and the summands are not sparse and sufficiently regular. Simply think of , identify numbers with strings so that runs over , and define to hold if .

Larry’s method fails when a sum has both negative and positive terms; that is, when there is potential cancellation. Consider a sum

Even if the terms are restricted to be and , his method does not work. Rewrite the sum as

Then knowing and approximately does not yield a good approximation to the whole sum . We could have and where is a large term that cancels so that the sum is . So the cancellation could make the lower order sums and dominate.
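A toy numeric illustration of the point (ours): two sums can each be known to within 1 percent while their difference is off by many orders of magnitude.

```python
# True sums with massive cancellation: A - B = 5.
A, B = 10**12 + 5, 10**12

# A 1% multiplicative approximation of each sum is free to report:
A_approx = 1.005 * A
B_approx = 0.995 * B

# Both approximations are excellent in relative terms...
assert abs(A_approx - A) / A <= 0.01
assert abs(B_approx - B) / B <= 0.01

# ...but their difference is off by roughly ten orders of magnitude.
print(A - B, "versus", A_approx - B_approx)
```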

This happens throughout mathematics, especially in number theory. It also happens in complexity theory, for example in the simulation of quantum computations. For every language $L$ in BQP there are Stockmeyer-style predicates $R^+$ and $R^-$ such that, where $h$ equals the number of Hadamard gates in poly-size quantum circuits recognizing $L$, the acceptance amplitude on an input $x$ is given by

$$\frac{\#\{y : R^+(x,y)\} - \#\{y : R^-(x,y)\}}{\sqrt{2}^{\,h}}.$$

Although the individual sums can be as high as $2^h$, their difference “miraculously” stays within $\sqrt{2}^{\,h}$, and the former is never less than the latter—no absolute value bars are needed on the difference. See this or the last main chapter of our quantum algorithms book for details. Larry’s algorithm can get you a $(1+\epsilon)$ approximation of both terms, but precisely because the difference stays so small, it does not help you approximate the probability. This failing is also why the algorithm doesn’t by itself place BQP within the hierarchy.

What struck Ken and me first is that the basic technique used by Yitang Zhang does not need to estimate a sum, only to prove that it is positive. This in turn is needed only to conclude that *some term* is positive. Thus the initial task is lazier than doing an estimate, though estimates come in later.

The neat and basic idea—by Daniel Goldston, János Pintz, and Cem Yıldırım in a 2009 paper that was surveyed in 2007—uses indicator terms of the form

$$\theta(n+h_1) + \theta(n+h_2) + \cdots + \theta(n+h_k) \;-\; 1,$$

where $\theta(m)$ is defined to be $1$ if $m$ is prime and is $0$ otherwise. Here $n$ runs from $N$ to $2N$ and $h_1 < h_2 < \cdots < h_k$. If the term is positive then at least two of the elements $n+h_1,\dots,n+h_k$ must be prime, which means that the gap between them is at most the fixed value $h_k - h_1$. Doing this for infinitely many $n$ yields Zhang’s conclusion that infinitely many pairs of primes have gap at most $h_k - h_1$. Can it really be this simple?

The strategy is to minimize $h_k - h_1$, but it needs to provide a way to get a handle on the $\theta$ function. This needs forming the sum

$$S = \sum_{N < n \leq 2N} \Big(\theta(n+h_1) + \cdots + \theta(n+h_k) - 1\Big)\, w(n),$$

where the $w(n)$ are freely choosable non-negative real weights. The analysis needs the $h_i$ to be chosen so that for every prime $p$ there exists $a$ such that $p$ does not divide any of $a+h_1,\dots,a+h_k$. This defines the ordered tuple $(h_1,\dots,h_k)$ as *admissible* and enters the territory of the famous conjecture by Godfrey Hardy and John Littlewood that there exist infinitely many $n$ such that $n+h_1,\dots,n+h_k$ are **all** prime. This has not been proven for **any** admissible tuple—of course the tuple $(0,2)$ gives the Twin Prime conjecture—but thanks to Zhang and successors we know it holds for a pair $(h_i,h_j)$ drawn from **some** admissible tuple.
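Admissibility is mechanically checkable: only primes $p \leq k$ can possibly have every residue class covered by the $h_i$, so the test is finite. A small sketch:

```python
# A small checker for admissibility of a tuple (h_1, ..., h_k): for every
# prime p there must be a residue avoided by all h_i mod p. Only primes
# p <= k can cover every residue, so the check is finite.
def is_admissible(h):
    k = len(h)
    for p in range(2, k + 1):
        if all(p % d for d in range(2, p)):          # p is prime
            if len({x % p for x in h}) == p:         # all residues mod p hit
                return False
    return True

print(is_admissible((0, 2)))      # True: the twin-prime tuple
print(is_admissible((0, 2, 4)))   # False: covers all residues mod 3
print(is_admissible((0, 2, 6)))   # True
```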

The analysis also needs to decouple the $h_i$ from $N$—originally there was a dependence which only enabled Goldston, Pintz, and Yıldırım to prove cases where the gaps grow relatively slowly. Andrew Granville’s survey goes far into details of how the sum and its use of the $\theta$ function must be expanded into further sums using arithmetic progressions and the Möbius function in order for techniques of analysis to attack it. The need for estimation heightens the cancellation problem, which he emphasizes throughout his article. The issue is, as he states, of central importance in number theory:

The two sums … are not easy to evaluate: The use of the Möbius function leads to many terms being positive, and many negative, so that there is a lot of cancellation. There are several techniques in analytic number theory that allow one to get accurate estimates for such sums, two more analytic and the other more combinatorial. We will discuss them all.

As usual see his lovely article for further details. Our hope is that they could one day be used to help us with our cancellation problems.

I thought that we might look at two other strategies that can be used in complexity theory, and that might have some more general applications.

**Volume trick**

Ben Cousins and Santosh Vempala recently wrote an intricate paper involving integrals that compute volumes. At its heart, however, is a simple idea. It is to represent a volume in the form

$$V_m = V_0 \cdot \frac{V_1}{V_0} \cdot \frac{V_2}{V_1} \cdots \frac{V_m}{V_{m-1}},$$

where $V_m$ is the hard quantity that we really want to compute. The trick is to select the other $V_i$’s in a clever way so that each ratio $V_i/V_{i-1}$ is not hard to approximate. Yet the cancellation yields that the product telescopes back to exactly $V_m$.

If $V_0$ is simple, and if we have a way to estimate each ratio, then we get an approximation for the hard-to-compute $V_m$.
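Here is a toy numerical sketch of the telescoping idea, estimating the area of the unit disk through a chain of nested disks. The radii schedule and sample counts are arbitrary choices for illustration, not those of Cousins and Vempala.

```python
# A toy instance of the telescoping-product volume trick: estimate the area
# of the unit disk via nested disks of radii r_0 < ... < r_m = 1. Each
# ratio vol(B_{i-1}) / vol(B_i) is estimated by uniform sampling in B_i;
# the product telescopes, leaving an estimate of vol(B_m).
import math
import random

random.seed(42)

def sample_in_disk(r):
    # rejection sampling from the bounding square
    while True:
        x, y = random.uniform(-r, r), random.uniform(-r, r)
        if x * x + y * y <= r * r:
            return x, y

def estimate_disk_area(m=5, samples=20000):
    radii = [0.5 * (2 ** (i / m)) for i in range(m + 1)]  # 0.5 up to 1.0
    est = math.pi * radii[0] ** 2      # vol(B_0), assumed known
    for i in range(1, m + 1):
        inside = sum(1 for _ in range(samples)
                     if sum(c * c for c in sample_in_disk(radii[i]))
                     <= radii[i - 1] ** 2)
        est /= inside / samples        # divide by est. of vol(B_{i-1})/vol(B_i)
    return est

area = estimate_disk_area()
print(abs(area - math.pi) / math.pi < 0.05)  # lands close to pi
```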

**Sampling trick**

Let $S = X_1 + \cdots + X_N$, where each $X_k$ is a random variable with expectation $\mu_k$. Then we know that the expectation of $S$ is equal to $\mu_1 + \cdots + \mu_N$. Note that this does *not* need the values $X_k$ to be independent.

Can we use this idea to actually compute a tough sum? The sums we have in mind are inner products

$$S = \langle u, v \rangle = \sum_{z} u_z v_z,$$

where the vectors $u$ and $v$ are exponentially long but have useful properties. They might be succinct or might carry a promised condition such as we discussed in connection with an identity proved by Joseph Lagrange.

Both cases arise in problems of simulating quantum computations. Our hope is that the regularities in these cases enable one to restrict the ways in which cancellations can occur. A clever choice of *dependence* in the values then might interact with these restrictions. Well, at this point it is just an idea. Surveying things that boil down to this kind of idea could be another post, but for now we invite you, the readers, to comment on your favorite examples.
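As a minimal sketch with made-up data, here is the unbiased single-sample estimator for an inner product. Unbiasedness follows from linearity of expectation and is verified by exact enumeration; the real obstacle, the variance, is left untouched.

```python
# A sketch of the sampling trick for an inner product S = sum_z u_z * v_z.
# With Z uniform over the N coordinates, the single sample N * u_Z * v_Z
# has expectation exactly S; linearity of expectation needs no independence.
import random

random.seed(7)
N = 1 << 10
u = [random.choice((-1, 1)) for _ in range(N)]
v = [random.choice((-1, 1)) for _ in range(N)]

S = sum(a * b for a, b in zip(u, v))

# exact expectation of the estimator, computed by enumeration
expectation = sum(N * u[z] * v[z] for z in range(N)) / N
print(expectation == S)  # True: the estimator is unbiased

# the catch is the variance: each sample is +-N, so plain Monte Carlo needs
# many samples -- which is where structure and *dependence* would have to help
def one_sample():
    z = random.randrange(N)
    return N * u[z] * v[z]
```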

Can we delineate methods for dealing with cancellations, also when we don’t need to estimate a sum closely but only prove it is positive?

[fixed omission of NP-oracle for Stockmeyer randomized counting]


Victor Shoup is one of the top experts in cryptography. He is well known for many things, including a soon-to-be-released book that is joint with Dan Boneh on, what else, cryptography; and the implementation of many of the basic functions of cryptography.

Today I want to talk about my recent visit to the Simons Institute in Berkeley where I heard Victor give a special lecture.

This lecture led to an interesting question:

How do we know if a crypto function is correctly implemented?

Before I discuss this let me report that this summer the institute is running an excellent program on cryptography. It was organized by Tal Rabin, Shafi Goldwasser, and Guy Rothblum. While I just visited for a short time last week, it seems to be an incredibly well run program, with much of the thanks going to Tal. Tal added social interaction as part of the program, about which more below. The institute director, Dick Karp, told me he thought these additional social activities were wonderful and they helped make the program so enjoyable and productive.

Victor gave his special talk on mostly known—to experts—results in basic cryptography. His talk was well attended, well structured, and well delivered. He made three points during the talk that were striking to me:

- He wrote all of the operations used in his talk using “additive” notation. Most crypto systems use some abelian group as the base of operations; of course some use more than one. Usually these groups are written using multiplicative notation: $g^{x}$. Victor switched this to $x \cdot g$. This is of course not a major difference, but it had some nice properties. Also, for those of us who are comfortable with the usual notation, it made us have to rethink things a bit.
- He repeatedly made the comment that while many of the provable methods he presented were fast enough to be practical, they were not used. He essentially said that provable methods seem to be avoided by real system builders. As you might imagine, this was not a welcome comment to a group consisting mostly of researchers who work in cryptography. Oh well.
- He also made another point: For the basic type of systems under discussion, he averred that the mathematics needed to describe and understand them was essentially high school algebra. Or as he said, “at least high school algebra outside the US.” This is not a negative point: there is nothing wrong with the systems requiring only simple operations from algebra. Deep mathematics is great, but not necessary to make a great cryptographic system.

Kathryn Farley, a researcher in performance arts, was at Victor’s talk, since she was also in town. She could not stay for the entire 90-minute lecture—she left after about one hour for another appointment. Later that day we talked and she said she had thought about asking Victor a question but, not being an expert in cryptography, was unsure whether it was naive. I asked her what it was and she replied with the question atop this post. To say it again:

How do we know if a crypto function is correctly implemented?

I told her that it was, in my opinion, a terrific question—one that we often avoid. Or at best we assume it is handled by others, not by cryptographers. I immediately said that I have discussed this and related questions before here, and I am still unsure what can be done to ensure that crypto functions are correctly implemented.

Suppose that a function is claimed to return a random number: of course it is really a pseudo-random number generator (PRNG). Let’s assume that it claims to use a strong PRNG method. How can we know? We can look at the code and check that it really works as claimed, but that is messy and time-consuming. Worse, it defeats the whole purpose of having a library of crypto functions.

A nastier example: Suppose that the function uses the following trick. If the date is before January 1, 2016, it will use the correct strong generator. If the date is later, then it will use a simple PRNG that is easy to break. Further, it makes this happen in a subtle manner that is extremely hard to detect by code inspection. How would we discover this?
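To make the scenario concrete, here is a hypothetical sketch of such a booby-trapped generator. The class name, cutoff date, and weak generator are all made up for illustration.

```python
# Hypothetical sketch of the "date trick": before the cutoff the generator
# delegates to the OS entropy source; afterward it silently switches to a
# tiny linear congruential generator with a hard-coded, guessable seed.
import datetime
import random

CUTOFF = datetime.date(2016, 1, 1)

class SneakyRand:
    def __init__(self, today=None):
        self.today = today or datetime.date.today()
        self.state = 12345  # weak generator's fixed seed

    def next_bits(self):
        if self.today < CUTOFF:
            # strong path: OS entropy, unpredictable
            return random.SystemRandom().getrandbits(32)
        # weak path: deterministic LCG, trivially breakable
        self.state = (1103515245 * self.state + 12345) % (2**31)
        return self.state

# After the cutoff the output stream is completely determined by the seed.
after = SneakyRand(datetime.date(2016, 6, 1))
print(after.next_bits())  # the same value on every run
```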

Friday at the institute was “The Tea Party.” At the end of the day all were invited to be outside near the institute’s building, and join in eating, drinking, and playing a variety of games.

The food was quite impressive—much better than the standard fare that we see at most university gatherings. It was also well attended, which probably is related to the quality of the food.

Kathryn and I were there. And it was a perfect time and place to ask some of the cryptographers about her question. We got several answers, most of which were not very satisfying. I will list the answers without directly quoting anyone—protecting the innocent, as they say.

Some said that the issue was not about crypto, but about software engineering. So it was someone else’s issue.

Some were more explicit and said that software engineers should use verification methods to check correctness.

Others said that the code should be checked carefully by making it open source: basically relying on the “crowd” for correctness.

None of these methods is foolproof. Which raises the question: how can we actually be sure that all is well?

I was not very happy with any of the answers that we got. I would suggest that this correctness problem is special: because it is about crypto systems, it may have approaches that work which do not work for arbitrary code.

Here are some ideas that perhaps can be used. I think they are just examples of what we may be able to do, and I welcome better ideas.

*Property Testing:*

Many crypto functions satisfy nontrivial mathematical properties, which is quite different from arbitrary code. Consider RSA with modulus $N$ and encryption function $E$. One could check, for many random $x$ and $y$, that

$$E(x) \cdot E(y) \equiv E(x \cdot y) \pmod{N}.$$

And the same with the decode function. This does not imply correctness, but a failure would mean the code is incorrect.
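A sketch of such a property test using textbook toy RSA parameters (the key below is far too small for real use):

```python
# Property testing RSA's multiplicative property E(x)*E(y) = E(x*y) mod N.
# The toy key (N = 61 * 53) is for illustration only.
import random

N, e, d = 3233, 17, 2753

def encrypt(x):
    return pow(x, e, N)

def decrypt(c):
    return pow(c, d, N)

def passes_multiplicativity(trials=100):
    for _ in range(trials):
        x, y = random.randrange(1, N), random.randrange(1, N)
        if (encrypt(x) * encrypt(y)) % N != encrypt((x * y) % N):
            return False
    return True

print(passes_multiplicativity())  # a correct implementation prints True
```

A passing run does not prove the code correct, but any failing trial is a definite bug.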

*Version Testing:*

If the code is critical one could in principle implement the function twice, or more times. This is best done by different vendors. Then the outputs would be checked to see whether they are identical. Note that since crypto functions usually operate over finite fields or rings, there are no rounding errors. Thus, the different implementations should produce identical values.

Note this is a standard idea in software engineering—see this. The key assumption is that different versions should have different errors. This appears not always to hold in practice, but the method still has something behind it. Wikipedia says: N-version programming has been applied to software in switching trains, performing flight control computations on modern airliners, electronic voting (the SAVE System), and the detection of zero-day exploits, among other uses.
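A minimal sketch of the idea, with two independent implementations of modular exponentiation standing in for two vendors' versions:

```python
# Version testing in miniature: two independent implementations of modular
# exponentiation are compared on random inputs; any disagreement flags a
# bug in at least one version. Over integers mod m there is no rounding,
# so outputs must match exactly.
import random

def modexp_builtin(b, e, m):
    return pow(b, e, m)  # version 1: Python's built-in

def modexp_square_multiply(b, e, m):
    # version 2: explicit square-and-multiply
    result, base = 1, b % m
    while e > 0:
        if e & 1:
            result = (result * base) % m
        base = (base * base) % m
        e >>= 1
    return result

def versions_agree(trials=1000):
    for _ in range(trials):
        b, e, m = (random.randrange(1, 10**6) for _ in range(3))
        if modexp_builtin(b, e, m) != modexp_square_multiply(b, e, m):
            return False
    return True

print(versions_agree())  # prints True when the versions match everywhere
```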

*Check No State:*

A crypto function should be a *function* and have no state. The date “attack” I gave earlier shows that functions should be forced to not be able to create any state. I believe that it may be possible to “sandbox” the code of an alleged function so that it cannot keep any state. This obviously makes the date trick impossible.
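One cheap partial check in this spirit is to call the alleged function repeatedly on the same inputs and insist on identical outputs. The helper below is a hypothetical sketch: it catches hidden counters, though not clock-based tricks, which would need sandboxing of the time source as well.

```python
# A hypothetical sketch: call the alleged function repeatedly on the same
# inputs and insist on identical outputs. This catches hidden counters,
# though not clock-based tricks, which need sandboxing of the time source.

def is_stateless_on(f, inputs, repeats=5):
    first = [f(x) for x in inputs]
    return all([f(x) for x in inputs] == first for _ in range(repeats))

# a pure function passes
print(is_stateless_on(lambda x: (x * x + 1) % 97, range(50)))  # True

# a function with a hidden counter is caught
counter = {"n": 0}
def sneaky(x):
    counter["n"] += 1
    return x + (counter["n"] // 100)  # output drifts as calls accumulate
print(is_stateless_on(sneaky, range(50)))  # False
```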

How can we be sure that crypto functions are correctly implemented? I suggest simply that cryptographers should start to view this issue as another type of attack that can be used against their systems. Perhaps viewing it this way will lead to new ideas and better, more secure systems.


*How important is “backward compatibility” in math and CS?*


Henri Lebesgue came the closest of anyone we know to changing the value of a mathematical quantity. Of course he did not do this—it was not like defining π to be 3. What he did was change the accepted definition of *integral* so that the integral from $0$ to $1$ of the characteristic function of the rational numbers became a definite $0$. It remains $0$ even when integrating over all of $\mathbb{R}$.

Today we talk about changing definitions in mathematics and computer programming and ask when it is important to give up continuity with past practice.

Continuity is an example of such continuity. Bernard Bolzano’s original definition of 1817 amounts to much the same as the standard “epsilon-delta” definition by Karl Weierstrass a half-century later: A function $f$ between metric domains $(X,d_X)$ and $(Y,d_Y)$ is continuous if for every point $x \in X$ and $\epsilon > 0$ there is a $\delta > 0$ such that for all points $x'$ having $d_X(x,x') < \delta$, it follows that $d_Y(f(x),f(x')) < \epsilon$. The newer definition is that $f$ is continuous with respect to topologies $\mathcal{T}_X$ on $X$ and $\mathcal{T}_Y$ on $Y$ if for every open set $V$ of $Y$, the inverse image

$$f^{-1}(V) = \{x \in X : f(x) \in V\}$$

is an open set of $X$.

This is backward-compatible: The metrics uniquely define topologies on $X$ and $Y$ whose basic open sets have the form $B(x,\delta) = \{x' : d_X(x,x') < \delta\}$, and similarly for $Y$. For these topologies the newer definition is equivalent to the older one. Of course one virtue of the new definition is that it can be used for topologies that do not give rise to metrics.

Lebesgue’s change to the meaning of *integral* was not backward compatible in terms of the *upper limit*. It changed the status in texts that referenced Bernard Riemann’s definition, which used approximations by rectangles. For the characteristic function $f$ of the rational numbers in $[0,1]$, the Riemann upper sum is $1$ whereas the Lebesgue upper sums—using open coverings that need not induce partitions of $[0,1]$—approach zero and so meet the lower sum. It *is* compatible whenever a function satisfies Riemann’s definition that the upper and lower approximations by rectangles meet in the limit. One can still talk about Riemann integrability, perhaps as modified by Thomas Stieltjes, but the difference is that you must include one or both names—if you don’t then you default to Dr. Lebesgue’s definition, and *then* the status for functions like $f$ is different. Most to our point, the “interface” for working with the Lebesgue integral differs—especially in employing the notion of open coverings.

Compare whether an audio “recording” can still refer to a vinyl phonograph record. When I was a teenager we said “digital audio” or “digital recording” and there was much debate about whether the quality could ever equal that of a physical LP record. It didn’t take long for progress in recording density and playback to tilt the field toward those who feel CDs sound better. Now I don’t think the analog meaning is ever intended without a qualifier.

Digital opens new vistas of sound, much as Lebesgue’s definition founded a great expansion of real and complex analysis. The point we’re making is that it is not backward compatible in operation. You can play the same music as an LP—much as the Lebesgue integral gives the same value as the Riemann integral for most common functions—but you cannot slip an LP into a CD player. Whether CDs should be playable on DVDs, and DVDs on later optical drives, has been the burning issue on the design side.

Less clear is whether electric guitars are a break from acoustic guitars. The notation and interface for playing them are largely the same. It is even possible to bolt on an extension to make an acoustic guitar sound electric at the source. Thus electric guitars have not really changed the definition of “guitar.”

Also problematic is whether a change in nomenclature amounts to a change in definition. I support the position that “pi” should have been defined as 2π = 6.283185307… We’ve argued that the Indian pioneer Aryabhata had this value in mind in the 5th century CE. Doing so would change the look of equations but not the interface.

Physicists use $\hbar$ more often than they use Max Planck’s original constant $h = 2\pi\hbar$. Sometimes $\hbar$ is called the “reduced” Planck constant or named for Paul Dirac, but even when “Planck’s constant” is used to mean $\hbar$ in speech, nothing is actually being *redefined*. The two notions of ‘pi’ or ‘$h$’ are completely convertible.

I’ve been thinking of this recently because I’ve adopted two breaks from the standard definition of codes for chess positions, which I covered earlier this year. One fixes an admitted mistake by the definition’s second creator while the other is needed to adapt to more general forms of chess. Let’s see how they capture in microcosm some larger issues in research and everyday programming.

Consider the position after the reasonable moves 1. e4 e5 2. d4 exd4 3. Bc4 Bb4+ 4. Bd2 Bc5 5. Nf3 Nf6 6. e5 Qe7 7. Bb3 d5:

The FEN code for this position—minus the last two components which do not affect the *position* according to the laws of chess—is:

rnb1k2r/ppp1qppp/5n2/2bpP3/3p4/1B3N2/PPPB1PPP/RN1QK2R w KQkq d6

The first part tells where the pieces are, then ‘w’ means White is to move, and ‘KQkq’ means that both White and Black retain the right to castle both kingside and queenside. The fourth part indicates that Black’s last move 7…d5 was with a pawn that jumped over the square d6, which *might* enable an en-passant capture. In fact there is a White pawn on e5 in a position where it could possibly make such a capture, but the move “8. exd6 e/p” is not possible because it would leave White’s king in check from Black’s queen. Note that the ‘Q’ in ‘KQkq’ does not mean White can castle queenside right now, just that it is possible in the future. Castling kingside is legal right now, and indeed 8. O-O is White’s best move. Suppose however that White plays 8. Bg5 but after Black’s 8…Bb4+ White chickens out of the strong 9. c3 and meekly returns the Bishop by 9. Bd2 followed by the reply 9…Bc5. The position on the board is the same:

However, the FEN code (again minus the last two fields which do not determine the *position*) is now:

rnb1k2r/ppp1qppp/5n2/2bpP3/3p4/1B3N2/PPPB1PPP/RN1QK2R w KQkq -

The en-passant marker for the square d6 has disappeared, so these are different codes.

Now the 3-fold repetition rule in chess allows the side to move to claim a draw if the intended move will cause the third (or higher) occurrence of the resulting *position* in the game. When an en-passant move is legal on the first occurrence of the “board position” then it is a different *position*, so two more occurrences of the board setup do not allow the draw claim. Here, however, if White waffled again by 10. Bg5 Bb4+ 11. Bd2, Black really would be able to claim a draw with intent to play 11…Bc5. The en-passant move is not legal, so White’s immediate options were the same, and the set of options (including castling in the future) is what defines a *position*.
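An engine-side sketch of why code identity matters for repetition counting (the helper names are illustrative): if identical positions could always be trusted to get identical codes, a simple counter over codes would suffice.

```python
# A sketch of repetition detection keyed on position codes: count how many
# times each code occurs and flag the third. This only works if identical
# positions always get identical codes -- which is exactly what the lax
# en-passant field breaks, as the pair below shows.
from collections import Counter

def position_key(fen):
    # keep only the four fields that define the position
    return " ".join(fen.split()[:4])

seen = Counter()

def record(fen):
    seen[position_key(fen)] += 1
    return seen[position_key(fen)]

a = "rnb1k2r/ppp1qppp/5n2/2bpP3/3p4/1B3N2/PPPB1PPP/RN1QK2R w KQkq d6 0 8"
b = "rnb1k2r/ppp1qppp/5n2/2bpP3/3p4/1B3N2/PPPB1PPP/RN1QK2R w KQkq - 0 10"
record(a)
print(record(b))  # prints 1: the stray 'd6' marker makes the codes differ
```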

The issue is that the FEN codes do not reflect this. Computer chess programs want to detect not only 3-fold but even 2-fold repetitions to avoid time-wasting cycles in their search. It would be most convenient to tell this from identity of the FENs, without having to build the chess position and examine pins to check for legality. Unfortunately the FEN standard mandates inserts like the ‘d6’ even when there is no nearby pawn at all.

I have therefore adopted the stricter practice of Stockfish and some other chess programs by including the target square only when an en-passant capture is actually legal. Steven Edwards—the ‘E’ in ‘FEN’—advocated the same four years ago. Happily the difference does not matter to playing or analyzing chess games—so you can say it’s “compatible” for the end-user. It is however a departure at the API level. If an end user or another program submits a non-strict FEN then your program needs to convert it internally. This is painless since you are building the position from the FEN anyway. The second deviation is more consequential for users.
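For illustration, here is a partial sketch of normalizing a lax FEN toward the strict convention. It performs only the adjacency test; as the position above shows, a capture can be adjacent yet still illegal because of a pin, so a real implementation must still build the position and verify legality.

```python
# Partial sketch: drop the en-passant field unless an enemy pawn stands
# beside the pawn that just jumped. This adjacency test is NOT the full
# rule -- a pinned pawn (as in the post's example) still cannot capture --
# so true strictness requires checking legality on the board.
def normalize_ep(fen):
    board, stm, castling, ep = fen.split()[:4]
    if ep == "-":
        return fen
    rows = board.split("/")          # rows[0] is rank 8, rows[7] is rank 1

    def expand(row):                 # one rank of FEN -> list of 8 squares
        out = []
        for ch in row:
            out.extend([""] * int(ch) if ch.isdigit() else [ch])
        return out

    file = ord(ep[0]) - ord("a")
    # the jumped pawn sits on rank 5 (for '...6') or rank 4 (for '...3')
    pawn_rank = expand(rows[3] if ep[1] == "6" else rows[4])
    capturer = "P" if stm == "w" else "p"
    adjacent = any(0 <= f < 8 and pawn_rank[f] == capturer
                   for f in (file - 1, file + 1))
    if not adjacent:
        fields = fen.split()
        fields[3] = "-"
        return " ".join(fields)
    return fen
```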

A mark of Bobby Fischer’s genius apart from playing games is that while many champions have mused on tweaking the rules of chess, not just one but two of Fischer’s inventions have gained wide adoption. The greater was the Fischer clock, which metes out a portion of the allotted thinking time in *increments* with each move besides the lump amount at the start of a game session. This lessens the acuteness of *time pressure* and is now standard.

The second is Fischer Random chess, which is more often now called Chess960 for the 960 different possible starting positions of pieces on the back row, from which one is randomly selected at the start of a game. Variant setup rules had been proposed by many including the onetime champion José Capablanca and challenger David Bronstein, but in my mind the stroke that makes all of these “click” is Bobby’s generalized castling rule. Adding it to Bronstein’s original idea yields this.

This allows the king and rooks to start on any squares provided the king is between the rooks. The between-ness preserves the meaning of castling ‘queenside’ and ‘kingside’ and indeed the destination positions of the king and the rook used to castle are the same as for those moves in standard chess. The other conditions are the same as in standard chess: any squares between the king and the rook must be clear, neither the king nor that rook must previously have moved, and none of the squares traversed by the king may be attacked by the enemy—though this is OK for the rook. If the Chess960 position happens to be the standard starting one then the whole game rules are completely the same—this is a “conservative extension” of chess. What’s not conserved is the standard notation for castling: O-O and O-O-O in game scores, nor e1c1 or e1g1 (e8g8 and e8c8 for Black) in internal chess program code.

The first problem emerges when you think of a Chess960 position with Black’s king starting on f8 rather than e8, say with the rooks on b8 and h8, such as this one:

The notations O-O and O-O-O are still clear, but the corresponding internal notation f8g8 is ambiguous. It could be a normal King move without castling. This is solved by changing the internal notation to f8h8 in that case, figuratively “king takes own rook,” and f8b8 as the internal code for O-O-O instead. Many chess programs accept the “king takes rook” style from other programs even in standard chess, and there is no issue for the end user.

The second problem however is with the external notation when playing Chess960. It is subtle: Suppose Black’s rook on h8 moves to h6, moves later to a6 on the other side of the board, and then has occasion to retreat to its own back row on a8. The first rook move eliminated Black’s kingside castling right, so the original ‘kq’ part of the FEN would become just ‘q’. The problem is that the FEN does not preserve the game history, and at the moment of Black’s Ra8 move, it has forgotten which rook was originally resting on the queenside. If Black subsequently moves the other rook away from b8, or moves the rook on a8 yet a fourth time, how are we to tell whether the ‘q’ castling right persists?

Including the game history in the FEN is not an option—else we could also solve the repetition-count problem whose full headaches could make another post. Various “memoryless” solutions have been proposed, from which my choice is that of Stefan Meyer-Kahlen, designer of the Shredder chess program and co-creator of the standard UCI protocol for communicating with chess programs. A “Shredder-FEN” replaces the ‘KQkq’ by the files of the rooks, again using capitals for White. This eliminates the above ambiguity, as now the ‘q’ reads ‘b’ for the rook on b8.
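Converting the castling field itself is a one-line table lookup once the rook files are known; a sketch (the parameter names are mine):

```python
# A sketch of converting the standard 'KQkq' castling field to Shredder
# style, given each side's rook files. For standard chess the rooks start
# on the a- and h-files, so 'KQkq' becomes 'HAha'.
def to_shredder(castling, white_k_file="h", white_q_file="a",
                black_k_file="h", black_q_file="a"):
    if castling == "-":
        return "-"
    table = {"K": white_k_file.upper(), "Q": white_q_file.upper(),
             "k": black_k_file, "q": black_q_file}
    return "".join(table[c] for c in castling)

print(to_shredder("KQkq"))  # HAha
```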

I’ve chosen to use Shredder-FENs not just for Chess960 but also internally for standard chess, and my code—which I will say more about in a post shortly—also accepts standard FENs. Using both changes, my code stores the following as the FEN for the above diagram at White’s move 8 (including now the two last fields):

rnb1k2r/ppp1qppp/5n2/2bpP3/3p4/1B3N2/PPPB1PPP/RN1QK2R w HAha - 0 8

The ‘HAha’ is no laughing matter—it also simplifies updates when playing a move, partly because unlike ‘KQkq’ the letters [A-Ha-h] do not occur elsewhere in the FEN. My program can export standard FENs—even for Chess960 notwithstanding the ambiguity—but its API committally changes the definition of a FEN.

What are your favorite changes to definitions in mathematics and computing theory? I have not even gone into the myriad definitions of universal hash functions. When is it important to make a clean break with a previous standard, without preserving backward compatibility?

[fixed introduction which confused the Riemann integral with its upper sum; added mention of Bronstein+Fischer variant.]

**Oops**—I hit publish while logged in on the joint ‘Pip’ handle; the ‘I’ in this post is only I, KWR.
