Taming Some Inequalities
As used to solve a classic problem about distinguishing distributions
Composite of src1, src2
Gregory Valiant and Paul Valiant are top researchers who are not unrelated to each other. Families like the Valiants and Blums could be a subject for another post—or how to distinguish from those who are unrelated.
Today Ken and I wish to talk about a wonderful paper of theirs, “An Automatic Inequality Prover and Instance Optimal Identity Testing.”
I really like this paper, which I saw at FOCS 2014. I am less keen on the title—it suggests to me that they are working on some type of AI area. The phrase “Automatic Inequality Prover” sounds like an algorithm that takes in an inequality and outputs a proof that it is correct. Of course it does this only when the inequality is true. This is not what they do.
Ken chimes in: The paper has a neat mix of other ideas. There are natural cases of extremely succinct descriptions of strings that would take double exponential time to write out. It is notable that the formal system underlying their derivation of inequalities does not yield exponential complexity (let along undecidability) but stays within linear programming—do -completeness results lurk? The instance-optimality notion is a hybrid of uniform and nonuniform lower bounds in a hybrid white-box/black-box setting.
The Result on Distributions
They study the problem of telling one distribution from another. Suppose that and are two discrete distributions:
The question is how many independent samples are needed to distinguish the two cases: (i) from (ii) .
The answer is too many, since and could be very close. So they replace (ii) by the weaker condition that and are far apart in the metric,
Now they are able to give essentially optimal answers to the question. There is a twist on the notion of optimal. Given the distribution explicitly as an -tuple, their algorithm is allowed to run within factors that depend on . Their optimal bounds do not even separate out as a product , though they give close upper bounds that do. In fact, they technically have the form where the “” part governs an adjustment to the norm of for the upper bound. The lower bound is expressed by the statement:
Theorem 1 There is a global constant , , such that for all distributions all sufficiently small (possibly depending on ), and every tester given white-box knowledge of but limited to samples of size from the unknown distribution there exists a distribution that is either or has , such that the probability over of giving the wrong answer is at least .
We can “almost” jump ahead of the choice of in the quantifier order. Trivially we can’t: if is the “tester” that always says “distinct” then we need to choose to break it. But if avoids false positives when , then what they build is a distribution such that a random member of impersonates well enough on small samples to make give a false negative. What’s neat is that no uniformity or complexity bound is needed on —it’s a combinatorial argument directly on and the small sample size. The white-box nature of separates their bounds from the higher samples needed by a tester to learn , if too were unknown.
See their paper for the details on , discussion of instance-optimality, and delicate issues such as not being able to define-away the by adjusting . (In the paper there are constants rather than .) Their relaxed upper bound is indeed simple:
This improved a previous upper bound that had in the denominator and polylog() factors in the numerator. That is quite a jump from a beautifully tight work of analysis.
Hairy and Basic Inequalities
What strikes us even more is the “Automatic ” part. In order to analyze their algorithm for deciding the above question on distributions, they are naturally led to consider some “hairy” inequalities—their term. I suspect that Greg and Paul are quite facile at handling almost any inequalities, so for them to say the inequalities were hairy means they were indeed complicated inequalities.
This leads to the neatest part of their paper, in our opinion. They give a complete characterization of a general class of inequalities that extend Cauchy-Schwarz, Hölder’s inequality, and the monotonicity of norms. Here —the traditional letter, not the same as distributions above—means that the norm takes the sum of absolute -th powers and outputs the positive branch of . As they implicitly do, let’s switch to in the norm context.
For example, their expression uses the norm of the distribution . The choice may seem strange, but for the uniform distribution ,
so it relates to long-known bounds involving samples for uniform distribution. The inequalities in their analysis, however, involve other values of the norm power . For , the following inequality for non-negative numbers is equivalent to the norm being monotone decreasing in :
If we raise both sides to the power we see what happens for :
so with , that is , we get the equivalent form
Thus the original form (1) for captures all the needed information, while the form for is a kind of dual. This also shows the idea of substituting powers for .
When we have a second non-negative sequence we can give Otto Hölder’s generalization for of Cauchy-Schwarz, which is the case :
We can obtain dual forms here too, and it helps to consider substitutions of the form and similarly for . Doing so, and dividing out the right-hand sides, we obtain the basic forms for and Hölder with :
What They Prove
Let’s just state their main auxiliary result. Specifically, they determine for all length- sequences , , and of rational numbers whether the inequality
holds for all and all non-negative real -vectors and . For example, taking , , , and gives us the Hölder inequality with , that is, Cauchy-Schwartz.
Theorem 2 The inequality (3) holds if and only if the left-hand side can be expressed as a finite product of positive powers of instances of the left-hand sides of the basic and Hölder inequalities.
Moreover, there is an algorithm that in time polynomial in and the maximum bits in any , , or will output the terms of such a product if the inequality is true, or a description of a counterexample if it is false.
The proof uses variables standing for the logarithms of sums of the form and takes logs of the inequalities to make linear constraints in these variables with non-negative values. The key idea is that if a derived objective function has a nonzero optimum then its argument (and a neighborhood of it) gives counterexamples, whereas if the optimum is then the dual program comes into play to yield the desired product. The bound on is not explicitly represented, but it comes out in counterexamples.
The process is so nicely effective that they are able to represent it by a human-friendly game on grid-points in the plane that is somewhat reminiscent of (and easier than) the solitaire peg-jumping game—well worth a look. They say:
Our characterization is of a non-traditional nature in that it uses linear programming to compute a derivation that may otherwise have to be sought through trial and error, by hand. We do not believe such a characterization has appeared in the literature, and hope its computational nature will be useful to others, and facilitate analyses like the one here.
In particular, it helped them with inequalities involving that “hairy” number which they used to prove their main results about sampling.
Why Isn’t it Exponential?
The short answer to why their algorithm runs in polynomial time is that it involves linear programming. There isn’t a tight analogy between their game moves and basic pivot moves of the simplex algorithm, however, because there are cases where the latter takes exponential time. It is unclear to us whether they are using a subcase of linear programming that stops short of being P-complete, or whether this will lead to new P-complete problems involving inequalities. They do have an even more general form with ()-many -sequences and summands for to for any , in place of the elements and for .
Most notable is that the minimum dimension of a counterexample can be doubly exponential in the bit-size of the arguments, and yet the algorithm can still describe it. This happens even for as the note: Consider the inequalities
The middle factor is legal since . This is true for but false for any with growing exponentially in . Yet counterexamples have succinct descriptions—as they note, taking makes it false with
There are -many trailing s, and they make large enough for the middle term to overpower the rest. Put another way, the domain formed by their constraints is regular enough that extrema and specific points in their neighborhoods have short descriptions.
We close with a speculation about duality and dimension. The inequalities become equalities only when . The Hölder inequalities become equalities only when is proportional to —that is, they are the same projective point. We wonder if their algorithm can point the way to a useful tradeoff between homogeneity and dimension, which might help tame cases where the dimension is large.
Can you find further applications for their result on inequalities? Can the class of inequalities that can be given an “automatic” treatment be extended further? For one instance we mention the Chebychev sum inequality
which depends critically on the monotone orderings and .