As used to solve a classic problem about distinguishing distributions

 Composite of src1, src2

Gregory Valiant and Paul Valiant are top researchers who are not unrelated to each other. Families like the Valiants and Blums could be a subject for another post—or how to distinguish them from those who are unrelated.

Today Ken and I wish to talk about a wonderful paper of theirs, “An Automatic Inequality Prover and Instance Optimal Identity Testing.”

I really like this paper, which I saw at FOCS 2014. I am less keen on the title—it suggests to me that they are working on some type of AI area. The phrase “Automatic Inequality Prover” sounds like an algorithm that takes in an inequality and outputs a proof that it is correct. Of course it does this only when the inequality is true. This is not what they do.

Ken chimes in: The paper has a neat mix of other ideas. There are natural cases of extremely succinct descriptions of strings that would take double exponential time to write out. It is notable that the formal system underlying their derivation of inequalities does not yield exponential complexity (let alone undecidability) but stays within linear programming—do ${\mathsf{P}}$-completeness results lurk? The instance-optimality notion is a hybrid of uniform and nonuniform lower bounds in a hybrid white-box/black-box setting.

## The Result on Distributions

They study the problem of telling one distribution from another. Suppose that ${p}$ and ${q}$ are two discrete distributions:

$\displaystyle p = (p_{1}, p_{2}, \dots, p_{n}) \text{ and } q = (q_{1}, q_{2}, \dots, q_{n}).$

The question is how many independent samples are needed to distinguish the two cases: (i) ${p=q}$ from (ii) ${p \neq q}$.

The answer is too many, since ${p}$ and ${q}$ could be very close. So they replace (ii) by the weaker condition that ${p}$ and ${q}$ are far apart in the ${L_{1}}$ metric,

$\displaystyle ||p - q||_{1} \ge \epsilon.$
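In code, the separation condition is just a sum of coordinate-wise gaps. Here is a minimal sketch of our own, not from the paper:

```python
# The L1 distance between two discrete distributions on the same support;
# case (ii) above asks that this be at least epsilon.
def l1_distance(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

uniform = [1/4] * 4
skewed = [1/2, 1/2, 0, 0]
print(l1_distance(uniform, skewed))  # 1.0
```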

Now they are able to give essentially optimal answers to the question. There is a twist on the notion of optimal. Given the distribution ${p}$ explicitly as an ${n}$-tuple, their algorithm is allowed to run within factors that depend on ${p}$. Their optimal bounds ${f(p,\epsilon)}$ do not even separate out as a product ${g(p)h(\epsilon)}$, though they give close upper bounds ${f'(p,\epsilon)}$ that do. In fact, they technically have the form ${f(p,\epsilon,\epsilon/16)}$ where the “${\epsilon/16}$” part governs an adjustment to the norm of ${p}$ for the upper bound. The lower bound is expressed by the statement:

Theorem 1 There is a global constant ${\gamma}$, ${0 < \gamma < 1}$, such that for all distributions ${p,}$ all sufficiently small ${\epsilon}$ (possibly depending on ${p}$), and every tester ${T}$ given white-box knowledge of ${p}$ but limited to samples ${R}$ of size ${\gamma\cdot f(p,\epsilon,\epsilon)}$ from an unknown distribution, there exists a distribution ${q}$ that is either ${p}$ or has ${||p - q||_{1} \ge \epsilon}$, such that the probability over ${R}$ of ${T(R)}$ giving the wrong answer is at least ${0.4}$.

We can “almost” jump ${q}$ ahead of the choice of ${T}$ in the quantifier order. Trivially we can’t: if ${T}$ is the “tester” that always says “distinct” then we need to choose ${q = p}$ to break it. But if ${T}$ avoids false positives when ${q = p}$, then what they build is a distribution ${Q_{\epsilon}}$ such that a random member of ${Q_{\epsilon}}$ impersonates ${p}$ well enough on small samples to make ${T}$ give a false negative. What’s neat is that no uniformity or complexity bound is needed on ${T}$—it’s a combinatorial argument directly on ${p}$ and the small sample size. The white-box nature of ${p}$ separates their bounds from the larger sample sizes a tester would need in order to learn ${p}$, if ${p}$ too were unknown.

See their paper for the details on ${f(p,\epsilon)}$, discussion of instance-optimality, and delicate issues such as not being able to define away the ${\epsilon/16}$ by adjusting ${\gamma}$. (In the paper there are constants ${c_1,c_2}$ rather than ${1,\gamma}$.) Their relaxed upper bound ${f'}$ is indeed simple:

$\displaystyle f'(p,\epsilon) = O(\frac{||p||_{2/3}}{\epsilon^2}) = O(\frac{\sqrt{n}}{\epsilon^2}).$

This improved a previous upper bound that had ${\epsilon^4}$ in the denominator and polylog(${n}$) factors in the numerator. That is quite a jump, and it comes from a beautifully tight piece of analysis.
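To get a feel for the bound, here is a quick sketch of ours computing the ${||p||_{2/3}}$ part of ${f'}$; the constant hidden in the ${O(\cdot)}$ is not given here. For the uniform distribution it reproduces the familiar ${\sqrt{n}}$, while a concentrated distribution gives a much smaller value:

```python
# The 2/3-quasinorm of a distribution p: (sum p_i^(2/3))^(3/2).
def norm_2_3(p):
    return sum(pi ** (2/3) for pi in p) ** (3/2)

n = 10000
uniform = [1/n] * n
print(norm_2_3(uniform))         # ~ sqrt(n) = 100 for the uniform distribution

spiky = [1.0] + [0.0] * (n - 1)  # all mass on one point
print(norm_2_3(spiky))           # 1.0 -- far fewer samples suffice here
```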

## Hairy and Basic Inequalities

What strikes us even more is the “Automatic ${\dots}$” part. In order to analyze their algorithm for deciding the above question on distributions, they are naturally led to consider some “hairy” inequalities—their term. I suspect that Greg and Paul are quite facile at handling almost any inequalities, so for them to say the inequalities were hairy means they were indeed complicated inequalities.

This leads to the neatest part of their paper, in our opinion. They give a complete characterization of a general class of inequalities that extend Cauchy-Schwarz, Hölder’s inequality, and the monotonicity of ${L_{p}}$ norms. Here ${p}$—the traditional letter, not the same as distributions ${p}$ above—means that the norm takes the sum ${S}$ of absolute ${p}$-th powers and outputs the positive branch of ${S^{1/p}}$. As they implicitly do, let’s switch to ${\lambda}$ in the norm context.

For example, their expression ${f(p,\epsilon)}$ uses the ${L_{2/3}}$ norm of the distribution ${p}$. The choice ${2/3}$ may seem strange, but for the uniform distribution ${u = (\frac{1}{n},\dots,\frac{1}{n})}$,

$\displaystyle ||u||_{\lambda} = (n(\frac{1}{n^{\lambda}}))^{1/{\lambda}} = (n^{1-{\lambda}})^{1/{\lambda}} = n^{\frac{1-{\lambda}}{{\lambda}}} = n^{1/2} \qquad\text{when}\qquad {\lambda} = \frac{2}{3},$

so it relates to long-known bounds involving ${\Omega(\sqrt{n})}$ samples for the uniform distribution. The inequalities in their analysis, however, involve other values of the norm power ${{\lambda}}$. For ${0 \leq {\lambda} \leq 1}$, the following inequality for non-negative numbers ${X_1,\dots,X_n}$ is equivalent to the norm being monotone decreasing in ${{\lambda}}$:

$\displaystyle \sum_{j=1}^n X_j^{\lambda} \geq\;\; \left(\sum_{j=1}^n X_j\right)^{\lambda}. \ \ \ \ \ (1)$

If we raise both sides to the power ${{\lambda}' = 1/{\lambda}}$ we see what happens for ${{\lambda}' > 1}$:

$\displaystyle \left(\sum_{j=1}^n X_j^{\lambda}\right)^{{\lambda}'} \geq\;\; \sum_{j=1}^n X_j,$

so with ${x_j = X_j^{\lambda}}$, that is ${X_j = x_j^{{\lambda}'}}$, we get the equivalent form

$\displaystyle \left(\sum_{j=1}^n x_j\right)^{{\lambda}'} \geq\;\; \sum_{j=1}^n x_j^{{\lambda}'}.$

Thus the original form (1) for ${{\lambda} \leq 1}$ captures all the needed information, while the form for ${{\lambda}' \geq 1}$ is a kind of dual. This also shows the idea of substituting powers ${x_j^{a}}$ for ${X_j}$.
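These manipulations are easy to sanity-check numerically. A small script of our own (a check on random inputs, not a proof):

```python
import random

# Check inequality (1) for lambda <= 1, and its dual for lambda' = 1/lambda >= 1,
# on random non-negative inputs.
random.seed(0)
X = [random.random() for _ in range(20)]

for lam in (0.2, 2/3, 0.9):
    # Form (1): sum of lambda-th powers dominates the lambda-th power of the sum.
    assert sum(x ** lam for x in X) >= sum(X) ** lam
    # Dual form: for lambda' > 1 the direction reverses.
    lam_prime = 1 / lam
    assert sum(X) ** lam_prime >= sum(x ** lam_prime for x in X)

print("monotonicity checks passed")
```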

When we have a second non-negative sequence ${Y_1,\dots,Y_n}$ we can give Otto Hölder’s generalization for ${0 \leq {\lambda} \leq 1}$ of Cauchy-Schwarz, which is the case ${{\lambda} = 1/2}$:

$\displaystyle \left(\sum_{j=1}^n X_j\right)^{\lambda} \left(\sum_{j=1}^n Y_j\right)^{1-{\lambda}} \geq\;\; \sum_{j=1}^n X_j^{\lambda} Y_j^{1-{\lambda}}. \ \ \ \ \ (2)$

We can obtain dual forms here too, and it helps to consider substitutions of the form ${X_j = x_j^a y_j^b}$ and similarly for ${Y_j}$. Doing so, and dividing out the right-hand sides, we obtain the basic forms for ${L_p}$ and Hölder with ${0 \leq \lambda \leq 1}$:

$\displaystyle \left(\sum_j x_j^{a\lambda} y_j^{b\lambda}\right)\left(\sum_j x_j^a y_j^b\right)^{-\lambda} \geq\;\; 1$

$\displaystyle \left(\sum_j x_j^{a'} y_j^{b'}\right)^{\lambda} \left(\sum_j x_j^{a''} y_j^{b''}\right)^{1-{\lambda}}\left(\sum_j x_j^{a'\lambda + a''(1-\lambda)}y_j^{b'\lambda + b''(1-\lambda)}\right)^{-1} \geq\;\; 1.$

## What They Prove

Let’s just state their main auxiliary result. Specifically, they determine for all length-${r}$ sequences ${(a_i)}$, ${(b_i)}$, and ${(c_i)}$ of rational numbers whether the inequality

$\displaystyle \prod_{i=1}^{r} \left(\sum_{j=1}^n x_{j}^{a_{i}}y_{j}^{b_{i}} \right)^{c_{i}} \ge\;\; 1 \ \ \ \ \ (3)$

holds for all ${n}$ and all non-negative real ${n}$-vectors ${(x_j)}$ and ${(y_j)}$. For example, taking ${r = 3}$, ${a = (1, 0, 1/2)}$, ${b = (0, 1, 1/2)}$, and ${c = (1/2, 1/2, -1)}$ gives us the Hölder inequality with ${\lambda = 1/2}$, that is, Cauchy-Schwarz.
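For concreteness, here is a small evaluator of ours for the left-hand side of (3). Instantiating it with ${a = (1,0,1/2)}$, ${b = (0,1,1/2)}$, ${c = (1/2,1/2,-1)}$ gives the Cauchy-Schwarz instance, so on non-negative inputs the value never drops below ${1}$:

```python
import random

# Evaluate the left-hand side of (3): the product over i of
# (sum_j x_j^{a_i} y_j^{b_i}) raised to the power c_i.
def lhs_of_3(a, b, c, x, y):
    prod = 1.0
    for ai, bi, ci in zip(a, b, c):
        s = sum(xj ** ai * yj ** bi for xj, yj in zip(x, y))
        prod *= s ** ci
    return prod

random.seed(1)
x = [random.random() for _ in range(10)]
y = [random.random() for _ in range(10)]

# Cauchy-Schwarz as an instance of (3); should always be >= 1.
val = lhs_of_3((1, 0, 0.5), (0, 1, 0.5), (0.5, 0.5, -1), x, y)
assert val >= 1
```

Equality holds when ${x}$ and ${y}$ are proportional, e.g. `lhs_of_3(..., x, x)` evaluates to ${1}$ up to rounding.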

Theorem 2 The inequality (3) holds if and only if the left-hand side can be expressed as a finite product of positive powers of instances of the left-hand sides of the basic ${L_p}$ and Hölder inequalities.

Moreover, there is an algorithm that in time polynomial in ${r}$ and the maximum bits in any ${a_i}$, ${b_i}$, or ${c_i}$ will output the terms of such a product if the inequality is true, or a description of a counterexample ${(n,\vec{x},\vec{y})}$ if it is false.

The proof uses variables standing for the logarithms of sums of the form ${\sum_j x_j^a y_j^b}$ and takes logs of the inequalities to make linear constraints in these variables with non-negative values. The key idea is that if a derived objective function has a nonzero optimum then its ${(x_j),(y_j)}$ argument (and a neighborhood of it) gives counterexamples, whereas if the optimum is ${0}$ then the dual program comes into play to yield the desired product. The bound ${n}$ on ${j}$ is not explicitly represented, but it comes out in counterexamples.
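To illustrate just the certificate format (the paper's algorithm finds the combination via linear programming; here we only verify a combination by hand, in our own toy encoding): an inequality of form (3) can be encoded as a map from exponent pairs ${(a_i,b_i)}$ to powers ${c_i}$, and multiplying inequalities adds these maps, which is what taking logs makes linear.

```python
from collections import defaultdict

# Sum of weight * coefficient-map, i.e. the product of the inequalities
# raised to the given positive powers; terms that cancel are dropped.
def combine(weighted_ineqs):
    total = defaultdict(float)
    for weight, ineq in weighted_ineqs:
        for term, c in ineq.items():
            total[term] += weight * c
    return {t: c for t, c in total.items() if abs(c) > 1e-12}

# Basic Hölder form, lambda = 1/2, on the exponent pairs (1,0) and (0,1):
holder = {(1, 0): 0.5, (0, 1): 0.5, (0.5, 0.5): -1.0}
# Basic L_p form, lambda = 1/2, in x alone:
lp = {(0.5, 0): 1.0, (1, 0): -0.5}

# A target inequality that is the product of the two; the certificate
# is weight 1 on each basic form, and the (1,0) terms cancel.
target = {(0, 1): 0.5, (0.5, 0.5): -1.0, (0.5, 0): 1.0}
assert combine([(1.0, holder), (1.0, lp)]) == target
```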

The process is so nicely effective that they are able to represent it by a human-friendly game on grid-points in the plane that is somewhat reminiscent of (and easier than) the solitaire peg-jumping game—well worth a look. They say:

Our characterization is of a non-traditional nature in that it uses linear programming to compute a derivation that may otherwise have to be sought through trial and error, by hand. We do not believe such a characterization has appeared in the literature, and hope its computational nature will be useful to others, and facilitate analyses like the one here.

In particular, it helped them with inequalities involving that “hairy” number ${2/3}$ which they used to prove their main results about sampling.

## Why Isn’t it Exponential?

The short answer to why their algorithm runs in polynomial time is that it involves linear programming. There isn’t a tight analogy between their game moves and basic pivot moves of the simplex algorithm, however, because there are cases where the latter takes exponential time. It is unclear to us whether they are using a subcase of linear programming that stops short of being ${\mathsf{P}}$-complete, or whether this will lead to new ${\mathsf{P}}$-complete problems involving inequalities. They do have an even more general form with (${d+1}$)-many ${r}$-sequences and summands ${x_{j,k}^{(\cdot)}}$ for ${k = 1}$ to ${d}$ for any ${d \geq 2}$, in place of the elements ${(a_i),(b_i),(c_i)}$ and ${(x_j,y_j)}$ for ${d = 2}$.

Most notable is that the minimum dimension ${n}$ of a counterexample can be doubly exponential in the bit-size of the arguments, and yet the algorithm can still describe it. This happens even for ${d=1}$, as they note. Consider the inequalities

$\displaystyle g_{\epsilon}(\vec{x}) = \left(\sum_j x_{j}^{-2}\right)^{-1} \left(\sum_j x_{j}^{-1}\right)^{3} \left(n\right)^{-2-\epsilon} \left(\sum_j x_{j}^{+1}\right)^{3}\left(\sum_j x_{j}^{+2}\right)^{-1} \geq\;\; 1.$

The middle factor is legal since ${n = \sum_j x_j^0}$. This is true for ${\epsilon = 0}$ but false for any ${\epsilon > 0}$, with the smallest counterexample dimension ${n}$ growing exponentially in ${E = 1/\epsilon}$. Yet counterexamples have succinct descriptions—as they note, taking ${n' = 64^E}$ makes it false with

$\displaystyle \vec{x} = (n',\frac{1}{n'},1,1,\dots,1).$

There are ${n'}$-many trailing ${1}$s, and they make ${n = n'+2}$ large enough for the middle ${n^{-2-\epsilon}}$ term to overpower the rest. Put another way, the domain formed by their constraints is regular enough that extrema and specific points in their neighborhoods have short descriptions.
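One can check this numerically for a concrete ${\epsilon}$. Here, in our own code, we use the assumed values ${\epsilon = 1/2}$, so ${E = 2}$ and ${n' = 64^2 = 4096}$ (a check of the construction, not part of the paper):

```python
# Evaluate g_eps at a vector x; the five factors follow the display above,
# with n = len(x) supplying the middle n^(-2-eps) factor.
def g(eps, x):
    n = len(x)
    val = sum(xj ** -2 for xj in x) ** -1
    val *= sum(xj ** -1 for xj in x) ** 3
    val *= n ** (-2 - eps)
    val *= sum(x) ** 3
    val *= sum(xj ** 2 for xj in x) ** -1
    return val

n_prime = 64 ** 2
x = [n_prime, 1 / n_prime] + [1.0] * n_prime  # (n', 1/n', 1, ..., 1)

print(g(0.5, x))  # dips below 1: the inequality fails for eps = 1/2
print(g(0.0, x))  # well above 1, consistent with the inequality at eps = 0
```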

We close with a speculation about duality and dimension. The ${L_p}$ inequalities become equalities only when ${n = 1}$. The Hölder inequalities become equalities only when ${\vec{X}}$ is proportional to ${\vec{Y}}$—that is, they are the same projective point. We wonder if their algorithm can point the way to a useful tradeoff between homogeneity and dimension, which might help tame cases where the dimension is large.

## Open Problems

Can you find further applications for their result on inequalities? Can the class of inequalities that can be given an “automatic” treatment be extended further? For one instance we mention the Chebyshev sum inequality

$\displaystyle \sum_k a_k b_k \geq \frac{1}{n} \left(\sum_k a_k\right)\left(\sum_k b_k\right),$

which depends critically on the monotone orderings ${a_1 \leq a_2 \leq \dots \leq a_n}$ and ${b_1 \leq b_2 \leq \dots \leq b_n}$.
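A quick numeric check of ours shows both the inequality and the essential role of the orderings:

```python
import random

# Chebyshev sum inequality: when a and b are sorted the same way,
# sum a_k b_k >= (1/n) (sum a_k)(sum b_k).
random.seed(2)
n = 50
a = sorted(random.random() for _ in range(n))
b = sorted(random.random() for _ in range(n))

lhs = sum(ak * bk for ak, bk in zip(a, b))
rhs = sum(a) * sum(b) / n
assert lhs >= rhs

# Reversing one sequence flips the correlation, and the inequality reverses.
b_rev = list(reversed(b))
assert sum(ak * bk for ak, bk in zip(a, b_rev)) <= rhs
```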