A helpful analogy for the quantum computing debate?

Roman Smolensky was a complexity theorist—unfortunately “was” is the correct word here. He passed away in 1995, way too young, at the age of 35. His STOC 1987 paper “Algebraic methods in the theory of lower bounds for Boolean circuit complexity” is a classic, and one can only imagine what he might have done had he lived longer.

Today I, Ken, wish to explain a well known result of Smolensky’s about low-level complexity theory. Then I draw an analogy we hope may help in understanding mathematical issues underlying our current debate over scalable quantum computers.

Dick raises a principle of mathematics that is not well known, or not as well known as it should be: when it is hard to work with sets, replace them by distributions. Even though distributions are a more general notion, they have some nicer properties. Properties of distributions over qubits and their errors are at the heart of the debate, so we feel noting some properties of distributions on ordinary bits or vectors over a finite field, and raising some questions, may be a helpful stepping stone.

Smolensky considered bounded (or sub-logarithmic) depth circuits of unbounded fan-in ${\mathsf{AND}}$ and ${\mathsf{OR}}$ gates (with ${\mathsf{NOT}}$ gates at inputs), or equivalently circuits of ${\mathsf{NAND}}$ or ${\mathsf{NOR}}$ gates, together with unbounded fan-in gates telling whether the number of incoming 1’s is a multiple of a prime ${p}$. He proved that they cannot compute the mod-${q}$ function unless ${q}$ is a power of ${p}$, and indeed cannot even approximate it in a sense whose dependence on distributions we wish to bring out. Let us first turn to Smolensky’s work, in the particular case of parity, i.e., ${q = 2}$ and ${p}$ is an odd prime.

Approximating NANDs

As Smolensky’s paper acknowledges, the following lemma on approximating ${\mathsf{NAND}}$ by low-degree polynomials was also proved by Alexander Razborov for ${p = 2}$, and by David Mix Barrington for all ${p}$:

Lemma 1 For any distribution ${D}$ on ${\{-1,1\}^m}$, and ${k > 0}$, there exists a polynomial ${h(x_1,\dots,x_m)}$ of degree at most ${(p-1)k}$ over ${\mathbb{Z}_p}$ such that

$\displaystyle \Pr_{y \leftarrow D}[h(y) \neq \mathsf{NAND}(y)] \leq \frac{1}{p^k}.$

Proof: Consider polynomials of the form ${g = 1 - 2\cdot h^{p-1}}$ where

$\displaystyle h = c_1\frac{y_1 + 1}{2} + c_2\frac{y_2 + 1}{2} + \cdots + c_m\frac{y_m + 1}{2}$

with coefficients ${c_1,\dots,c_m \in \mathbb{Z}_p}$.

If ${y = (y_1,\dots,y_m)}$ makes ${\mathsf{NAND}(y)}$ false, then since ${-1}$ stands for true, ${h(y_1,\dots,y_m) = 0}$, and so ${g(y) = 1}$ which stands for false.

Otherwise, some ${y_i = 1}$, and thus ${h}$ has an additive term that is simply ${c_i}$ by itself. Whatever the value of the rest of ${h}$, there is exactly a ${(p-1)/p}$ probability that a uniformly random choice of coefficients including ${c_i}$ will produce an ${h}$ such that ${h(y) \neq 0}$. Raising that to the power ${p-1}$ produces ${1}$, and so ${h'(y) = -1}$ which stands for true.

Furthermore, using ${k}$-many sets of ${m}$-many coefficients ${c_{i,j}}$ to define ${h_1,\dots,h_k}$ in the manner of ${h}$, we define

$\displaystyle g' = (-1) + 2\prod_{j=1}^k (1 - h_j^{p-1})$

Then when ${y = (-1,\dots,-1)}$, all ${h_j(y) = 0}$ so ${g'(y) = 1}$ which stands for false, while for all other ${y \in \{-1,1\}^m}$, the chance of making some ${h_j(y)}$ nonzero, so that the whole product is zeroed out and ${g'(y) = -1}$, is ${\Delta = 1 - 1/p^k}$.

Thus for all ${y}$, the probability over the ${c_{i,j}}$ that the resulting ${g'(y)}$ agrees with ${\mathsf{NAND}(y)}$ on that ${y}$ is ${\Delta}$. Picture the ${y}$‘s as rows of a big table and the choices for all ${c_{i,j}}$ as columns. Each row has density ${\Delta}$ of agreements. It follows that regardless of the distribution ${D}$ on the rows, some column has density at least ${\Delta}$ of agreements when ${y}$ is chosen according to ${D}$. $\Box$

The reason for caring about arbitrary distributions on ${y}$ is that the ${y_i}$ will be outputs from ${\mathsf{NAND}}$ gates higher in the circuit, and those values might be correlated. Lecture notes by Alexis Maciel and David Mix Barrington (which I am using in a course) in fact represent the ${y_i}$ as functions ${f_i(X)}$ of a single variable ${X}$. Here ${X}$ is ultimately interpreted as the inputs ${x_1,\dots,x_n}$ to the circuit ${C}$, but it could be arbitrary.

Lower Bound for Parity

The point of the lemma is that if the circuit ${C}$ has small depth ${d}$, then composing the polynomials ${g'}$ obtained for each ${\mathsf{NAND}}$ gate ${g}$ yields a polynomial ${f_C}$ of degree at most ${\delta = k^d(p-1)^d}$ that agrees with ${C}$ on all but an ${s/p^k}$ proportion of inputs ${x_1,\dots,x_n}$, where ${s}$ is the number of gates in ${C}$. When ${d = o(log n)}$, there is some slack to choose ${k}$ non-constant but growing slowly enough with ${n}$ (for some fixed ${b > 0}$ used as below) to make:

$\displaystyle \begin{array}{rcl} \delta &=& o(\sqrt{n}),\\ k &=& \log_2 s + o(n^{1/bd}),\text{ and easily}\\ s/p^k &=& o(1) \end{array}$

where ${s}$ is the size of the circuit. This allows ${s}$ to be as big as ${2^{n^{1/bd}}}$, and this becomes the circuit size lower bound obtained by Smolensky’s argument.

Note that regardless of the distribution on the inputs, there always exists a setting of the ${skm}$-many coefficients ${c_{g,i,j}}$ over all the gates ${g}$ that achieves the bound ${s/p^k}$ on the error probability over that distribution of inputs. In particular, this works for the uniform distribution over the inputs ${(x_1,x_2,\dots,x_n)}$, but does not require it. Thus ${C}$ can be approximated by some polynomials ${r}$ of low degree ${r = o(\sqrt{n})}$.

This sets the stage for Smolensky’s famous “trick” which we give in the case of parity. Over the domain ${B = \{-1,1\}^n}$, parity is just the product ${x_1 x_2 \cdots x_n}$ of all the variables, and ${x_i^2 = 1}$. Thus parity is a polynomial ${\kappa}$ that is “complete” in the following sense defined in his paper:

For every function ${f: \mathbb{Z}_p^n \rightarrow \mathbb{Z}_p}$ there are polynomials ${g,h}$ both of degree at most ${n/2}$ such that

$\displaystyle (\forall x \in B)f(x) = \kappa(x)g(x) + h(x).$

The genius of this is that if ${r}$ agrees with ${\kappa}$ on some subset ${A \subseteq B}$ and has low degree ${\delta}$, then we can bound the size of ${A}$. Namely, every function ${f}$ when restricted further to ${A}$ can be written as a polynomial of degree at most ${\Delta = \delta + \frac{n}{2}}$. There are ${p^{|A|}}$ such functions, but only ${p^{(^n_i)}}$ ways to assign the coefficients of terms of degree ${i}$ for all ${i \leq \Delta}$. Taking logs base ${p}$ we get

$\displaystyle |A| \leq \sum_{i=0}^{\Delta} \left(\mbox{~}^n_i\mbox{\;}\right) \leq (\frac{1}{2} + o(1))2^n.$

To understand the right-hand inequality intuitively, note that the standard deviation of ${n}$ coinflips is ${\Theta{\sqrt{n}}}$, so having ${\Delta}$ be only an ${o(\sqrt{n})}$ deviation above ${\frac{n}{2}}$ makes the overall density of ${A}$ only vanishingly above one-half. Thus ${r}$ and ${\kappa}$ can agree on barely over half the inputs in ${\{-1,1\}^n}$, making ${\kappa}$—that is, parity—inapproximable by any small-depth, sub-exponential size circuits ${C}$. A tour-de-force.

Note that the agreement set ${A}$ is implicitly talking about correlation with the uniform distribution on ${B}$. We would be interested in treatments that consider non-independent distributions on ${B}$, and note the presentation of Smolensky’s theorem by Emanuele Viola in this paper in that regard.

The Analogy

The analogy is that the proof technique—for the Lemma—does not require the distribution over the inputs ${y}$ to be uniform. Indeed they can be joint functions of some “hidden variable” ${X}$, and be in that sense “entangled.” More concretely, they can be arbitrarily strongly correlated. The proof technique copes with this lack of entropy in ${y}$ by creating its own entropy in the ability to fit the coefficients ${c_{i,j}}$ when taking a low-degree polynomial that succeeds with high probability. Although the theorem is usually presented assuming uniform distribution over the inputs ${x_1,\dots,x_n}$, this is not actually needed for the Lemma.

The analogy with our quantum debate is made by asking, do fault-tolerance schemes work more like the Lemma, or do they really need independence? At the risk of straining what we are trying to say, we can draw out the analogy even further:

${\bullet}$ “Model of Noise”: A class of distributions ${D}$ on the inputs to the circuit ${C}$, or on any ${\mathsf{NAND}}$ gate of ${C}$.

${\bullet}$ “Correlated/Entangled Errors”: ${D}$ is the image distribution under mappings ${y_1 = f_1(X)}$, ${y_2 = f_2(X)}$, ${\dots}$, ${y_m = f_m(X)}$ of a distribution ${E}$ on a set ${X}$, where ${E}$ may have lower entropy/fewer degrees of freedom. Note that gates lower down in the circuit may have a greater degree of correlation resulting from the computation by gates above them.

${\bullet}$ “Reasonable Noise Model”: The distribution ${D}$ is not too complex, such as being ${\mathsf{P}}$-samplable, or is natural in some other sense. For instance, ${D}$ could be induced from some distribution ${D'}$ on co-ordinates ${(y_1,\dots,y_m,z_1,\dots,z_s)}$ by some operation on the ${z_i}$ co-ordinates, such as projection or “tracing out.”

${\bullet}$ “Feasible States”: The coefficients ${c_{i,j}}$ that achieve the desired approximations are easy to compute, for instance ${\mathsf{P}}$-uniform.

The main lack in this analogy is that the complexity-class side has no notion of “noise” per se or “errors” or “correction.” But we offer it nevertheless because we see some similarity in the mathematical issues. We also would like to ask whether the feasible-construction issues have been addressed as-such by complexity theorists.

We have a further reason for thinking along lines of distributions, a.k.a. measures, in a related context. A survey paper by Akshay Venkatesh of Stanford about work on the Littlewood conjecture has some interesting examples and declares straight out:

Measures are often easier to work with than sets.

In particular, on pages 12–13 he gives an example of a statement about sets that is “wishful thinking” for sets, but which he can “salvage” in terms of distributions. We intend to elaborate on this point soon.

Open Problems

What is the complexity of constructing the polynomials ${g}$ that approximate a ${\mathsf{NAND}}$ gate, or a circuit of ${\mathsf{NAND}}$ gates, in terms of the complexity of the distribution ${D}$ on their inputs? Under what conditions are the resulting polynomials computable jointly in uniform polynomial time? When are they succinct?

Why does Smolensky’s argument fail to give a super-polynomial size lower bound for constant-depth circuits with ${\mathsf{MOD}_{15}}$ gates that compute parity? Can we give a better, deeper answer than, “because ${\mathbb{Z}_{15}}$ is not a field”?

[fixed formula for g-prime]

1. March 11, 2012 8:37 am

Roman Smolensky was a postdoc at the Hebrew University of Jerusalem at the early 90s and it was always a pleasure to talk with him. We often met and chatted at the Belgium house (our faculty club). The Razborov-Smolensky method for lower bounds for bounded depth circuits with MOD gates is indeed very beautiful and important. Indeed, it is a major mystery what to do for the non prime power case. (There are also some important mysteries in the prime case as well.)

In the context of Mobius randomness that was mentioned in several recent posts, it is still a major open problem to show that the Mobius function is nearly orthogonal to functions that can be computed by bounded depth Boolean circuits with mod 2 gates. By Razborov-Smolensky THM, we need to show that the Mobius function is asymptotically orthogonal to low degree (degree polylog) polynomial over GF(2). The case of degree one polynomials is precisely the case of Walsh functions which is crucial for “polylog frequency” in Green’s proof for AC0-Mobius randomness, and was settled for general Walsh functions by Bourgain. But even degree-two polynomials over GF(2) represent a very difficult open problem which includes the (yet, unknown) Mobius randomness of several famous sequences arising in dynamics.

Regarding QFT, what is required for quantum fault tolerance is much weaker than independence.On the other hand, noise models that fail QFT can also be very simple. In the more detailed proposed models the noise, in some sense, “mimics” the computation itself. At some early times I was interested how the noise that interests me can arise from very simple circuits, but I did not persue this direction.

March 11, 2012 11:43 pm

Hello Professor Ken,

With respect to your introductory paragraph, I would like to ask a question.

I would like you to illustrate the principle when it is hard to work with sets, replace them by distributions by a few (possibly simple) examples. If possible, can I add in a request for you to include more than one example so that I can appreciate this principle better.

Thanks a lot for your time and please allow me to apologize for inconvenience if any.

• March 11, 2012 11:56 pm

Thanks for the interest—I’m limited only by finite-time. We do intend to make a later post on this, as stated in regard to the cited survey by Akshay Venkatesh at the end.

An example that’s maybe too simple to be on-target is that I originally conceived the chess anti-computer-cheating application (which you can find here and in papers here) as being one of distance between distributions. Three parties are represented as distributions to compare: the computer chess program, the player, and the model’s idealization of the player. The player is formally denoted by the set of played moves, but it was convenient to treat this as the ensemble of distributions giving 1 on the played move and 0 on the others, over all turns in the player’s games. (Later on I switched to a simpler frequentist model.)

My actual motive on the non-quantum side was the opposite: to ask how the Barrington/Razborov/Smolensky theorems might usefully be generalized to distributions over the inputs, with the technical obstacle that you may no longer have the simple binomial counting estimate that worked with the domain A of agreement treated as a set.