Newman’s theorem on rational approximations and complexity theory

Donald Newman was not a theorist, but was a mathematician who worked on many topics during his career. One of his results is a lovely theorem that shows that the approximation of continuous functions by rational functions can be very different from the approximation by polynomials.

Today I want to talk about this famous theorem, and discuss some applications to theory. If you do not know this theorem you may enjoy seeing it. Even if you do know it, you may enjoy seeing some thoughts I have on a potential application of this theorem.

I never had the honor of meeting Newman, but I just read a nice tribute to him, while preparing this, which contained a number of interesting stories. One explained how he once introduced the great Paul Erdös to a seminar audience in a “novel” way—you will have to read it yourself. Another was on Newman’s performance on the Putnam mathematics competition—he was clearly a very fast and powerful problem solver. I quote:

“Ordinarily, six or seven solutions are sufficient for a student to win the competition. In his freshman year at CCNY, Newman won the Putnam with nine correct solutions. As a sophomore, he again won the competition with an unprecedented perfect 12 solutions, a fete (sic) he repeated in his junior year. He didn’t bother taking the exam as a senior.”

I recall taking the Putnam exam as a freshman—I did okay—nothing like Newman. I did worse as a sophomore, even worse as a junior, and finally was thrown off the team as a senior. For me, the more mathematics I learned the worse I got. Oh well.

I will not discuss whether or not the “world is digital or not.” I think the last post generated a lot of fun comments—many at my expense. I make mistakes and perhaps that post was one. If so I apologize to you all; I hope that the many comments, while most disagreed with me, were interesting to you. I somehow did not get my point across. So I will fold up my tent on that problem for a bit—no pun intended—and move on to Newman’s classic theorem. Wait, I now see how to explain my point ${\dots}$ stop, I better not go there today.

Newman’s Theorem

It has long been known that any continuous function on the interval ${[-1,1]}$, or any compact interval will do, can be approximated arbitrarily well by a polynomial. There are many proofs of this fundamental fact, many refinements, and many generalizations.

The function ${|x|}$, the absolute value of ${x}$, while continuous on the interval ${[-1,1]}$ is not easy to approximate by a polynomial. To get to an error bound of roughly ${1/t}$ requires a polynomial of degree ${\Omega(t)}$. The reason, at least the intuition, is that the function ${|x|}$ has a sharp corner: at zero its derivative is undefined. This behavior makes it hard for a low degree polynomial to “fit” the absolute value function.

What Newman did in 1962—the paper was published in 1964—is to prove that one can do exponentially better with rational functions instead of polynomials in approximating the absolute value function. I am not an expert in approximation theory, but I believe that this result came as a surprise. See Herbert Stahl’s paper, for instance. Another surprise in mathematics.

Let’s now turn to Newman’s Theorem. We need some notation first. Let ${a = \exp(-1/{\sqrt n})}$ and ${p(x) = \Pi_{k=0}^{n-1}(a^{k}+x)}$. Then,

Theorem: For all ${x \in [-1,1]}$ and ${n \ge 5}$,

$\displaystyle \mid |x| - r_{n}(x) \mid \le 3\exp(-{\sqrt n})$

where ${r_{n}(x)}$ is the rational function,

$\displaystyle x \frac{p(x)-p(-x)}{p(x)+p(-x)}.$

The theorem is remarkable in a number of respects: First, it shows that the degree of the approximating rational function can be much smaller than the degree of any approximating polynomial. Previously, to get an error bound of order ${1/t}$ a polynomial had to have degree order ${t}$; now the degrees of the numerator and denominator are only order ${\log^{2} t}$. This is a huge decrease. Second, the rational function is not just constructive, but is expressible in a simple closed form. Very neat.

Both of these are wonderful for applications—as we will discuss in the next section.

Applications

I knew the Newman theorem back when I was a graduate student, yet I never could find an application of it to some theory problem. The theorem seemed to be perfect: great error bound and constructive. Yet I never found any “use” of his theorem.

It took a great insight from Richard Beigel, Nick Reingold, and Daniel Spielman to realize that Newman’s theorem could be used to solve an important open problem from complexity theory. They showed in a brilliant paper that the complexity class PP is closed under intersection and union. This is one of the great proofs in complexity theory. It is very sad that Reingold passed away about a year ago.

Later Mitsunori Ogihara used similar ideas to prove that the PL “hierarchy” collapses. There are now other uses of Newman’s theorem: for example, it is used in parts of learning theory. However, I will not attempt an exhaustive survey of its applications.

The PP result solved a well known open problem that had been raised as soon as PP was defined as a complexity class. Recall PP is the complexity class that contains sets that are accepted by a polynomial-time bounded probabilistic Turing machine that accepts with probability ${>0.5}$. It trivial that this class is closed under complements, but closure under intersection/union is not obvious at all.

The PP Proof

Here is a high level view of their proof. They need to detect the sign of an integer: the sign of ${-11}$ is ${-1}$ and the sign of ${123}$ is ${1}$. The problem is that they need to do this with a low degree polynomial, which is not obvious how to do it. Newman’s theorem comes to the rescue.

Suppose that there is a rational function ${r(x)}$ that approximates ${|x|}$ well, which of course there is by Newman’s theorem. Then, for ${x}$ not too near ${0}$, ${r(x)/x}$ must well approximate the sign of ${x}$. The problem is that their Turing machine can “compute” polynomials, but not rational functions—they cannot divide in their model. So they use the following cool, but trivial, trick:

$\displaystyle a/b > 0 \text{ if and only if } ab > 0.$

There is much more to their proof, but this is the key use of Newman’s theorem. They do need to modify his function to one that works over the integers in an exponential size range. Here are the polynomials they use; note, as is often the case in computer science not only degree but the size of coefficients is also important. In particular they use the following polynomials:

$\displaystyle \begin{array}{rcl} P_{n}(x) &=& (x-1)\displaystyle{\prod_{i=1}^{n}(x-2^{i})^{2}}, \\ S_{n}(x) &=& \frac{P_{n}(-x) - P_{n}(x)}{P_{n}(-x) + P_{n}(x)}, \\ A_{n}(x,y) &=& S_{n}(x) + S_{n}(y) -1. \end{array}$

Some of the key properties are:

1. The degree of ${A_{n}(x,y)}$ is ${O(n)}$;
2. Each coefficient of ${A_{n}(x,y)}$ is ${2^{O(n^{2})}}$;
3. If ${x,y}$ are integers in ${[1,2^{n}]}$, then ${A_{n}(x,y) > 0}$;
4. If ${x,y}$ are integers in ${[1,2^{n}]}$, then ${A_{n}(-x,y) < 0}$, ${A_{n}(x,-y) < 0}$, and ${A_{n}(-x,-y) < 0}$.

Matrices

I have several open problems that concern the behavior of matrices. I noticed that the following simple theorem can be proved using Newman’s theorem. This is likely to be known, but I will present the statement here with an outline of the proof.

Lemma: Suppose that ${\mid f(x) -g(x) \mid \le \delta }$ for all ${x}$ in the interval ${[-1,1]}$, where ${f(x)}$ and ${g(x)}$ are polynomials. Suppose also that ${A}$ is an ${n}$ by ${n}$ matrix with all its eigenvalues in ${[-1,1]}$. Then,

$\displaystyle \mid \mathsf{trace}(f(A)) - \mathsf{trace}(g(A)) \mid \le n\delta.$

Recall, ${\mathsf{trace}(A)}$ is the sum of the diagonal entries of the matrix ${A}$.

What is lemma says is: if an inequality holds for an interval that contains the eigenvalues of a matrix, then a related inequality holds for the traces of the matrices.

The proof in the case that ${A}$ is diagonalizable is fairly straightforward. Suppose that ${A = S^{-1}DS}$ where ${D}$ is a diagonal matrix. Then, the key insight is that

$\displaystyle \begin{array}{rcl} \mathsf{trace}(f(A)) &=& \mathsf{trace}(f(S^{-1}DS)) \\ &=& \mathsf{trace}(S^{-1}f(D)S) \\ &=& \mathsf{trace}(f(D)). \end{array}$

The last equality follows from a basic property of the trace of a matrix.

$\displaystyle \mathsf{trace}(f(A)) - \mathsf{trace}(g(A)) = \mathsf{trace}(f(D)) - \mathsf{trace}(g(D)).$

The lemma then follows by applying the inequality on ${f(x)}$ and ${g(x)}$ to each entry of the diagonal matrix ${D}$; the bound is clearly ${n}$ times ${\delta}$.

The proof in the case that ${A}$ is not diagonalizable follows by a trick that I believe is due to Richard Bellman. His insight is that given any ${A}$ there is a ${A^{(k)}}$ that is diagonalizable and is within ${1/k}$ of ${A}$. Then, one lets ${k}$ tend to ${\infty}$ and concludes the result by continuity. This trick, for example, gives a very short proof of the fact that any matrix satisfies its own characteristic polynomial. First prove this for diagonalizable matrices, which is simple, and then use the Bellman trick.

Theorem: Suppose that ${A}$ is an ${n}$ by ${n}$ matrix with all its eigenvalues in the interval ${[-1,1]}$, and with ${n \ge 5}$. Let ${r_{n}(x)}$ be the rational function from Newman’s theorem. Then,

$\displaystyle \mid \mathsf{trace}(r_{n}(A)) - S \mid \le 3n\exp(-{\sqrt n})$

where ${S}$ is the sum of the absolute values of the eigenvalues of ${A}$.

The proof follows from the lemma and Newman’s theorem using ${f = r_n}$ and the absolute value function for ${g}$.

Open Problems

An obvious open problem is to find other applications of Newman’s theorem. Can you use the matrix version? I had an idea to use the matrix version on a learning problem, but that has not worked out yet. So I will get back to that in the future.

October 8, 2009 9:09 am

On this topic, I can never resist pointing out the following:

The result Beigel-Reingold-Spielman needed for PP, namely a good rational approximation to the function sgn(x) on the range [-1, -epsilon] union [epsilon, 1] was solved *optimally* by Yegor Zolotarev in 1877 (!!). By optimally, I mean for each degree d and each epsilon, he explicitly gave *the* best degree-d rational function for approximating sgn(x) on [-1, -epsilon] union [epsilon, 1], in L_infinity distance! (Sadly, he was run over by a train the next year at age 31.)

I guess people weren’t much for studying asymptotics back then, so Zolotarev didn’t work them out. However Naum Akhiezer worked them out in 1929 (“On a problem of E. I. Zolotarev”). In particular, as Dick mentions it’s easy to deduce results for approximating |x| from results on approximating sgn(x), and indeed Akhiezer’s paper explicitly includes Newman’s Theorem, that the best degree-d approximation to |x| on [-1,1] has error exp(-Theta(sqrt(d))).

Newman was apparently unaware of Akhiezer’s paper.

October 8, 2009 11:14 am

The first time I saw an application of this approximation was in the dissertation work of Jim Hoover on feasible computation over the reals (journal version in [SICOMP vol 19, No. 1]) which predated BRS.

Ko and Freedman had defined a notion of feasible real computation by modifying Turing’s definition of computable real number as given by a TM that on input n spits out a rational approximation of the number within 2-n. Function computation is given by an oracle TM that takes a TM giving a real-number and produces an n-bit approximation. Ko and Freedman said that for feasibility the time to spit out an n-bit approximation had to be polynomial in n. They had some neat results such as the fact that there are very spiky feasible real functions whose integral computation is #P-hard to compute.

A problem with this definition is that the model doesn’t conform to what numerical analysts do for real computation. They just add, subtract, multiply, and divide (rarely) their inputs and test for approximate equality (never exact equality). Part of this was the motivation for Blum, Shub, and Smale to produce their real computation model. However, a major unrealistic aspect of the BSS model is that they included exact equality tests, something that numerical analysts never do. (In fact, in Turing’s model, tests for equality are uncomputable and a similar prohibition holds for the Ko and Freedman feasible version.)

Independent of BSS and around the same time, Jim Hoover came up with a model for real computation that really seems to capture a lot of how numerical analysts compute. The notion is that of a uniform polynomial-size arithmetic circuit family (+,-,*,/), one for each interval of the form [-2n,2n] that has one other restriction, namely that the value of any internal gate is such a circuit can be at most exp(nO(1)). It correctly computes f if the n-th circuit computes an n-bit approximation to f on the interval. He called these feasible-size-magnitude arithmetic circuits. (Note that these circuits might compute rational functions of exponential degree.)

Surprisingly, this turned out to be completely equivalent to the Ko-Freedman definition. Moreover, one could remove all division operations without loss of generality despite the potential for exponential degree – a case not covered by Strassen’s result. A key to the proof of this equivalence is the surprising quality of the rational approximation of absolute value which allows a feasible-size-magnitude arithmetic circuit to do a good job at approximating the extraction of bits from its input.

October 8, 2009 1:20 pm

Prof. Beame: that’s a very interesting model! I should read that paper. However, that circuit model doesn’t in fact consider approximate equality tests (which numerical people do use, as you said…) Is there an obvious argument that allowing single-step equality up to any fixed constant (say, a rational constant with polynomial bits?) doesn’t add power?

October 9, 2009 10:50 pm

Is there an obvious argument that allowing single-step equality up to any fixed constant (say, a rational constant with polynomial bits?) doesn’t add power?

It is a corollary of the equivalence with the Ko-Friedman model. Beyond that I am not sure.

October 8, 2009 11:18 am

WordPress seems to have stripped out my “sup” tags so my exponents don’t show up as such. They should be [-2^n,2^n] and exp(n^O(1)).

October 8, 2009 1:47 pm

There is a great quote that cannot find to the effect:

The credit goes to the person that discovers something the LAST.

I think that is what is going on here…dick

5. October 9, 2009 12:25 am

This reminds me a lot of the problem presented in the Factoring and Factorials post. I looked into the approximation I made more, and it looked like it needed n/logn terms of the Taylor series to get to n bits of precision, which is too many operations to factor in polynomial time. It turns out you can get the Taylor series by exponentiating the Stirling series, but it’s not much use if it takes n/logn terms.

However, if there’s a way to get n bits of precision with only (logn)^c terms in the numerator and denominator of a rational function, that’d solve the problem, and factoring could be done in polynomial time. I wonder how one would go about looking for such a formulation…

October 9, 2009 7:18 am

Interesting connection. I have no idea if will work, but like the idea.

• December 23, 2012 11:50 pm

I do not believe an approximation works however small the error may be. It has to be exact. consider +/- 10^-r error. When you use a modular circuit you need to rely on rational approximations.The denominator to the approximation may have an inverse as large as the number( xample N=pq) you have to modulo with. This blows up the small error in the real computation model. However it is not clear if the denominators can be controlled (easily) say by fine tuning the approximations.

• December 23, 2012 11:51 pm

I do not believe an approximation works however small the error may be. It has to be exact. consider +/- 10^-r error. When you use a modular circuit you need to rely on rational approximations.The denominator to the approximation may have an inverse as large as the number( example N=pq) you have to modulo with. This blows up the small error in the real computation model when brought into the realm of modular world and modular inverse are taken for the approximations. However it is not clear if the denominators can be controlled (easily) say by fine tuning the approximations.

• December 24, 2012 12:07 am

that blow up seems to happen or we need a lot of fractional bits to work with for all the intermediate steps (even if the final result is within 10^-r).

October 9, 2009 1:37 am

You probably know Scott Aaronson found a very cute proof of the Beigel-Reingold-Spielman theorem using quantum postselection: http://scottaaronson.com/papers/pp.pdf

October 9, 2009 7:15 am

Yes I do. Thanks for the pointer. Another example to add to the Quantum Method list.

June 30, 2010 5:54 pm

Hello.

What if, in a general theory of everything kind of way, some singular human conscious had used simple Yes(I should feel guilty) or No(I do not feel guilty) answers to answer every moral question that he could remember that had ever occurred in his life. In this way he could have become completely pure. He would have truly asked himself all the questions that could be asked of him. He would have emotionally evolved back to when he was a child.

What if he then assigned an ‘It doesn’t matter as life is only made to make mistakes’ label on every decision that he had analysed.

This would not make him God or the devil, but just very still and very exhausted. Anybody can do this but just for the purpose of this experiment lets say I have done this. Which I have.

There are no fears in me and if I died today I could deal with that because who can know me better than myself? Neither God or the Devil. I am consciously judging myself on ‘their’ level.

To make this work, despite my many faults, take ME as the ONLY universal constant. In this sense I have killed God and the Devil external to myself.The only place that they could exist is if I chose to believe they are internal.

This is obviously more a matter for a shrink more than a mathematician, but that would only be the case depending on what you believed the base operating system of the universe to be. Math / Physics / morals or some other concept.

As long I agreed to do NOTHING, to never move or think a thought, humanity would have something to peg all understanding on. Each person could send a moral choice and I would simply send back only one philosophy. ‘ LIFE IS ONLY FOR MAKING MISTAKES’.

People, for the purpose of this experiment could disbelief their belief system knowing they could go back to it at any time. It would give them an opportunity to unburden themselves to someone pure. A new Pandora’s box. Once everyone had finished I would simply format my drive and always leave adequate space for each individual to send any more thoughts that they thought were impure. People would then eventually end up with clear minds and have to be judged only on their physical efforts. Either good or bad. It would get rid of a lot of maybes which would speed lives along..

If we then assume that there are a finite(or at some point in the future, given a lot of hard work, there will be a finite amount) amount of topics that can be conceived of then we can realise that there will come to a time when we, as a people, will NOT have to work again.

Once we reach that point we will only have the option of doing the things we love or doing the things we hate as society will be completely catered for in every possible scenario. People will find their true path in life which will make them infinitely more happy, forever.

In this system there is no point in accounting for time in any context.

If we consider this to be The Grand Unified Theory then we can set the parameters as we please.

This will be our common goals for humanity. As before everyone would have their say. This could be a computer database that was completely updated in any given moment when a unique individual thought of a new idea / direction that may or may not have been thought before.

All that would be required is that every person on the planet have a mobile phone or access to the web and a self organising weighting algorithm biased on an initial set of parameters that someone has set to push the operation in a random direction.

As I’m speaking first I say we concentrate on GRAINE.

Genetics – Robotics – Artificial Intelligence – Nanotechnology and Zero Point Energy.

I have chosen these as I think the subjects will cross breed information(word of mouth first) at the present day optimum rate to get us to our finishing point, complete and finished mastery of all possible information.

Surely mastery of information in this way will lead us to the bottom of a fractal??? What if one end of the universes fractal was me??? What could we start to build with that concept???

As parameters, we can assume that we live only in 3 dimensions. We can place this box around The Milky Way galaxy as this gives us plenty of scope for all kinds of discoveries.

In this new system we can make time obsolete as it only making us contemplate our things that cannot be solved because up to now, no one has been willing to stand around for long enough. It has been holding us back.

All watches should be banned so that we find a natural rhythm with each other, those in our immediate area and then further afield.

An old clock watch in this system is should only be used when you think of an event that you need to go to. It is a compass, a modern day direction of thought.

A digital clock can be adapted to let you know where you are in relation to this new milky way boxed system.(x,y,z).

With these two types of clocks used in combination we can safely start taking small steps around the box by using the digital to see exactly where you are and then using the old watch to direct you as your thoughts come to you.

We can even build as many assign atomic clocks as are needed to individually track stars. Each and every star in the Milky Way galaxy.

I supposed a good idea would be to assume that I was inside the centre of the super-massive black hole at the centre of the galaxy. That would stop us from being so Earth centric.

We could even go as far as to say that we are each an individual star and that we project all our energies onto the super-massive black hole at the centre of the galaxy.

You can assume that I have stabilized the black hole into a non rotating perfect cube. 6 outputs /f aces in which we all see ‘the universe and earth, friends’ etc. This acts like a block hole mirror finish. Once you look it is very hard to look away from.

The 6 face cube should make the maths easier to run as you could construct the inner of the black hole with solid beams of condensed light(1 unit long) arranged into right angled triangles with set squares in for strength.

Some people would naturally say that if the universe is essentially unknowable as the more things we ‘discover’ the more things there are to combine with and therefore talk about. This is extremely fractal in both directions. There can be no true answers because there is no grounding point. Nothing for the people who are interested, to start building there own personal concepts, theories and designs on.

Is it possible that with just one man becoming conscious of a highly improbable possibility that all of universes fractals might collapse into one wave function that would answer all of humanities questions in a day? Could this be possible?

Answers to why we are here? What the point of life really is et al?

Is it possible that the insignificant possibility could become an escalating one that would at some point reach 100%???

Could it be at that point that the math would figure itself out so naturally that we would barely need to think about it. We would instantly understand Quantum theory and all.

Can anybody run the numbers on that probability?

8. December 23, 2012 11:44 pm

I have a log(r) arithmetic operations (no divisions) to approximate r! within $+/- O(10^-r)$ in the real computation model (unit multiplication, addition cost as in BSS model). It is fairly elementary. Is such a result known and can it be useful?

9. October 15, 2014 11:30 am

is there a multiple variable version of the theorem?