Danger of hidden constants

Godfrey Harold Hardy, one of the greatest British mathematicians of the last century, wrote hundreds of research papers, and the wonderful “An Introduction to the Theory of Numbers” with E. M. Wright. Many of us first learned the O-notation from their book. The notation appears in almost all theory papers; many attribute it, incorrectly, to Edmund Landau, who popularized it but did not invent it (Paul Bachmann introduced it in 1894). In any event, we use O-notation all the time. It’s O-this and O-that.

## Use and Misuse of Big O-Notation

The problem with O-notation is simple. Suppose that we prove that some algorithm $A$ runs in time $O(n\log n)$. Is this better or worse than another algorithm $B$ that runs in time $O(n^{2})$? It depends. In the long run, clearly, algorithm $A$ will be faster. The economist John Maynard Keynes once said: “In the long run, we’re all dead.” So the fact that $A$ is eventually better may or may not matter.

The key, of course, is the hidden constant in the O-notation. If algorithm $A$ runs in time $2^{100}n \log n$ and algorithm $B$ runs in time $10n^{2}$, then for all practical cases I would rather run $B$. This is the problem. What if we prove that there is an algorithm for SAT that runs in time $O(n^{2})$? This may or may not have any practical significance. If the hidden constant is large, then it will be of no practical import. Of course, it would be a great mathematical result: you would have shown that P=NP. You would win awards and prizes, and be famous. But the practical value would be nil.
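The crossover for the constants above can be computed directly. The following sketch (using the illustrative constants from this paragraph) doubles $n$ until algorithm $A$ finally wins:

```python
import math

def crossover(f, g, hi=2 ** 400):
    """Return the first power of two n with f(n) < g(n), or None.
    (A sketch: assumes that once f drops below g it stays below.)"""
    n = 2
    while n < hi:
        if f(n) < g(n):
            return n
        n *= 2
    return None

f_A = lambda n: 2 ** 100 * n * math.log2(n)  # algorithm A, huge constant
f_B = lambda n: 10 * n * n                   # algorithm B, modest constant

print(crossover(f_A, f_B))  # 2**104, i.e. about 2 * 10**31
```

So $A$ only overtakes $B$ around $n \approx 2^{104}$, far beyond any input anyone will ever run it on.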

The flip side of this is also true. If you prove that any algorithm for SAT requires $\Omega(n^{100})$ time, then again the hidden constant matters. If the hidden constant is tiny, the bound only bites for enormous $n$, so the result will not show that SAT is really hard in practice. Again you will win awards and prizes, and be famous, but the practical value would be nil.

The point is that these are not idle concerns; they have already happened. Many theory bounds have constants that make the asymptotic results irrelevant in practice. For example, the famous Robertson-Seymour theory yields an $O(n^{3})$ algorithm for recognizing any minor-closed graph property. However, the hidden constant in this algorithm is huge, really really huge. Thus, while the result is a major achievement, it does not lead to a practical algorithm.

The same problem arises often in lower bounds for approximation algorithms. These proofs are often based on complex methods, and I believe that when they are completely unwound, the hidden constants will be immense. What does it mean to prove that no algorithm can beat a $c\log n$ approximation ratio if the constant $c$ is extremely tiny? While these results, like the Robertson-Seymour work, are beautiful, they could be misleading. If $n$ has to be $2^{2^{100}}$ for the bound to hold, then what are we really saying about the difficulty of approximating the problem?

## Open Problems

The main open question that theory will have to address one day is to replace O-notation by some more exact model. We will at some point have to be able to compare algorithms in a way that actually reflects their true performance.

One concrete challenge is to “unwind” some of the known algorithms and lower bounds, and discover their actual constants. For example, how realistic are the current lower bounds that are based on PCPs? I wonder if they are realistic at all: could they require that the size $n$ of the problem be enormous? This in no way takes away from their beauty, but it does take away from the practical impact. If the constants are huge, then we have an interesting paradox: we would have lower bounds for huge values of $n$; however, for realistic values of $n$ there could be fast algorithms. Strange, but possible.
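To see how vacuous a bound with a tiny constant can be, suppose an unwound lower bound turned out to say only that some problem needs at least $\epsilon n^{2}$ steps, with a made-up value of $\epsilon = 2^{-60}$. A few lines of arithmetic show where the bound even begins to say anything:

```python
import math

eps = 2.0 ** -60   # hypothetical hidden constant in a lower bound
budget = 10 ** 9   # roughly one second of simple machine operations

# Smallest n at which eps * n^2 even exceeds one second of work.
n0 = math.isqrt(int(budget / eps)) + 1
print(n0)  # about 3.4 * 10**13: below this, the bound rules nothing out
```

Below $n_0$ the “lower bound” is compatible with an algorithm that finishes in under a second.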

7 Comments
1. February 13, 2009 3:16 pm

While you say that mathematics will one day have to replace the O-notation to reduce the abuse, sometimes we see math/CS going in the wrong direction. The $\tilde{O}$ notation is used a lot now, and heavily abused too! The original intention of $\tilde{O}$ was probably to hide polylog factors, e.g., writing $O(n^2\log n)$ as $\tilde{O}(n^2)$. Now, however, I have seen people use even $\tilde{O}(1)$ to represent $O(\log^2 n)$ — trying to indicate that the problem can be solved in (almost) constant time…
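The commenter’s point is easy to make concrete: here is the “constant” that $\tilde{O}(1) = O(\log^{2} n)$ hides at realistic input sizes (a quick sketch):

```python
import math

# The factor that tilde-O(1) = O(log^2 n) actually hides as n grows.
for n in [10**3, 10**6, 10**9, 10**18]:
    print(f"n = {n:.0e}: log2(n)^2 ~ {math.log2(n)**2:.0f}")
```

At $n = 10^{6}$ the hidden factor is already around 400, hardly a constant in any practical sense.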

2. February 13, 2009 3:23 pm

Thanks for the comment. I fixed the latex so it will format okay, but did not change anything else.

The trick is that “$your math stuff$” must be written as “$latex your math stuff$” for it to render.

3. February 15, 2009 3:33 am

I absolutely agree that this is a major issue in computer science. There tend to be two extremes in papers, with little in between: one extreme is the purely theoretical worst-case big-O bounds, as you discuss here, and the other is a very limited set of benchmark times.

In the first case (worst-case big-O), not only are the results often of little practical importance, but in order to make them seem more important than they are, if any performance testing is done, the authors will often grossly falsify performance results. They usually get away with this by making horrible implementations of the existing algorithms, so that they can claim that when n > 100 their algorithm wins. To me this should be treated as academic fraud, but it usually just gets overlooked, because so few theoretical algorithms researchers can recognize poorly-written code. As an example, the current claim on Wikipedia about Strassen’s matrix multiplication algorithm is that an optimized version can beat brute force for matrices of size 32 to 128. However, that must be with the worst implementation of brute-force matrix multiplication in the world, not even counting that brute force is easier to vectorize and parallelize, and can be done in a numerically stable way. The real number is more like the n = 60,000 mentioned afterwards, at which point numerical instability has already destroyed the answer if you’re using Strassen’s.
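For readers who have not seen it, the comparison in question looks roughly like the following sketch of Strassen’s recursion (assuming square matrices whose size is a power of two; the cutoff below which we fall back to the ordinary product is a tunable, machine-dependent choice, and the base case itself uses the optimized library multiply — exactly the “which implementation are you comparing against?” issue raised above):

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen multiplication, sketch only: square matrices,
    power-of-two sizes, untuned cutoff."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B  # the "brute force" being compared against
    m = n // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    # Seven recursive products instead of eight.
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:m, :m] = M1 + M4 - M5 + M7
    C[:m, m:] = M3 + M5
    C[m:, :m] = M2 + M4
    C[m:, m:] = M1 - M2 + M3 + M6
    return C
```

The many temporary additions and subtractions are precisely where the hidden constant (and the extra floating-point error) lives; where the crossover against a well-tuned classical multiply falls depends entirely on the machine and library.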

In the second case (limited benchmarks), I’ve seen so much abuse that it’s depressing. What people tend to do is make algorithms that are custom-tuned to perform well on the benchmark problems and do horribly on all other problems. It’s often so easy to do this that the results mean nothing. One algorithm I found, implemented, and analysed literally had a special parameter that was set to predetermined values depending on which benchmark problem it was to be given. If you gave it some other problem and guessed on what value the parameter should be, it’d usually take hundreds to thousands of times longer.

What we need is something in between that addresses the limitations of both; something like having a large, systematic series of benchmarks that show some distribution of execution time (and accuracy for approximations) as the problem size or other parameters change. Then the real scaling can be seen from plots of the execution times. I’ll probably make a system for automating this type of thing eventually, but I’m pretty busy with other projects before I get there.
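A minimal version of what the commenter proposes is easy to sketch: time an implementation at several sizes and read the empirical growth exponent off a log-log fit (all names and sizes below are illustrative, and the fitted exponent of course bakes in constants, caches, and overheads rather than the pure asymptotic rate):

```python
import time
import numpy as np

def timeit_once(f, n):
    t0 = time.perf_counter()
    f(n)
    return time.perf_counter() - t0

def measured_exponent(f, sizes, reps=3):
    """Fit t ~ c * n^k on a log-log scale and return k, a rough
    empirical stand-in for the growth rate, constants included."""
    best = [min(timeit_once(f, n) for _ in range(reps)) for n in sizes]
    k, _ = np.polyfit(np.log(sizes), np.log(best), 1)
    return k

# Example: sorting should come out near exponent 1 (n log n dominated by n).
rng = np.random.default_rng(0)
print(measured_exponent(lambda n: np.sort(rng.random(n)),
                        [10**4, 10**5, 10**6]))
```

Plotting the raw points rather than just the fitted slope is what reveals crossovers and cache cliffs.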

4. Janos Simon permalink
February 27, 2009 6:32 pm

“Large systematic benchmarks” for these kinds of algorithms are not easy to generate.

For one thing, the standard model is that of a processor with random-access memory: uniform cost for each access. You propose to run well-written code on a real computer, but your real computer is likely to have several processors and something like two levels of caches. Plots of execution times for matrix multiplication of 1,000×1,000 matrices may reveal more about the pattern of cache misses (perhaps induced by quirks of your compiler) than about the algorithm itself.
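The gap between the uniform-cost model and real memory is easy to observe. In the sketch below, summing every sixteenth element of a large array does one sixteenth of the “operations” the RAM model counts, yet typically takes far more than one sixteenth of the time, because memory moves in whole cache lines (the sizes and the magnitude of the effect are machine-dependent; this is illustrative only):

```python
import time
import numpy as np

x = np.random.rand(2**24)  # 128 MB of doubles, far larger than cache

def best_time(f, reps=5):
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        f()
        best = min(best, time.perf_counter() - t0)
    return best

full = best_time(lambda: x.sum())           # reads every element
strided = best_time(lambda: x[::16].sum())  # 1/16 of the "operations"
print(f"full: {full:.4f}s  strided (1/16 of the work): {strided:.4f}s")
```

This is exactly the sort of effect that shows up in benchmark plots and that a uniform-cost operation count cannot predict.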

5. Marcus Ritt permalink
February 27, 2009 8:17 pm

Hutter (“The Fastest and Shortest Algorithm for All Well-Defined Problems”) took the purely theoretical worst-case bound to an extreme: save for a factor of five and additive constants, a single algorithm solves all well-defined problems optimally.