Jeff Naughton is a top expert on databases. He was at Princeton until he was stolen away by the computer science department at the University of Wisconsin–Madison. Jeff has a great sense of humor, is a terrific colleague, and is a great researcher.

Today, I want to talk about two projects that Jeff and I did together. The first is a randomized algorithm that uses an adaptive sampling method to solve a database problem. I have always liked this result, because so few randomized algorithms that use sampling need to be adaptive. The second is a “hole” in the worst-case model of complexity theory. The hole makes certain classic theorems of complexity theory “wrong.” That’s right.

I enjoyed working with Jeff on both of these projects. The first, on sampling, was submitted to SIGMOD, was accepted there, and then received the best paper award. I am, therefore, 1/1/1 (submit/accept/award) at SIGMOD: I have never submitted another paper to that conference, nor will I ever. The conference was fun too, partly because it was held in Atlantic City at a hotel-casino, and partly because it had a magic show as the banquet entertainment.

Magic is somehow related to theory and mathematics. For example, Steven Rudich is a terrific amateur magician, and Grzegorz Rozenberg, a top language theorist, is also a professional magician. Although with him, please call them “illusions,” never “tricks.”

One thing I have learned after years of watching magic and reading about it is the principle of least work. In any magic trick (excuse me, illusion) there are usually extra constraints placed on the magician or their assistants. A classic example is the rug Harry Houdini used to cover the stage, so you could be sure there were no trapdoors. The principle of least work implies that this constraint, the rug, must be required to make the illusion work: it is what hides the trapdoor.

I sometimes think that the key to solving a hard problem is to apply this principle. Can an obstacle be turned into the key that yields the solution to a problem? Can we use the principle of least work in our quest for results?

Transitive Closure Estimation

The first question was how to estimate the size of a database query without actually computing the query. One motivation for this is query optimization: often a database query can be performed in different ways. If one way leads to a huge partial answer, then it might be better to perform the query by another method.

The problem we studied is quite simple to state: given a directed graph ${G}$, what is the size of the transitive closure of the graph? The transitive closure is simply the graph obtained by adding an edge from vertex ${x}$ to vertex ${y}$ whenever there is a directed path from ${x}$ to ${y}$. A graph ${G}$ can have few edges, and yet its transitive closure can be much larger. A prime example is the directed path

$\displaystyle 1 \rightarrow 2 \rightarrow 3 \rightarrow \dots \rightarrow n-1 \rightarrow n.$

This graph has ${n-1}$ edges. Its transitive closure has,

$\displaystyle n-1 + n-2 + \dots + 1 \approx n^{2}/2$

edges. For large ${n}$ this is a huge increase. What the database folks wanted to know, according to Jeff, was: for a given graph, how large is its transitive closure?
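To make the blow-up concrete, here is a small Python sketch (my own illustration, not from our paper) that counts closure edges by brute-force BFS from every vertex, and checks the path example above. It runs in quadratic time or worse, which is exactly why one wants the near-linear estimation discussed next.

```python
from collections import deque

def transitive_closure_size(adj):
    """Count pairs (x, y), x != y, with a directed path from x to y.

    adj maps each vertex to a list of out-neighbours.  This brute-force
    BFS from every vertex runs in O(n * (n + m)) time -- fine as a
    sanity check, but far from the linear-time goal in the post.
    """
    total = 0
    for s in adj:
        seen = {s}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    q.append(v)
        total += len(seen) - 1  # reachable vertices, excluding s itself
    return total

# The directed path 1 -> 2 -> ... -> n from the post.
n = 100
path = {i: ([i + 1] if i < n else []) for i in range(1, n + 1)}
print(transitive_closure_size(path))  # n-1 + n-2 + ... + 1 = 4950
```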

Luckily, we did not need the exact answer, an approximation would be fine: we could be off by a multiplicative factor. But, the time for doing the estimate had to be as close to linear as possible. We were able to partially solve this problem:

Theorem: Suppose that ${G}$ is a directed graph with ${n}$ vertices. If the degree of ${G}$ is bounded, then there is a randomized algorithm that, in ${O(n)}$ time, determines the size of the transitive closure of ${G}$ to within a factor of ${2}$, with high probability.

See our paper for the details, and for a more general trade-off between the running time and the accuracy. Later Edith Cohen proved the theorem without the restriction of bounded degree. Her work is very clever, and is based on seeing the structure of the problem much more clearly than we did.

A Sampling Lemma

I will not go into detail on how we proved this theorem, but will instead explain our main lemma. Perhaps this lemma can be used in other applications.

Consider an urn that contains ${n}$ balls. Each ball has a label, which ranges from ${1}$ to ${n}$. The task is to estimate the sum of the labels on the balls. As you might guess (it’s an urn), you may randomly remove a ball and see its label. The catch is this: the cost of sampling a ball is equal to its label.

The problem is to estimate the sum of the labels, but keep the cost low. This seems to force the sampling to be adaptive. Here is a heuristic argument that shows that using a fixed number of samples will not work.

Let Alice attempt to estimate the label sum of an urn created by Bob. Suppose she fixes the number of samples in advance to ${s}$. If ${s}$ is large, then Bob will make all the balls have label ${n}$, and her cost will be ${sn}$, which is non-linear. On the other hand, if ${s}$ is small, then Bob has two choices. He could make the urn all ${1}$‘s. Or, for a parameter ${t>1}$, he could make the urn contain order ${\frac{n}{st}}$ balls with label ${n}$ and the rest with label ${1}$. In the latter case, with probability

$\displaystyle (1-\frac{1}{st})^{s} \approx 1-1/t$

Alice will see only balls with the label ${1}$; but the sum of the labels is ${\Omega(n^{2}/(st))}$. Alice seems to be stuck: if she says the urn sum is about ${n}$, then she can be wrong; if she guesses it is larger, then she can also be wrong.
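The heuristic can be checked numerically. The sketch below is my own construction (the constants ${s}$, ${t}$, and the trial count are arbitrary choices): it builds Bob's bad urn and shows that a fixed-sample-size estimator badly underestimates the sum most of the time.

```python
import random

def fixed_sample_estimate(urn, s, rng):
    """Alice's non-adaptive estimator: mean of s samples, scaled by n."""
    n = len(urn)
    return n * sum(rng.choice(urn) for _ in range(s)) / s

# Bob's bad urn: n/(s*t) balls of label n hidden among labels 1.
n, s, t = 100_000, 10, 10
heavy = n // (s * t)
urn = [n] * heavy + [1] * (n - heavy)

rng = random.Random(0)
true_sum = sum(urn)  # about n^2/(s*t) = 10^8
trials = [fixed_sample_estimate(urn, s, rng) for _ in range(1000)]
fraction_fooled = sum(e < true_sum / 2 for e in trials) / len(trials)
print(fraction_fooled)  # roughly (1 - 1/(s*t))^s, i.e. about 1 - 1/t = 0.9
```

Missing every heavy ball makes Alice's estimate about ${n}$, a factor of roughly ${n/(st)}$ too small, and that happens in about nine trials out of ten here.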

There is a simple adaptive sampling algorithm that does work. Randomly sample, with replacement, balls with labels ${b_{1}, \dots, b_{k}}$ until

$\displaystyle b = b_{1} + b_{2} + \dots + b_{k} \ge cn$

where ${c>0}$ is a constant. Then, stop and estimate the sum of the urn as

$\displaystyle v = \frac{bn}{k}.$

This is clearly an unbiased estimator for the sum of the balls in the urn. The key issue is whether the estimate ${v}$ is likely to be good.

Lemma: The value ${v}$ is likely to be within a factor of ${2}$ of the urn sum, for ${c}$ large enough.

The exact relationship between ${c}$ and the probability is in our paper. Note that the method always has at most linear cost.
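Here is a sketch of the adaptive sampler in Python. It is my own illustration: the constant ${c}$ and the test urn are arbitrary choices, and the paper's actual relationship between ${c}$ and the failure probability is sharper than this demo suggests.

```python
import random

def adaptive_urn_estimate(urn, c, rng):
    """Sample with replacement until the running label sum b reaches c*n,
    then estimate the urn sum as v = b*n/k, as in the post.  The sampling
    cost (the sum of the sampled labels) is at most c*n plus one final
    label, so the cost is always linear in n for constant c."""
    n = len(urn)
    b, k = 0, 0
    while b < c * n:
        b += rng.choice(urn)
        k += 1
    return b * n / k

# The adversarial urn from the heuristic argument above.
n = 100_000
heavy = n // 100
urn = [n] * heavy + [1] * (n - heavy)
true_sum = sum(urn)

v = adaptive_urn_estimate(urn, c=16, rng=random.Random(1))
print(true_sum / 2 <= v <= 2 * true_sum)  # typically True for c this large
```

Note how adaptivity sidesteps Bob: on a heavy urn the stopping rule triggers after few samples, while on an all-${1}$'s urn it keeps sampling cheaply until the estimate is reliable.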

One day, when Jeff was still at Princeton, he pointed out a curious fact to me: in the last few years, he said, no assistant professor had been hired by the department when I had attended their job talk. For example, I missed Jeff’s. I do like to ask questions, and sometimes can get into trouble. I did once ask a question at a systems job talk that got the answer: “that is a dangerous question.” But that’s another story. Jeff’s comment bothered me, so my solution was to attend all the job talks the following year, every single one. Since we did hire that year, the streak was broken.

The second question arose from a job talk I attended with Jeff at Princeton (okay, I did sometimes cause problems). The speaker was talking about a network routing problem, and he stated a theorem of the form:

Theorem: Against any adversary, with high probability every routing step will take time at most ${O(\log n)}$.

This is only the outline of the theorem, but as he stated it, I realized that it was “false”: there were adversaries that could defeat the algorithm. Yet his theorem was technically correct.

Confused? The issue is a fundamental assumption built into the adversary worst-case model, one that is usually true. But in his application I thought it could easily be false, since he allowed the routing requests to be created by potentially adversarial programs. After his talk, Jeff and I started to discuss the issue, and we soon realized that there was a problem with the worst-case model: it was not worst case enough.

The Attack

The issue is clocks. Algorithms can have clocks, and therefore algorithms can tell time. This ability is missing in the usual model, but is always there in practice. Instead of explaining the problem with the job candidate’s result, I will explain it in the simpler context of hashing with linear chaining.

This is a basic and popular data structure. Recall that a hash function selects where to place an object, and if that location is taken, the object is put into a linear list. To look up an object, the hash function is used to find the linear list, and then that list is searched.

Both the insertion and lookup operations take time ${O(1)}$, provided the number of objects is not too large compared with the range of the hash function. However, if we assume that an adversary supplies the values to be stored, then it is possible to make the behavior of the data structure dramatically worse. The adversary need only arrange that all the values hash to the same location.
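As a concrete sketch (mine, not from the paper), here is a chained table attacked by an adversary who knows the hash function. One assumption worth flagging: CPython's built-in hash is the identity on small non-negative integers, which is what makes the "all values in one bucket" attack trivial here.

```python
class ChainedHashTable:
    """Hashing with linear chaining: each bucket is a Python list."""

    def __init__(self, m, h):
        self.m = m          # table size (range of the hash function)
        self.h = h          # hash function: key -> integer
        self.buckets = [[] for _ in range(m)]

    def insert(self, key):
        self.buckets[self.h(key) % self.m].append(key)

    def lookup(self, key):
        # Walk the chain at key's bucket; cost = chain length.
        return key in self.buckets[self.h(key) % self.m]

m = 1024
table = ChainedHashTable(m, hash)

# An adversary who knows h inserts keys that all hash to bucket 0, so
# every operation on that bucket now costs Theta(n) instead of O(1).
for k in range(0, 1000 * m, m):   # hash(k) % m == 0 for these ints
    table.insert(k)
print(len(table.buckets[0]))  # 1000 -- one long chain
```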

The standard answer to this threat, in both theory and practice, is to select the hash function randomly from a family of functions. Then, it can be proved that, with high probability, the hash table operates as expected: operations take constant time.

This is essentially what our job candidate claimed in his routing theorem. What Jeff and I showed is that this is wrong: provided the hash function family is known, there is an attack that can make the data structure perform badly. The attack is based on the ability to time operations.

The attack works in two phases. In the first phase the exact hash function is discovered by using timing. In the second phase this is used to flood the data structure with values that are stored at the same location.

The first phase is simple: start with an empty data structure and begin adding randomly selected ${x}$‘s to the data structure. Suppose that we had a way to discover ${x}$ and ${y}$ that collide, i.e. so that

$\displaystyle h_{j}(x) = h_{j}(y)$

where ${h_{j}}$ is the secretly chosen hash function from the family. Then, we show that we can unravel the value of ${j}$. Note, this works for a large class of popular hash functions. For others, several values may be needed: see the paper for details.

How do we find collisions? Since the values are random, collisions will occur, by the “birthday paradox,” after about ${\sqrt m }$ operations, where ${m}$ is the range of the hash functions. The remaining issue is how to detect that a collision has occurred. The attack uses the ability to clock operations to discover which are collisions and which are not. Simply put, collisions yield longer lists, which take more time to search. If you are worried about how accurate the timing must be, we can repeat operations to get a better statistical sample. The idea is sound, and real implementations of this attack do occur.
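Here is a toy end-to-end sketch of phase one. It is entirely my own construction: the family ${h_j(x) = ((j \cdot x) \bmod p) \bmod m}$ and all constants are invented for illustration, and chain length stands in for measured wall-clock time. The attacker inserts random keys until a timed lookup walks a chain of length at least ${2}$, identifies the colliding partner by timing the earlier keys, and then brute-forces the family for the values of ${j}$ consistent with that collision.

```python
import random

class TimedTable:
    """Chained hash table whose lookups report the chain length walked --
    a stand-in for the wall-clock timing a real attacker would measure."""

    def __init__(self, m, h):
        self.m, self.h = m, h
        self.buckets = [[] for _ in range(m)]

    def insert(self, key):
        self.buckets[self.h(key) % self.m].append(key)

    def timed_lookup(self, key):
        chain = self.buckets[self.h(key) % self.m]
        return key in chain, len(chain)      # (found, "time" in steps)

# Defender secretly picks j; the family ((j*x) % p) % m is public.
p, m = 1009, 101
secret_j = random.Random(42).randrange(1, p)
table = TimedTable(m, lambda x: ((secret_j * x) % p) % m)

# Phase 1: insert random keys until one lands in an occupied bucket.
# By the birthday paradox this takes about sqrt(m) insertions.
rng = random.Random(7)
inserted, x0 = [], None
while x0 is None:
    x = rng.randrange(1, 10**6)
    table.insert(x)
    if table.timed_lookup(x)[1] >= 2:
        x0 = x                # x0's lookup is slow: it collided
    else:
        inserted.append(x)

# Timing the earlier keys exposes the partner: it shares x0's long chain.
partners = [y for y in inserted if table.timed_lookup(y)[1] >= 2]

# "Unravel" j: keep every j in the family consistent with the collision.
candidates = {j for j in range(1, p)
              if any(((j * x0) % p) % m == ((j * y) % p) % m
                     for y in partners)}
print(secret_j in candidates)
# Usually only a few j's survive; a second collision narrows it further.
# Phase 2 would then flood one bucket with preimages under the known j.
```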

Time in Cryptography

Timing attacks have been used in cryptography to break systems that would otherwise be unbreakable. The extra information available via timing often suffices to break very powerful systems.

Paul Kocher first showed how to use timing to break systems like RSA, while Dan Boneh and David Brumley later showed how to attack OpenSSL, a widely used implementation of the SSL/TLS protocols. This latter work is especially interesting, since Dan and David can make their attack work against protocols running on remote servers.

I think in the future I will discuss these attacks in more detail, but for now it is interesting to point out that time can be used in real system attacks. Jeff and I were the first to publish on the potential of using time to break systems, but it seems clear that Paul’s work was completely independent of ours.

Open Problems

One open problem is to determine the exact size of the transitive closure of a graph in linear time. While the randomized approximation algorithms are quite useful, there is no reason that I know of that precludes the possibility of a linear-time exact algorithm. Note that for undirected graphs the problem is easy to do in linear time: find the connected components, and then compute the size of the transitive closure directly from the component sizes.
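For the undirected case, the recipe above is easy to code. This is my own sketch, counting ordered pairs to match the directed-edge convention used earlier in the post: find the components, then sum ${k(k-1)}$ over component sizes ${k}$.

```python
from collections import Counter

def undirected_closure_size(n, edges):
    """Closure size of an undirected graph on vertices 0..n-1: every
    ordered pair of distinct vertices in the same connected component
    is joined in the closure, so the answer is the sum of k*(k-1)
    over component sizes k.  Union-find gives near-linear time."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)

    sizes = Counter(find(v) for v in range(n))
    return sum(k * (k - 1) for k in sizes.values())

# Two triangles plus an isolated vertex: 3*2 + 3*2 + 0 = 12.
print(undirected_closure_size(7, [(0,1),(1,2),(2,0),(3,4),(4,5),(5,3)]))  # 12
```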

Can the techniques of property testing be applied to estimate the size of the transitive closure? Or are there other ways to estimate it? Note that the estimation of the transitive closure of a directed graph is closely related to a matrix problem. Suppose that ${A}$ is an ${n}$ by ${n}$ sparse invertible matrix. How many non-zero entries will ${A^{-1}}$ have? Perhaps this connection can be used with known low-rank approximation methods to beat our bounds.

Finally, the “lower bound” I sketched to motivate adaptive sampling is not a rigorous theorem. Make it one. What are the best bounds that are possible for this urn problem? Or even for generalizations where balls can have general labels?

June 21, 2009 7:42 pm

I had the pleasure of reading your (adaptive sampling) paper in my first year and applied the elegant technique in one of my papers [1]. I am sure there are other unexplored applications of adaptive sampling in social network analysis.

Approximating Betweenness Centrality
In the Proceedings of the Workshop on Algorithms and Models for the Web-Graph (WAW 2007).

June 21, 2009 9:25 pm

Just to add to the mathematician-magicians, Persi Diaconis was a professional magician before beginning to seriously study mathematics.

http://news.stanford.edu/news/2004/june9/diaconis-69.html

I think it has something to do with creativity.

June 22, 2009 6:42 am

Yes he was a professional–great point. Creativity is one issue, but I wonder if there are other connections?

June 22, 2009 2:47 am

Is it known how to compute the size of the transitive closure in linear time when the graph is a dag? This could be an interesting special case to consider, since the dag of strongly connected components can always be computed in linear time.

June 22, 2009 6:39 am

No, the dag is the hard case. We can use linear time to find the strongly connected components, so the whole issue is estimating the size of the closure of the dag.

I should have pointed this out, great question.

June 22, 2009 4:10 am

Is your paper on clocked adversaries available online without a paywall? It sounds very relevant to the issue of algorithmic-complexity attacks, which have gotten some attention in computer security. http://www.cs.rice.edu/~scrosby/hash/ is the best-known paper I’m aware of. It says

Tim Peters and Solar Designer have claimed that it may be possible to force collisions even in the case of keyed cryptographic or universal hashing, by statistically attempting to identify, due to detectable processing latency, whether two items collide or not. If this attack is possible, it would probably require a significant number of probes, perhaps greater than the cost of the attack itself, but would have the potential to attack any hash table that didn’t have its key altered regularly. It is unknown if this attack is practical.