Streaming models started decades ago, yet they continue to become even more exciting

Ian Munro is one of the world’s experts on data structures. If you need to know anything about data structures, either lower bounds or upper bounds, he is one of the few people I would call for advice. If he does not know, or he does not know who knows, or he cannot suggest an approach, you are in deep trouble.

Today I will present some work that Atish Das Sarma and Danupon Nanongkai have recently done on a variation of the streaming model, which was just presented at TAMC 2009. Ian also wrote one of the first papers on streaming, before there was streaming. More on this later.

Atish and Danupon are two graduate students who are working with me, but they are both quite independent and have already written papers with various colleagues from various places on various topics. How is that for a vague statement? In any event, I want to come clean, since I worked with them on their results.

Ian Munro is a long-time friend, and I have many stories about him, which could fill up an almost infinite number of posts.

Ian’s great sense of humor always makes it a pleasure to work with him. Whenever we have worked on a problem and gotten stuck, Ian has a trick that he uses to get us back on track. He says,

“let’s make a list of what we need to do.”

He then proceeds to take a piece of paper and writes on the paper:

1. ${\dots}$.
2. ${\dots}$.
3. ${\dots}$.
4. Make a list.

where ${\dots}$ are the detailed things that we need to do in order to solve the problem in question. For example, the first item might be: find an algorithm that converts any ${X}$ into a ${Y}$ in linear time. He then takes the list and with his pen crosses off the last item, so that the list becomes:

1. ${\dots}$.
2. ${\dots}$.
3. ${\dots}$.
4. ~~Make a list.~~

He then would say,

“well we can take a break, since we have done 25 per cent of the work.”

And off we go to get a drink, and hopefully also get an inspiration.

One of my other favorite Ian stories is about the time he got tenure. At STOC that year, he told a group of us that he had just gotten tenure. We all congratulated him—it was wonderful news. He then proceeded to point out that at Waterloo—where he was and still is—tenure is not official until the next trustee meeting. The trustees’ signing off on all the year’s tenure decisions was, of course, pro forma. But, it would not happen for about a month.

Ian then raised an interesting “open problem:” what could he do, in the next month, to get his tenure decision reversed? What behavior would undo tenure? Only Ian. It was a fun discussion, as we argued over which behaviors would lead to tenure being revoked, and which would not. Sometimes the group agreed that a given behavior would lead to revocation while another would not; sometimes we were divided. Thankfully, Ian decided not to test the various hypotheses, and he got his tenure.

Let’s now turn to talking about the streaming model, and the work of Atish and Danupon.

The Streaming Model

The streaming model is both new and old. In computer science we sometimes create entirely new areas of study. However, more often than not, a hot new area is an old one in disguise—or an old one with a slight twist. I think this is the case with the current area that is called streaming. As you probably know, in the basic streaming model, data is sent one bit after another to the algorithm, which must compute the answer after the last bit has been sent.

So far the model does not restrict the algorithm at all—just have the algorithm store the input, and then run any method to solve the given problem. This strategy would make the streaming model trivial, so the rules of streaming are that the algorithm is always space limited. That is, if the stream is ${n}$ bits long, then the algorithm is only allowed to store ${s}$ bits, where ${s}$ is always much smaller than ${n}$. For example, ${s}$ might be restricted to be ${\sqrt n}$ or even to be polylog in ${n}$: that is ${s}$ might have to be bounded by ${\log^{O(1)} n}$.
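To make the space restriction concrete, here is a toy illustration (mine, not from any paper): counting the number of ${1}$'s in an ${n}$-bit stream needs only a single counter, so the state fits in ${O(\log n)}$ bits.

```python
def count_ones(stream):
    """One-pass streaming algorithm: the only state is a counter,
    so a stream of n bits is processed in O(log n) bits of space."""
    count = 0
    for bit in stream:
        count += bit  # each stream item is a single bit, 0 or 1
    return count

# The stream arrives one bit at a time; the algorithm never stores it.
print(count_ones(iter([1, 0, 1, 1, 0, 0, 1, 0])))  # → 4
```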

The space restriction makes the model interesting, but does it make the model new? I think no and yes. The basic setup is nothing more than asking: what is the smallest number of states that are needed by a finite state automaton (FSA) to solve the given problem? You know I love FSA's, and have previously shown their power.

So what is new with streaming? Partly it is that streaming algorithms usually use randomness, but we could of course replace an FSA by a probabilistic finite state automaton. So that is not new either, since such models have been studied for decades.

I think streaming is new for several reasons—even if the basic model is not really new. The reasons are:

${\bullet}$ The motivation is new. The size of inputs has increased to the point where they are extremely large. So large that the basic streaming model is no longer merely a theoretical model, but has become a practical model. If the data you wish to process is huge, then you may be forced to use the streaming model. For example, the new accelerator at CERN will generate so much sensor data that they plan on using streaming algorithms—the amount of data created per second is so large that they really have no choice.

${\bullet}$ The class of problems is different. Studying FSA's involved very different problems, which were often artificial ones. Now the main problems are more “real” problems: graph problems, linear algebra problems, statistical problems, and many others.

${\bullet}$ The results are new. Many new and extremely clever ideas have been found by researchers working on the streaming model. The general streaming framework has been studied extensively since the seminal work of Noga Alon, Yossi Matias, and Mario Szegedy. Their beautiful results and new ideas excited the theory community tremendously, which led to their receiving the Gödel prize in 2005 for their work.

${\bullet}$ The best reason—in my opinion—is that the streaming model generalizes the notion of input that was previously used in the study of FSA’s. Streaming models allow more complex rules on how the data arrives to the algorithm. Let’s call this the order type of the streaming model. I will, in the next section, explain this in detail.

Yet, already in 1978 Ian Munro and Mike Paterson had results on computing the median in a model that was essentially the streaming model. They proved:

Theorem: The median of ${n}$ numbers can be computed in ${O(\sqrt n)}$ space with high probability in one pass over the numbers.

And,

Theorem: The median of ${n}$ numbers can be computed in ${O(\sqrt n \log n)}$ space deterministically with two passes over the numbers.

So is streaming new? You decide.

Order Types of Streaming Models

In the classic FSA model the order of the data is set by the definition of the problem. I have already talked about BDD's, which allow the input order to be selected so that the task of solving the problem is as easy as possible. Streaming models allow an even richer class of order types, which is the principal reason they are exciting.

To explain the notion of order type let us focus on problems that concern matrices. The general ideas apply to many other areas, but I think that the explanation will be easier if we focus first on matrices. Let ${M}$ be an ${n \times n}$ matrix of integer values, with entries ${M_{ij}}$. We assume that ${n^{2}}$ entries of the form ${(m,i,j)}$ make up the input: ${(m,i,j)}$ means that ${M_{ij} = m.}$

Here are some possible input order types:

• Natural: This is an order that only depends on the input being a matrix. Some choices include row order and column order. Note the order has nothing to do with the problem being considered.
• Best Oblivious: This is the order that is used by BDD’s. The entries of the matrix are sent in the manner that is best for the problem. However, the order can only depend on the problem, not on the actual matrix ${M}$. This is why the order is called “oblivious.”
• Adversarial: This order allows an adversary to look at the input matrix ${M}$ and then decide what order to send the matrix entries to the algorithm. The order thus can be non-oblivious, and the adversary can try to make the input ordering as nasty as possible. This is the most popular, and standard order type for streaming models.
• Random: This is just what you would guess: the entries of ${M}$ are sent in a random order.
• Sorted: This order sends the entries of ${M}$ after they have been sorted according to some rule. For example, the order could be: send the entries sorted first by the value, and then by the indices.
• Best Order: This is a new order that was introduced by Atish and Danupon. More on this in a minute.

In all these models, the ability to send the input to the algorithm more than once is often allowed. Thus, we could look at the random order type, but allow ${2}$ passes over the input. We could even mix the order types: first random, then adversarial—for example. Note, Ian and Mike's results were for the adversarial model with either one or two passes.

Best Order Streaming Model

Any streaming model can be viewed as a protocol between Alice and Bob. Alice wants to determine some property of the matrix ${M}$, and Bob sends the entries of the matrix to Alice. For example, in the adversarial streaming model, Bob can look at the whole matrix and select the worst order possible. Clearly, just as the standard worst case complexity model is the “best”, this model is the best if Alice can solve the problem in a reasonable amount of storage.

Unfortunately, for many problems this model is too hard. In one of the earliest papers in this area, it was shown that many problems require a prohibitively large amount of space to solve, if the order is worst case. This result has been extended by a more recent result that shows that most graph problems cannot be solved efficiently—even if multiple passes are allowed.

In the best order model Alice and Bob work together: however, Alice does not trust Bob, so she must check that Bob is playing fair. Her big advantage over the adversarial streaming model is that Bob will send the entries in an order that is helpful. Still Alice must check that Bob is playing by the rules. Another way to think of the best order model is that Bob is a prover and Alice is a checker. The prover can play fair and give Alice an easy-to-check proof, or he can try to cheat. In either case Alice must carefully check Bob.

Perhaps an example will help. Suppose that Alice wants to determine whether or not a matrix is the adjacency matrix of a directed cycle of length ${n}$. Let the cycle be:

$\displaystyle a_{1} \rightarrow a_{2} \rightarrow \dots \rightarrow a_{n} \rightarrow a_{1}.$

Then, Bob will send the entries in the order: first ${a_{1} \rightarrow a_{2}}$, then ${a_{2} \rightarrow a_{3}}$, and so on until ${a_{n} \rightarrow a_{1}}$; finally, he will send all the zero entries. Clearly, for this ordering Alice needs only to remember the start vertex, then check that each step is correct and that the last step is back at the start. Her storage is ${2\log n}$ and the algorithm is deterministic.

Well not quite. Alice is a checker, so she must be prepared that Bob could send a vertex twice:

$\displaystyle 1 \rightarrow 2 \rightarrow 3 \rightarrow 1 \rightarrow 4 \rightarrow 1$

is not valid but would fool Alice, if she is not careful. Thus, Alice must also check that this does not happen. She can do all this in polylog space provided she can use randomness.
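Here is a sketch of Alice's whole check, including the duplicate-vertex trap. The particular fingerprint—a standard polynomial identity test at a random point—is my own illustrative choice, not taken from the paper; Alice stores only the start vertex, the previous vertex, a counter, and two fingerprints.

```python
import random

def check_cycle(n, edges, p=(1 << 61) - 1):
    """Check that a stream of directed edges, sent in cycle order,
    is exactly the cycle a_1 -> a_2 -> ... -> a_n -> a_1.  (The
    trailing all-zero matrix entries are assumed to be checked to
    be zero and skipped.)  Duplicate vertices, as in the bogus
    stream 1 -> 2 -> 3 -> 1 -> 4 -> 1, are caught with high
    probability: the product of (x - u) over the edge sources u
    must match the product of (x - i) for i = 1..n at a random
    point x, modulo the prime p."""
    x = random.randrange(p)
    fp_seen = fp_all = 1
    start = prev = None
    count = 0
    for u, v in edges:
        if start is None:
            start = prev = u
        if u != prev or not (1 <= u <= n and 1 <= v <= n):
            return False          # the walk breaks or leaves [1, n]
        fp_seen = fp_seen * (x - u) % p
        prev = v
        count += 1
    for i in range(1, n + 1):
        fp_all = fp_all * (x - i) % p
    return count == n and prev == start and fp_seen == fp_all
```

On the honest ordering the check accepts; on the bogus stream above, the counter or the fingerprint catches Bob with overwhelming probability.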

The Main Results

Atish and Danupon prove a number of results in their paper. They study the best order complexity of several graph properties. Instead of entries of a matrix, now the objects being sent by Bob are edges of the graph. The pioneering work in this area was done by Joan Feigenbaum, Sampath Kannan, Andrew McGregor, Siddharth Suri, and Jian Zhang in their beautiful paper. They proved, among many other results, the following:

Theorem: In the adversarial streaming model, proving a graph is connected requires ${\Omega(n)}$ space.

As usual ${n}$ is the number of vertices of the graph. However, in the best order model the following is true:

Theorem: In the best order streaming model, proving a graph is connected can be done by a randomized algorithm in ${O(\log^{2} n)}$ space.

Thus, there is an exponential improvement from the adversarial order to the best order.

I will briefly sketch the idea of the proof of the best order result; as usual, please see the paper for the details. The key idea is that Bob will first send a spanning tree of the graph. Then, Alice only needs to check connectivity for trees. The trick is to break up the spanning tree into pieces so that Alice can check each piece easily, and also so that she can put the pieces together. Recall all this must be done in small space—${O(\log^{2} n)}$ space.

Another result is:

Theorem: The problem of proving that a graph has a perfect matching can be done by a randomized algorithm in ${O(\log n)}$ space in the best order streaming model.

Covert Channels

In the above protocols it is necessary for Bob to send more than edges; sometimes he needs to send short messages to Alice. For example, in the connectivity result he needs to send the message: “that is the end of the spanning tree.” In the best order model, as in the other streaming models, only the edges can be sent. Yet Bob can insert short comments by using a covert channel: he can encode messages into how he sends the edges to Alice. There are several ways to do this. Suppose we have an ordering on the vertices of the graph, and consider an edge between ${x}$ and ${y}$ with ${x < y}$. One of the simplest ways is to send the edge as ${(x,y)}$ or as ${(y,x)}$ to encode a single bit. When Alice gets an edge ${(u,v)}$ she checks whether ${u < v}$. In this way Bob can send her one bit per edge.

Note, we could have allowed Bob to inject “messages,” but we liked the fact that the best order model stays like all the other streaming models. However, Bob can use the covert encoding tricks to send extra messages to Alice. In all the algorithms so far this covert trick has been sufficient. I like that covert channels, which are usually viewed as a problem in security, can be used here for a positive purpose.
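The one-bit-per-edge trick can be sketched directly; the vertex names and the fixed ordering are the only assumptions.

```python
def encode_edges(edges, bits):
    """Bob: for each undirected edge, choose which endpoint to send
    first; the orientation covertly carries one bit."""
    out = []
    for (u, v), b in zip(edges, bits):
        lo, hi = min(u, v), max(u, v)
        out.append((lo, hi) if b == 0 else (hi, lo))
    return out

def decode_bits(stream):
    """Alice: recover the covert bit from each edge's orientation."""
    return [0 if u < v else 1 for (u, v) in stream]
```

For example, `decode_bits(encode_edges([(3, 1), (2, 5)], [1, 0]))` recovers `[1, 0]`.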

Here is the sketch of the protocol for the perfect matching problem—note the need for one covert message. Bob sends the ${n/2}$ edges of a perfect matching to Alice, followed by a message “that is all of the matching,” which can be implemented by flipping the order of vertices in the last edge. Then, Bob sends the rest of the edges. Alice has to check three things.

1. Find ${n}$: it is twice the number of edges received up to and including the “that is all of the matching” marker.
2. Check if the first ${n/2}$ edges form a perfect matching. This can be done by checking if the union of the labels of vertices in the first ${n/2}$ edges equals ${\{1,2,\dots,n\}}$. This can be done by using “fingerprinting” via a random hash function.
3. Check if there are ${n}$ vertices. This is done by checking that the maximum vertex label is at most ${n}$.

Alice accepts if the input passes all the above tests. This yields a randomized protocol.
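Alice's three checks can be sketched as follows. The fingerprint is again a polynomial identity test at a random point, which is one standard way to implement the hashing—my illustrative choice, not necessarily the paper's.

```python
import random

def check_matching(edge_stream, p=(1 << 61) - 1):
    """Verify the perfect-matching protocol.  Bob sends each
    matching edge as (smaller, larger); the first edge flipped to
    (larger, smaller) is the covert end-of-matching marker.  Alice
    fingerprints the multiset of matched endpoints and compares it
    with the fingerprint of {1, ..., n}."""
    x = random.randrange(p)
    fp = 1
    count = 0          # matching edges seen so far
    max_label = 0
    ended = False
    for u, v in edge_stream:
        max_label = max(max_label, u, v)
        if not ended:
            if u > v:  # covert "that is all of the matching" marker
                ended = True
            fp = fp * (x - min(u, v)) % p * (x - max(u, v)) % p
            count += 1
    n = 2 * count      # check 1: n is twice the matching size
    fp_all = 1
    for i in range(1, n + 1):
        fp_all = fp_all * (x - i) % p
    # check 2: the matched endpoints are exactly {1, ..., n};
    # check 3: no vertex label exceeds n
    return fp == fp_all and max_label <= n
```

Here the matching edges `(1, 2)` and `(4, 3)`—the flip marks the end—followed by the remaining edges pass all three checks.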

Open Problems

There are many open problems in the streaming area in general, and with respect to the best ordering in particular. The following are some problems that have not been solved for best order:

• Is a graph bipartite?
• Is a graph planar?
• Is a graph 2-connected?
Comments
August 25, 2009 12:52 pm

The covert channels may be cute, but they are rather annoyingly artificial.

August 25, 2009 2:39 pm

Well, I think they are artificial by definition. I liked that we did not need to extend the model to add messages, which would be very different from the previous models. But I see your point. If you prefer to add messages, the theorems will still be true.

August 27, 2009 11:43 am

Yes, of course. I found your cloud-computing motivation in the paper interesting. However, while I might not trust Amazon to tell me if a graph is strongly connected (because of possible bugs), I might well trust them to follow simpler rules. I am not sure of a good example, but say the verifier is looking for all edges incident to a vertex to be presented at the same time. I would be willing to trust Amazon’s code for such a simple rule.

Theoretically, this would seem to boil down to a best-order streaming algorithm in which the dishonest provers also must satisfy certain order restrictions. Is there anything interesting to say here?

August 27, 2009 12:16 pm

I think there is lots to say. The usual idea is that, if we can, we would like to have the minimal trust possible—especially if the cost of that is not too steep. So I think there could be some interesting questions to look at here.

September 2, 2009 11:57 am

Hi,

An insightful post. I would add another item to your “why streaming is new” list: the emphasis on solving *approximate* problems. I believe (although I am not senior enough, so someone correct me if I am wrong) that there was little work done on approximation in automata theory. In contrast, approximation is what really makes streaming tick, IMHO. Here are some reasons:

- Even for some basic problems, like counting the number of distinct elements in a stream, you need linear space if you want to solve them exactly (even if you allow randomization). Once you approximate, though, you can reduce the space exponentially, down to polylog n. So you need to approximate to do something useful.

- Once you allow approximation, you can employ tools from other areas, such as metric embeddings, dimensionality reduction, and other friends with heavy hammers.

- For some reason (luck?), most approximate streaming algorithms actually provide (1+eps) approximation for any eps>0, with reasonable trade-offs between accuracy and space. This means you can tailor the algorithms to particular situations, which helps in applications.
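To make the distinct-elements point above concrete, here is a sketch of one classic approximate counter, a k-minimum-values estimator; the SHA-256 hash and the choice of k are illustrative only.

```python
import hashlib

def _unit_hash(item):
    # Deterministic hash into [0, 1); SHA-256 stands in for a
    # random hash function.
    d = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2.0**64

def distinct_estimate(stream, k=64):
    """k-minimum-values (KMV) sketch: keep only the k smallest
    distinct hash values seen.  If the k-th smallest is h, the
    number of distinct elements is roughly (k - 1) / h.  Space is
    O(k), independent of the stream length."""
    smallest = set()
    for item in stream:
        h = _unit_hash(item)
        if h in smallest:
            continue  # this element's hash is already kept
        if len(smallest) < k:
            smallest.add(h)
        elif h < max(smallest):
            smallest.discard(max(smallest))
            smallest.add(h)
    if len(smallest) < k:
        return len(smallest)  # exact when few distinct elements
    return int((k - 1) / max(smallest))
```

With k = 64 the estimate is typically within roughly 1/sqrt(k), about 12 per cent, of the true count—exponentially less space than exact counting.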

Also, talking about friends with heavy hammers: the fact that you can use communication complexity tools to prove (tight) streaming lower bounds was a really nice coincidence. For a person who does lower bounds if all attempts at algorithms fail, it is really important to know when to quit.