# New Streaming Algorithms for Old Problems

*A streaming algorithm for the classic Dyck language*

Claire Mathieu, previously Claire Kenyon, is an expert in the design and analysis of algorithms, which should be no surprise since she worked as a graduate student with two of the world’s best—Philippe Flajolet and Jeffrey Vitter. Claire has done and continues to do some very pretty work on all aspects of algorithms.

Today I plan to talk about a recent paper of Frédéric Magniez, Claire Mathieu, and Ashwin Nayak on streaming algorithms, and how it relates to work that was in vogue years ago.

I recall one day at Princeton when Bob Sedgewick and I had what we thought was a great idea. At the time we were working on a front end to **TeX**, which we called “notech.”

Notech was a preprocessor to **TeX** that extended some of the ideas that Don Knuth put into **TeX**: except we carried them to the extreme. Let me explain. Before **TeX** there were other languages that were typesetting languages. One was called Troff, and was developed at Bell Labs as part of the UNIX project. Troff was not bad—for example, many textbooks were written with it—but it had serious limitations. My first typeset paper was done with it, but Troff’s limits helped drive Knuth to invent **TeX**.

One of the features of Troff that I hated was that paragraphs were separated from each other not by a blank line as in **TeX**, but by the symbols *.PP*. This made the raw text very hard to read, in my opinion. In Troff this is how I would have written this paragraph:

```
.PP
One of the features of Troff that I hated was that paragraphs were
separated from each other not by a blank line as in TeX, but by the
symbols
.ft I
.PP.
.ft R
This made the raw text very hard to read, in my opinion. In Troff this
is how I would have written this paragraph:
```

I will stop here—the danger of recursion scares me.

I loved that **TeX** used the obvious rule that a blank line meant a new paragraph—what else could it mean? Bob and I worked out a series of similar rules each of the form:

if you typed something in a “reasonable way,” notech would try, via its rules, to guess what you meant, and typeset the material appropriately.

Notech was a pretty good guesser, and for years I used it as my own front-end to **TeX**. A notech file was mostly plain text—thus the name—except for math formulas. We could never figure out anything that was better than what Knuth invented for math formulas. Although we did use matching braces { and } as math delimiters, we left math basically alone. Here is a small part of a paper I wrote in notech:

A key lemma for the lower bound is the following:

Main Lemma: There exists a family of bounded degree directed acyclic graphs {G_n} on {n} vertices and an {\epsilon > 0} so that {G_n} has no {\epsilon n}-segregator with cardinality less than {\epsilon n}.

You might ask, what does this have to do with Claire? She is French and is of course fluent in both English and French. One day at a Princeton tea, Bob and I asked Claire a question: are there two words in French that have the same letters, but have different accents? If there were none, or if such words were rare, then we could program notech to automatically add accents to French words. This seemed to us to be a cool possibility, since accents were a pain to do in notech or **TeX**.

Her initial response was favorable. She said that she could not immediately think of any French words with the same letters but different accents. Bob and I were excited, and we explained to her that this would allow us to write a program that would supply the missing accents, since they were *redundant*.

Redundant did not go over well. Claire walked off and after a few minutes returned with a long list of words that changed with different accents. Our plan was ruined, and she had defended the French language. It would have been a great feature, in my opinion, if accents could be automatically added to a word. I still struggle to get them right—especially on names. Oh well.

Let’s turn to discuss her neat paper on streaming and Dyck languages, and leave accents behind.

**Streaming for Dyck Languages**

Frédéric Magniez, Claire Mathieu, and Ashwin Nayak have written a pretty paper titled *Recognizing well-parenthesized expressions in the streaming model*.

The language {\mathsf{Dyck}(2)} is the language of all properly nested parenthesis expressions over two different types of paired parentheses.

The parentheses ( and ) must be paired, and the parentheses [ and ] must be paired. Thus,

{([])[]}

is in the language and

{([)]}

is not. More precisely, the empty string is in the Dyck language; and if {x} and {y} are in the language, then so are {(x)}, {[x]}, and {xy}. Yes, there is a language {\mathsf{Dyck}(3)}, which has three types of paired parentheses, and {\mathsf{Dyck}(4)} has four, and so on. A simple insight is that {\mathsf{Dyck}(2)} is already the “general case” for the streaming problem.

They prove:

**Theorem:** There is a one-pass randomized streaming algorithm for {\mathsf{Dyck}(2)} with space {O(\sqrt{n \log n})} and time {\mathrm{polylog}(n)} per symbol. If the stream belongs to {\mathsf{Dyck}(2)}, then the algorithm accepts it with probability {1}; otherwise it rejects it with probability at least {1-\delta}, where {\delta < 1} is a constant.

They also prove a lower bound that almost matches their algorithm’s space bound. I really like their result. First, it uses a non-trivial method to manage a potentially large stack, which, I believe, could be used in other streaming algorithms. Second, their result is related to an old language theory result that I will discuss in the next section. I like the connection between the current hot topic of streaming and an old topic.

Here is a short overview of how they prove their theorem. As usual please look at their paper for the full proof.

It may help to explain the naïve—note the accent—algorithm for determining whether or not a string is in {\mathsf{Dyck}(2)}. The algorithm uses a pushdown, which is initially empty. I will abuse the definitions a bit and call the pushdown a stack: recall that a stack, in language theory, allows more operations than just push and pop.

The algorithm processes each input symbol in turn; thus, there are four cases based on the current input symbol:

- The input is (. In this case just push the symbol onto the stack.
- The input is [. In this case also just push the symbol onto the stack.
- The input is ). In this case check that the topmost symbol is a matching (. If it is not, then reject; otherwise, pop off the top of the stack.
- The input is ]. In this case check that the topmost symbol is a matching [. If it is not, then reject; otherwise, pop off the top of the stack.

When there is no more input, the algorithm accepts if and only if the stack is empty. It is not hard to see that this algorithm works correctly with only one pass over the input. Note, the string

{[(]}

will be rejected because at the third input symbol the stack will be

{[(}

and the input is {]}. Since the top {(} does not pair off with the input, rule (4) forces the algorithm to reject.
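The four rules are easy to implement directly. Here is a minimal Python sketch of the naïve one-pass algorithm (the function name and string-input convention are my own):

```python
def naive_dyck2(stream):
    """One-pass check that `stream` (a string over the four symbols)
    is well nested over the pairs () and []."""
    match = {')': '(', ']': '['}   # each close and its required open
    stack = []
    for symbol in stream:
        if symbol in '([':
            stack.append(symbol)   # rules (1) and (2): push opens
        elif not stack or stack.pop() != match[symbol]:
            return False           # rules (3) and (4): mismatch or underflow
    return not stack               # accept iff the stack ends empty
```

For example, `naive_dyck2("([])[]")` returns `True`, while `naive_dyck2("[(]")` hits rule (4) on the third symbol and returns `False`.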

Their clever insight is that it is possible to simulate the stack with much less space than the naïve algorithm uses. Note, the stack can get as large as order {n} in size for a string that is in the Dyck language: just have an input that starts with many ( and [ symbols.

The key idea is that the contents of the stack can be hashed, and in this manner the space requirements can be reduced from order {n} to almost {\sqrt{n}}. The hashing is not completely straightforward—they need to carefully look at the types of motions of the stack. They must analyze how the stack can move up and down during the operation of the naïve algorithm. See their paper for the details.
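Their actual construction is far more delicate, but a toy fingerprint, my own simplification and emphatically not their algorithm, conveys the flavor of hashing stack contents: track the current height, credit each open symbol at height {h} with a random value times {x^h}, debit each close the same way, and keep only the running sum modulo a prime. Every string in {\mathsf{Dyck}(2)} sums to zero, and many bad strings such as {([)]} are caught; but the check is one-sided and can be fooled, for instance by {([)(])}, which is exactly why the motions of the stack must be analyzed so carefully.

```python
import random

def toy_fingerprint(stream, p=2**61 - 1, seed=0):
    """Toy height-indexed fingerprint (NOT the Magniez-Mathieu-Nayak
    algorithm): returns 0 for every string in Dyck(2); usually, but
    not always, nonzero otherwise. Returns None on a height underflow
    or imbalance, which is already grounds for rejection."""
    rng = random.Random(seed)
    x = rng.randrange(2, p)                 # random evaluation point
    f = {'(': rng.randrange(1, p), '[': rng.randrange(1, p)}
    match = {')': '(', ']': '['}
    total, height = 0, 0
    for s in stream:
        if s in '([':
            height += 1                     # open: go up one level
            total = (total + f[s] * pow(x, height, p)) % p
        else:
            if height == 0:
                return None                 # more closes than opens
            total = (total - f[match[s]] * pow(x, height, p)) % p
            height -= 1                     # close: come back down
    return total if height == 0 else None
```

Running it, `toy_fingerprint("([)]")` is nonzero, but `toy_fingerprint("([)(])")` is {0} even though the string is not in the language: a single hash of the stack history is too weak on its own.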

**The Connection to Language Theory**

The Dyck language, named for Walther von Dyck, is fundamental to classic language theory. It is a context-free language, and it is “universal” in a sense made precise by the beautiful Chomsky-Schützenberger Theorem:

**Theorem:** A language {L} over {\Sigma} is context-free if and only if for some finite alphabet {\Gamma}, there is a homomorphism {h : \Gamma^* \rightarrow \Sigma^*} such that

{L = h(D_\Gamma \cap R)},

where {D_\Gamma} is the Dyck language over {\Gamma} and {R} is a regular language over {\Gamma}.

The theorem is, of course, due to Noam Chomsky and Marcel-Paul Schützenberger.

This connection raises some interesting questions about the relationship between their new streaming result and classic language theory. Can they use their method, for example, to give a streaming bound for all context-free languages?

I will end this discussion by pointing out that I have already discussed a close relative of the Dyck language. Consider the free group on two letters, with {(} and {)} a generator and its inverse, and {[} and {]} another such pair. Some words, like {()} or {)(}, trivially equal the identity {1}. Let us call the language of all words that equal {1} in this group {F}. {F} is like {\mathsf{Dyck}(2)} except that parentheses can cancel in either order: for example,

{)(}

is now allowed. Sometimes the language {F} is called, for obvious reasons, the two-sided Dyck language.
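Since cancellation can happen in either order, the stack-based check for the two-sided language is even simpler than the one for {\mathsf{Dyck}(2)}: cancel whenever the top of the stack and the incoming symbol form an inverse pair. A small Python sketch (the function name is mine):

```python
def is_two_sided_dyck(word):
    """Check whether `word` over ( ) [ ] equals the identity in the
    free group on two generators, i.e. lies in the two-sided Dyck
    language, where '()', ')(', '[]', and '][' all cancel."""
    inverse = {'(': ')', ')': '(', '[': ']', ']': '['}
    stack = []
    for s in word:
        if stack and stack[-1] == inverse[s]:
            stack.pop()            # cancel an adjacent inverse pair
        else:
            stack.append(s)        # otherwise the word stays reduced
    return not stack               # identity iff everything cancelled
```

So `is_two_sided_dyck(")(")` is `True`, while the commutator `([)]`, which is not the identity in the free group, returns `False`.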

Zeke Zalcstein and I showed years ago that the following is true:

**Theorem:** There is a one-pass randomized streaming algorithm for the two-sided Dyck language with space {O(\log n)} and time {\mathrm{polylog}(n)} per symbol. If the stream belongs to the language, then the algorithm accepts it with probability {1}; otherwise it rejects it with probability at least {1-\delta}, where {\delta < 1} is a constant.

Well *not exactly*. As Frédéric Magniez, Claire Mathieu, and Ashwin Nayak generously point out, we proved a theorem that is equivalent to this, but we proved our theorem before there was even a streaming model. I discussed our result and its strange history here.
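The idea behind our old theorem can be phrased in streaming terms. By Sanov's theorem, the integer matrices [[1,2],[0,1]] and [[1,0],[2,1]] generate a free group of rank two, so a word equals the identity exactly when the corresponding matrix product is the identity; working modulo a prime keeps the space down to {O(\log n)}. Here is a hedged sketch; the fixed prime is for illustration only, since in the real argument the prime is chosen at random so that a non-identity product remains non-identity with high probability:

```python
# Sanov's theorem: [[1,2],[0,1]] and [[1,0],[2,1]] generate a free
# group of rank 2 inside SL(2, Z). Map each bracket to a generator.
MATS = {
    '(': (1, 2, 0, 1),     # A
    ')': (1, -2, 0, 1),    # A^{-1}
    '[': (1, 0, 2, 1),     # B
    ']': (1, 0, -2, 1),    # B^{-1}
}

def identity_mod_p(word, p=(1 << 31) - 1):
    """Stream the word, keeping only the running 2x2 product mod p.
    A word equal to 1 in the free group always passes; a non-identity
    word fails unless p happens to divide the offending entries,
    which is unlikely for a random large prime."""
    a, b, c, d = 1, 0, 0, 1                      # identity matrix
    for s in word:
        e, f, g, h = MATS[s]
        a, b, c, d = ((a*e + b*g) % p, (a*f + b*h) % p,
                      (c*e + d*g) % p, (c*f + d*h) % p)
    return (a, b, c, d) == (1, 0, 0, 1)
```

For instance, `identity_mod_p("([])")` and `identity_mod_p(")(")` return `True`, while the commutator `([)]` multiplies out to a non-identity matrix and returns `False`.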

**Open Problems**

Is their upper bound tight? More fun might be to try to determine the streaming complexity of other languages. In particular, what happens for all context-free languages? What about linear ones?

Why is the Dyck language so much harder in the streaming model than the free group, the two-sided Dyck language?

Also their stack idea is quite ingenious—can similar ideas be used in other algorithms?

[Some minor updates.]

Rather than the one-sided Dyck languages, there is the “hardest” context-free language of Sheila A. Greibach, SIAM J. Comput. 2(4):304–310, 1973.

Yes. Extensions to the hardest CFL would be interesting. I would like to point out that the hardest CFL is used by Sudborough to give a machine-class characterization of the complexity class LogCFL.

Ivan Hal Sudborough. On the tape complexity of deterministic context-free languages. J. ACM, 25(3):405–414, 1978.

It’s amazing that mathematicians and scientists are still using TeX/LaTeX in 2009 — especially when there is a much better alternative, namely TeXmacs. TeXmacs is not just a front-end to TeX/LaTeX. Rather it has its own high quality real-time typesetting engine with WYSIWYG editing. See http://www.texmacs.org/Samples/texmacs.pdf

The fact that the free group has faithful, finite-dimensional representations deserves to be better known – your DLOG result is very nice. Of course, with a quantum computer one can do the same thing in _constant_ space: just choose two uniformly random unitaries A and B, in any number of dimensions d > 1, apply A, A^-1, B, or B^-1 in the appropriate order, and check to see if you’re close to whatever initial state you started with. If the word is not the identity, then the final state is uniformly random, and so you’ll reject with probability 1-1/d.

I guess the reason this sort of thing doesn’t work for the Dyck language is that semigroups are harder to represent with finite-dimensional matrices than groups are. Does the Dyck semigroup have any faithful finite-dimensional representations?

To rudely answer my own question: the Dyck semigroup has no faithful finite-dimensional representations, since the ideals starting with ), )), ))), and so on must correspond to an infinite descending sequence of subspaces.

Following an email correspondence with Cris Moore about his comment here, I think we have agreed that his quantum algorithm can’t be carried out in constant space, since the precision required during the application of the randomly selected unitaries increases with input length. In fact, there is a folk theorem that says that no one-pass quantum streaming algorithm can recognize a nonregular language with bounded error.

I did wonder whether it is okay to select matrices with a constant number of bits. Can you still do better than the classical algorithm?

Oops, I should have written “In fact, there is a folk theorem that says that no one-pass quantum streaming algorithm can recognize a nonregular language with bounded error IN CONSTANT SPACE.”.

French is complex, no doubt!!

Speaking about problems of words on groups (in this case not a recognition problem but an optimization one), I’ve been shocked recently by the following result of Jerrum: “The complexity of finding minimum-length generator sequences.” It is not NP-complete… it is PSPACE-complete, even when the generator set is restricted to two permutations!

As the author explains, “An interesting feature of this problem is that it does not fall under the headings of ‘two person games’ or ‘formal languages’ which cover the great majority of known PSPACE-complete problems.”

I have a question. Does anyone know if real-time Turing machines (one-pass streaming algorithms which spend one step on each input symbol) that use only sublinear workspace are any better than real-time Turing machines with NO worktape?

What if the space bound is sublogarithmic?

Here is the answer to my question: http://arxiv.org/PS_cache/arxiv/pdf/1108/1108.2613v1.pdf

a, à

ou, où

du, dû

passe, passé

lègue, légué

… and so on. If it were possible to do away with French accents, they wouldn’t even exist in the language. I don’t think any language can have a long history of using a set of symbols that could be discarded without altering the understanding. People usually don’t want to overload their memory without a serious purpose. By the way, in French, once you’ve done the job of mastering the many vowel sounds, you’re halfway to being able to speak it fluently… though I must admit that the remaining half is quite tough. 🙂