A streaming algorithm for the classic Dyck language

Claire Mathieu, previously Claire Kenyon, is an expert in the design and analysis of algorithms, which should be no surprise since she worked as a graduate student with two of the world’s best—Philippe Flajolet and Jeffrey Vitter. Claire has done and continues to do some very pretty work on all aspects of algorithms.

Today I plan to talk about a recent paper of Frédéric Magniez, Claire Mathieu, and Ashwin Nayak on streaming algorithms, and how it relates to work that was in vogue years ago.

I recall one day at Princeton when Bob Sedgewick and I had, what we thought, was a great idea. At the time we were working on a front end to TeX, which we called “notech.”

Notech was a preprocessor to TeX that extended some of the ideas that Don Knuth put into TeX: except we carried them to the extreme. Let me explain. Before TeX there were other typesetting languages. One was called Troff, and was developed at Bell Labs as part of the UNIX project. Troff was not bad—for example, many textbooks were written with it—but it had serious limitations. My first typeset paper was done with it, but Troff’s limits helped drive Knuth to invent TeX.

One of the features of Troff that I hated was that paragraphs were separated from each other not by a blank line, as in TeX, but by the symbols .PP. This made the raw text very hard to read, in my opinion. In Troff this is how I would have written this paragraph:

.PP
One of the features of Troff that I hated was that paragraphs were separated from each other not by a blank line, as in TeX, but by the symbols
.ft I
.PP.
.ft R
This made the raw text very hard to read, in my opinion. In Troff this is how I would have written this paragraph:

I will stop here—the danger of recursion scares me.

I loved that TeX used the obvious rule that a blank line meant a new paragraph—what else could it mean? Bob and I worked out a series of similar rules each of the form:

if you typed something in a “reasonable way,” notech would try, via its rules, to guess what you meant, and typeset the material appropriately.

Notech was a pretty good guesser, and for years I used it as my own front-end to TeX. A notech file was mostly plain text—thus the name—except for math formulas. We could never figure out anything that was better than what Knuth invented for math formulas. Although we did use matching ${\{}$ and ${\}}$ as math delimiters, we left math basically alone. Here is a small part of a paper I wrote in notech:

A key lemma for the lower bound is the following:

Main Lemma: There exists a family of bounded degree directed acyclic graphs {G_n} on {n} vertices and an {\epsilon > 0} so that {G_n} has no {\epsilon n}-segregator with cardinality less than {\epsilon n}.

You might ask, what does this have to do with Claire? She is French and is of course fluent in both English and French. One day at a Princeton tea, Bob and I asked Claire a question: are there two words in French that have the same letters, but have different accents? If there were none, or if such words were rare, then we could program notech to automatically add accents to French words. This seemed to us to be a cool possibility, since accents were a pain to do in notech or TeX.

Her initial response was favorable. She said that she could not immediately think of any French words with the same letters but different accents. Bob and I were excited, and we explained to her that this would allow us to write a program that would supply the missing accents, since they were redundant.

Redundant did not go over well. Claire walked off and after a few minutes returned with a long list of words that changed with different accents. Our plan was ruined, and she had defended the French language. It would have been a great feature, in my opinion, if accents could be automatically added to a word. I still struggle to get them right—especially on names. Oh well.

Let’s turn to discuss her neat paper on streaming and Dyck Languages, and leave accents behind.

Streaming for Dyck Languages

Frédéric Magniez, Claire Mathieu, and Ashwin Nayak have written a pretty paper titled Recognizing well-parenthesized expressions in the streaming model.

The language ${\mathsf{DYCK}(2)}$ is the language of all properly nested parenthesis expressions over two different types of paired parentheses.

The parentheses ( and ) must be paired, and the parentheses [ and ] must be paired. Thus,

$\displaystyle [\, (\, )\, ( \, )\, (\, )\, ]\, [\, (\, )\, ]$

is in the language and

$\displaystyle [\,(\,]\,)$

is not. More precisely, the empty string is in the Dyck language; if ${X}$ is in the language, then so are ${(X)}$ and ${[X]}$; and if ${X}$ and ${Y}$ are in the language, then so is the concatenation ${XY}$. Yes, there is a language ${\mathsf{DYCK}(3)}$, which has three types of paired parentheses, and ${\mathsf{DYCK}(4)}$ has four, and so on. A simple insight is that ${\mathsf{DYCK}(2)}$ is already the “general case” for the streaming problem.
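To make that insight concrete, here is a sketch (my own illustration, not taken from the paper) of the standard homomorphic encoding: the ${i}$-th of ${k}$ parenthesis types is spelled out using only the two bracket types, so a ${\mathsf{DYCK}(k)}$ instance becomes a ${\mathsf{DYCK}(2)}$ instance. The function names and word representation are mine.

```python
def encode(word, k):
    """Encode a word over k parenthesis types into a string over ( ) [ ].
    A word is a list of (type, is_open) pairs, with 1 <= type <= k.
    The i-th open parenthesis becomes "(" followed by i copies of "[";
    the i-th close becomes i copies of "]" followed by ")"."""
    out = []
    for typ, is_open in word:
        out.append(("(" + "[" * typ) if is_open else ("]" * typ + ")"))
    return "".join(out)

def dyck2(s):
    """Naive stack-based membership test for DYCK(2)."""
    stack, match = [], {")": "(", "]": "["}
    for c in s:
        if c in "([":
            stack.append(c)
        elif not stack or stack.pop() != match[c]:
            return False
    return not stack

# a matched pair of type 3 survives the encoding; a type mismatch does not
good = [(3, True), (3, False)]   # encodes to "([[[]]])"
bad  = [(3, True), (2, False)]   # encodes to "([[[]])"
```

One can check that dyck2(encode(good, 4)) holds while dyck2(encode(bad, 4)) fails; whether this exact encoding is the one the authors have in mind is my assumption.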

They prove:

Theorem: There is a one-pass randomized streaming algorithm for ${\mathsf{DYCK}(2)}$ with space ${O(\sqrt n \log n)}$ and time ${\mathsf{polylog}(n)}$. If the stream belongs to ${\mathsf{DYCK}(2)}$ then the algorithm accepts it with probability ${1}$; otherwise it rejects it with probability at least ${1-n^{-c}}$ where ${c>0}$ is a constant.

They also prove a lower bound that almost matches their algorithm’s space bound. I really like their result. First, it uses a non-trivial method to manage a potentially large stack, which, I believe, could be used in other streaming algorithms. Second, their result is related to an old language theory result that I will discuss in the next section. I like the connection between the current hot topic of streaming and an old topic.

Here is a short overview of how they prove their theorem. As usual please look at their paper for the full proof.

It may help to explain the naïve—note the accent—algorithm for determining whether or not a string is in ${\mathsf{DYCK}(2)}$. The algorithm uses a pushdown, which is initially empty. I will abuse the definitions a bit and call the pushdown a stack: recall a stack, in language theory, allows more operations than just push and pop.

The algorithm processes each input symbol in turn; thus, there are four cases based on the current input symbol:

1. The input is (. In this case just push the symbol onto the stack.
2. The input is [. In this case also just push the symbol onto the stack.
3. The input is ). In this case check that the topmost symbol is a matching (. If it is not, then reject; otherwise, pop off the top of the stack.
4. The input is ]. In this case check that the topmost symbol is a matching [. If it is not, then reject; otherwise, pop off the top of the stack.

When there is no more input, the algorithm accepts if and only if the stack is empty. It is not hard to see that this algorithm works correctly with only one pass over the input. Note, the string

$\displaystyle [\,(\,]\,)$

will be rejected because at the third input symbol the stack will be

$\displaystyle \stackrel{\big (}{[}$

and the input is ${]}$. Since the top does not pair off with the input, rule (4) forces the algorithm to reject.
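The four cases above translate directly into a short one-pass program (a minimal sketch; the function name is mine):

```python
def dyck2_stream(stream):
    """One-pass naive recognizer for DYCK(2).
    Space is proportional to the maximum stack depth, which can reach
    order n/2 on a valid input."""
    stack = []
    match = {")": "(", "]": "["}
    for c in stream:
        if c in "([":                  # cases 1 and 2: push the open symbol
            stack.append(c)
        else:                          # cases 3 and 4: top must pair off
            if not stack or stack.pop() != match[c]:
                return False
    return not stack                   # accept iff the stack ends empty
```

For example, dyck2_stream("[()()()][()]") accepts, while dyck2_stream("[(])") rejects at the third symbol, exactly as in the stack picture above.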

Their clever insight is that it is possible to simulate the stack with much less space than the naïve algorithm uses. Note, the stack can grow to order ${n/2}$ in size for a string that is in the Dyck language: just have an input that starts with many ( and [ symbols.

The key idea is that the contents of the stack can be hashed, and in this manner the space requirement can be reduced from order ${n}$ to almost ${\sqrt n}$. The hashing is not completely straightforward—they must carefully analyze the ways the stack can move up and down during the operation of the naïve algorithm. See their paper for the details.
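To see why the hashing is delicate, consider a toy fingerprint (my own illustration, not the paper's algorithm): weight each symbol by a random power of ${r}$ indexed by its stack depth, add on a push, and subtract speculatively on a pop, since in Dyck checking we know which opener a pop must match. The catch is that such a log-space fingerprint is fooled by strings whose symbols cancel two-sidedly, such as [)(] below; this is precisely the free-group behavior discussed later in the post, and it hints at why heavier machinery is needed for one-sided Dyck.

```python
import random

def fingerprint_check(stream, p=(1 << 61) - 1):
    """Toy depth-indexed fingerprint for DYCK(2) -- NOT a correct recognizer.
    Push adds code(sym) * r^depth; a pop subtracts the code of the opener it
    *should* match at that depth. Accept iff the depth stays nonnegative and
    both the depth and the fingerprint end at zero."""
    r = random.randrange(2, p)
    code = {"(": 1, "[": 2}
    F, depth = 0, 0
    for c in stream:
        if c in "([":
            F = (F + code[c] * pow(r, depth, p)) % p
            depth += 1
        else:
            if depth == 0:
                return False
            depth -= 1
            opener = "(" if c == ")" else "["
            F = (F - code[opener] * pow(r, depth, p)) % p
    return depth == 0 and F == 0
```

It correctly accepts [()] and rejects ([)], but it also accepts [)(], which is not in ${\mathsf{DYCK}(2)}$: the mismatched symbols sit at equal depths and cancel.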

The Connection to Language Theory

The Dyck language, named for Walther von Dyck, is fundamental to classic language theory. It is a deterministic context-free language, yet it is “universal” in a sense made precise by the beautiful Chomsky-Schützenberger Theorem:

Theorem: A language ${L}$ over ${\Sigma}$ is context-free if and only if for some finite alphabet ${\Gamma}$, there is a homomorphism ${h:\Gamma^{*} \rightarrow \Sigma^{*}}$ such that

$\displaystyle L = h(D \cap R)$

where ${D = \mathsf{DYCK}(\Gamma)}$ is the Dyck language over ${\Gamma}$, and ${R}$ is a regular language over ${\Gamma}$.

The theorem is, of course, due to Noam Chomsky and Marcel-Paul Schützenberger.
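As a tiny sanity check of the theorem (my own illustration, with an example language of my choosing), take ${\Gamma}$ to be one pair of parentheses, ${R}$ the regular language of all opens followed by closes, and ${h}$ the homomorphism sending ( to a and ) to b; then ${h(D \cap R)}$ should be ${\{a^n b^n\}}$, and brute-force enumeration confirms it for short strings:

```python
from itertools import product

def in_dyck1(s):
    """Membership in DYCK(1): balanced parentheses over one pair."""
    depth = 0
    for c in s:
        depth += 1 if c == "(" else -1
        if depth < 0:
            return False
    return depth == 0

def in_R(s):
    """Membership in the regular language of opens followed by closes."""
    return "(" not in s.lstrip("(")

def h(s):
    """The homomorphism sending ( to a and ) to b."""
    return s.replace("(", "a").replace(")", "b")

# image of D ∩ R under h, over all strings of length at most 8
image = {h(w) for k in range(9)
         for w in map("".join, product("()", repeat=k))
         if in_dyck1(w) and in_R(w)}
expected = {"a" * n + "b" * n for n in range(5)}
```

Here image equals expected, i.e. ${h(D \cap R) = \{a^n b^n\}}$ restricted to short strings.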

This connection raises some interesting questions linking their new streaming result to classic language theory. Can they use their method, for example, to give a streaming bound for all context-free languages?

I will end this discussion by pointing out that I have already discussed a close relative to the Dyck language. Consider the free group on ${2}$ letters. Some words over the generators and their inverses, like ${ab^{-1}ba^{-1}}$, trivially equal ${1}$. Let us call the language defined by such words ${D}$. ${D}$ is like ${\mathsf{DYCK}(2)}$ except that parentheses can cancel in either order: for example,

$\displaystyle ] \, ) \, ( \, [$

is now allowed. Sometimes the language ${D}$ is called, for obvious reasons, the two-sided Dyck language.

Zeke Zalcstein and I showed years ago that the following is true:

Theorem: There is a one-pass randomized streaming algorithm for the language ${D}$ with space ${O(\log n)}$ and time ${\mathsf{polylog}(n)}$. If the stream belongs to the free group then the algorithm accepts it with probability ${1}$; otherwise it rejects it with probability at least ${1-n^{-c}}$ where ${c>0}$ is a constant.

Well not exactly. As Frédéric Magniez, Claire Mathieu, and Ashwin Nayak generously point out, we proved a theorem that is equivalent to this, but we proved our theorem before there was even a streaming model. I discussed our result and its strange history here.
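For flavor, here is a sketch in the spirit of that result (my own reconstruction, not the form in which we proved it): by a theorem of Sanov, the integer matrices with rows (1,2),(0,1) and (1,0),(2,1) generate a free group, so a word over ${a, b, a^{-1}, b^{-1}}$ equals ${1}$ in the free group if and only if the corresponding matrix product is the identity. Streaming the product modulo a prime ${p}$ uses only ${O(\log p)}$ space; a full proof would choose the prime at random to control false accepts, which I elide here.

```python
# Sanov generators of a free subgroup of SL(2, Z), and their inverses;
# matrices are stored row-major as 4-tuples
GEN = {
    "a": (1, 2, 0, 1),    # the matrix [[1, 2], [0, 1]]
    "A": (1, -2, 0, 1),   # its inverse, standing for a^{-1}
    "b": (1, 0, 2, 1),    # the matrix [[1, 0], [2, 1]]
    "B": (1, 0, -2, 1),   # its inverse, standing for b^{-1}
}

P = (1 << 31) - 1  # a fixed prime for illustration; a real proof randomizes

def is_identity_word(stream, p=P):
    """One pass over a word in {a, A, b, B}; accept iff the product of the
    corresponding matrices is the identity modulo p."""
    m = (1, 0, 0, 1)  # running 2x2 product, initially the identity
    for c in stream:
        g = GEN[c]
        m = ((m[0] * g[0] + m[1] * g[2]) % p, (m[0] * g[1] + m[1] * g[3]) % p,
             (m[2] * g[0] + m[3] * g[2]) % p, (m[2] * g[1] + m[3] * g[3]) % p)
    return m == (1, 0, 0, 1)
```

The word abBA cancels to ${1}$ and is accepted; ab is rejected. Entries of the exact integer product can grow exponentially with the input length, which is exactly why the reduction modulo ${p}$ matters for the space bound.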

Open Problems

Is their upper bound tight? More fun might be to try and see what is the streaming complexity of other languages. In particular, what happens for all context-free languages? What about linear ones?

Why is the Dyck language so much harder in the streaming model than the two-sided Dyck language, which corresponds to the free group?

Also their stack idea is quite ingenious—can similar ideas be used in other algorithms?

November 22, 2009 10:58 am

Rather than the one-sided Dyck languages, there is the “hardest” context-free language of Sheila A. Greibach, SIAM J. Comput. 2(4):304–310, 1973.

• November 22, 2009 11:54 am

Yes. Extensions to hardest CFL would be interesting. I would like to point out that hardest CFL is used by Sudborough to give a machine class equivalence of the complexity class LogCFL.

Ivan Hal Sudborough. On the tape complexity of deterministic context-free languages. J. ACM, 25(3):405–414, 1978.

2. November 22, 2009 11:57 am

It’s amazing that mathematicians and scientists are still using TeX/LaTeX in 2009 — especially when there is a much better alternative, namely TeXmacs. TeXmacs is not just a front-end to TeX/LaTeX. Rather it has its own high quality real-time typesetting engine with WYSIWYG editing. See http://www.texmacs.org/Samples/texmacs.pdf

3. November 22, 2009 3:07 pm

The fact that the free group has faithful, finite-dimensional representations deserves to be better known – your DLOG result is very nice. Of course, with a quantum computer one can do the same thing in _constant_ space: just choose two uniformly random unitaries A and B, in any number of dimensions d > 1, apply A, A^-1, B, or B^-1 in the appropriate order, and check to see if you’re close to whatever initial state you started with. If the word is not the identity, then the final state is uniformly random, and so you’ll reject with probability 1-1/d.

I guess the reason this sort of thing doesn’t work for the Dyck language is that semigroups are harder to represent with finite-dimensional matrices than groups are. Does the Dyck semigroup have any faithful finite-dimensional representations?

• November 23, 2009 10:35 pm

To rudely answer my own question: the Dyck semigroup has no faithful finite-dimensional representations, since the ideals starting with ), )), ))), and so on must correspond to an infinite descending sequence of subspaces.

December 29, 2009 4:50 pm

Following an email correspondence with Cris Moore about his comment here, I think we have agreed that his quantum algorithm can’t be carried out in constant space, since the precision required during the application of the randomly selected unitaries increases with input length. In fact, there is a folk theorem that says that no one-pass quantum streaming algorithm can recognize a nonregular language with bounded error.

December 29, 2009 6:25 pm

I did wonder whether it is okay to select matrices with a constant number of bits. Can you still do better than the classical algorithm?

December 30, 2009 4:55 pm

Oops, I should have written “In fact, there is a folk theorem that says that no one-pass quantum streaming algorithm can recognize a nonregular language with bounded error IN CONSTANT SPACE.”.

November 23, 2009 4:03 pm

French is complex, no doubt!!

Speaking of word problems on groups (in this case not a recognition problem but an optimization problem), I have recently been struck by the following result of Jerrum, “The complexity of finding minimum-length generator sequences”: the problem is not NP-complete, it is PSPACE-complete, even when the generator set is restricted to two permutations!

As the author explains, “An interesting feature of this problem is that it does not fall under the headings of ‘two person games’ or ‘formal languages’ which cover the great majority of known PSPACE-complete problems.”

June 23, 2010 6:42 am

I have a question. Does anyone know if real-time Turing machines (one-pass streaming algorithms which spend one step on each input symbol) that use only sublinear workspace are any better than real-time Turing machines with NO worktape?

What if the space bound is sublogarithmic?