An approach to clustering based on a new model of complexity

Nina Balcan is one of the top young researchers in computational learning theory. She is not only a strong problem solver, but has a vision of her field that is wonderful.

Today I want to talk about a breakthrough she has made with Avrim Blum and Anupam Gupta. Their work has created an entire new way of looking at problems, and I think it will be considered soon, if not already, as one of the great results of recent times.

Nina was hired over two years ago, then she spent a year at the Microsoft Labs in Boston. We—at Tech—held our collective breath hoping she would leave there and come “back” to Tech. She did, much to our delight.

Classic Complexity Models

One of the main contributions of complexity theory is the formalization of what it means for an algorithm to be efficient. Prior to modern complexity theory, there were algorithms, but there was no way to understand their performance abstractly. Of course, one could use a stop-watch approach, and people did. Algorithms were timed on benchmark data—this is still done today. But, the ability to compare the performance of algorithms as a mathematical notion is, in my opinion, one of the great contributions of the field.

In the following let ${\cal I}$ be a collection of problem instances. As usual ${n}$ will be used to denote the size of the problem.

${\bullet }$ Worst Case Model: This is the classic gold standard model we all know and love. Given an algorithm we judge its performance by how well it does on the worst possible instance ${I \in \cal I}$ of size ${n}$. I have discussed this model before here.

The strength of the model is also its weakness: sometimes problem instances are not worst case. Sometimes, it “feels” that instances have some additional structure or some additional properties—but it is hard to make this “feeling” precise.

${\bullet }$ Average Case Model: This is now a classic model whose formal version is due to Leonid Levin—see the survey by Andrej Bogdanov and Luca Trevisan.
The concept here is that additional structure is captured by assuming the problem instances come from a random distribution.

There are many pretty results in this area, but there seems to be something unrealistic about the model. How often are instances of a problem really selected according to a random distribution? This is the primary difficulty with this model of complexity—it is not easy to see why instances are random.

There are other models, but these are two of the main ones. In the next sections I will discuss the new model due to Nina and her colleagues.

A New Model

Nina has developed a new model that is currently called the BBG model after the three authors of the seminal paper: Balcan, Blum, and Gupta.

${\bullet }$ Shadow Case Model: This is the new model—the name is mine. The central idea is that in practice, problem instances are not selected by an all powerful adversary, nor are they selected by the flip of a coin. BBG argue instances are selected to have a certain formal property that allows approximation algorithms to work. I think we need a better name, but for now I will stay with “the Shadow Case model.”

I have always thought problem instances are neither worst case nor random, but have never been able to find a good replacement. BBG seem to have succeeded in finding the right way to model many situations. At first it seems like their idea might be circular: a problem instance is “good,” provided our algorithms work well on it. That might be true, but it would be useless for building a sound theory. The magic is that BBG have a notion of what a “good” instance is, that they can build a sound mathematical theory on it, that the theory is beautiful, and, even better, that it works in practice. Hard to beat.

Application to Clustering

Here is how they make the Shadow Case model concrete for the classic problem of clustering. As always see their paper for the details and for the generalization to domains other than clustering.

They assume there is a metric space ${X}$ with a distance function ${d}$. You are given ${n}$ points from the space, and the task is to divide the points into ${k}$ sets, called clusters. They assume there is a true clustering into sets,

$\displaystyle \mathbf C = C_{1}, \dots, C_{k}.$

Finding this true clustering is usually very hard, so we almost always settle for a clustering

$\displaystyle \mathbf C' = C'_{1}, \dots, C'_{k}$

that is “near” the true clustering. The notion of near is quite simple—just count the fraction of points that are in the wrong cluster. This induces a natural distance function on set systems. The goal is to find a clustering so the distance to the true one is at most some small ${\varepsilon}$.
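This notion of nearness can be made concrete. Here is a minimal sketch in Python (the function name and the brute-force matching over label permutations are my own; the matching step is exponential in ${k}$, which is fine for small ${k}$):

```python
from itertools import permutations

def clustering_distance(labels_true, labels_found, k):
    """Fraction of points placed in the wrong cluster, minimized over
    all ways of matching the found cluster names to the true ones."""
    n = len(labels_true)
    best_agree = max(
        sum(t == perm[f] for t, f in zip(labels_true, labels_found))
        for perm in permutations(range(k)))
    return 1 - best_agree / n

# Two clusterings of 6 points: the second swaps the label names and
# misplaces one point, so the distance is 1/6.
true_lab  = [0, 0, 0, 1, 1, 1]
found_lab = [1, 1, 1, 0, 0, 1]
print(clustering_distance(true_lab, found_lab, 2))  # about 0.1667
```

The goal in the text is then a clustering whose distance to the true one, under this measure, is at most ${\varepsilon}$.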

So far nothing new, nothing different: this is all standard stuff. Of course the clustering problem as set up is impossible to solve. Not just hard. Impossible, since we have no prior knowledge of the true clustering.

The key to making all this work is the following. Let ${\Phi}$ be any measure on the quality of a particular clustering. Say the Shadow Hypothesis holds for ${\Phi}$ with parameters ${(\alpha,\varepsilon)}$ provided: for any clustering ${\mathbf C' = C'_{1}, \dots, C'_{k}}$ with

$\displaystyle \Phi(\mathbf C') \le \alpha \cdot \mathrm{OPT}_{\Phi},$

the clustering ${\mathbf C'}$ is within ${\varepsilon}$ of the true clustering ${\mathbf C}$.
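On a tiny instance the hypothesis can even be checked by brute force. Here is a sketch, taking ${\Phi}$ to be the ${k}$-median objective with centers chosen from the data; the toy instance, the parameter values, and all names are mine, purely for illustration:

```python
from itertools import product, permutations

# Toy instance: 6 points on a line in two well-separated groups.
pts = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
true_lab = [0, 0, 0, 1, 1, 1]
k = 2

def kmedian_cost(labels):
    """k-median cost: each cluster pays the total distance to its best
    medoid, a center chosen among the cluster's own points."""
    cost = 0.0
    for c in range(k):
        cluster = [p for p, l in zip(pts, labels) if l == c]
        if cluster:
            cost += min(sum(abs(p - m) for p in cluster) for m in cluster)
    return cost

def frac_misplaced(labels):
    """Distance to the true clustering: fraction of misplaced points,
    minimized over relabelings of the k cluster names."""
    return min(sum(t != perm[l] for t, l in zip(true_lab, labels))
               for perm in permutations(range(k))) / len(labels)

all_labelings = list(product(range(k), repeat=len(pts)))
opt = min(kmedian_cost(labs) for labs in all_labelings)

# Shadow Hypothesis with (alpha, eps) = (2, 1/6): every 2-approximation
# of the k-median optimum must be within 1/6 of the true clustering.
alpha, eps = 2.0, 1/6
holds = all(frac_misplaced(labs) <= eps
            for labs in all_labelings
            if kmedian_cost(labs) <= alpha * opt)
print(holds)  # True on this instance
```

Of course the whole point of the model is that this check is infeasible in general; the hypothesis is an assumption about the instance, used only in the analysis.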

There are several remarks I need to make. First, the definition is not meant to be checkable, since it refers to ${\mathrm{OPT}_{\Phi}}$, which is almost always impossible to compute efficiently. Second, they use the assumption in a very clever way. They do not go and find a clustering that is close to the ${\Phi}$ optimum and then move it around somehow. No. They use the assumption purely in the background: the mere existence of the property is used to get an algorithm that works. This is the brilliance of their notion.

The reason I call it the Shadow Case Model should now be clear. It is not that I do not like BBG, but I wanted to highlight that there is something lurking in the “shadows.” The assumption talks about an optimum they never find, and an approximation to that optimum they never find either.

Here is a kind of lemma they can prove using the Shadow Hypothesis.

Lemma: Suppose a ${k}$-median clustering problem instance satisfies the Shadow Hypothesis with parameters ${(\alpha,\varepsilon)}$. Then, at most ${10\varepsilon n/\alpha}$ points have a distance to their true cluster of more than

$\displaystyle \frac{\alpha w_{\mathrm{avg}}}{10 \varepsilon},$

where ${w_{\mathrm{avg}}}$ is the average distance of each point to its true cluster.
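This particular bound, as Avrim Blum notes in the comments, already follows from Markov's inequality and the definition of ${w_{\mathrm{avg}}}$; the Shadow Hypothesis enters in a companion statement about second-closest clusters. Writing ${d_{i}}$ for the distance of point ${i}$ to its true cluster, we have ${\sum_{i} d_{i} = n w_{\mathrm{avg}}}$, so for any threshold ${t>0}$,

$\displaystyle \#\{\, i : d_{i} > t \,\} \le \frac{n w_{\mathrm{avg}}}{t}.$

Taking ${t = \alpha w_{\mathrm{avg}}/(10\varepsilon)}$ gives the count of at most ${10\varepsilon n/\alpha}$ in the lemma.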

Lemmas like this allow them to create simple and fast algorithms that get terrific clustering results. Again the measure ${\Phi}$, the optimal clustering, and even the approximation to it, all remain in the shadows.

This is very neat work—see their paper for how all the magic works.

Open Problems

There are two obvious open problems. What is the right name for this notion? Is BBG the right one? If they use Shadow they can use, perhaps, Shadow the Hedgehog as their logo.

The most important open problem is to start using the model on other areas of theory. This is already well underway, but I expect a flood of papers using their brilliant idea.

Finally, the complexity theorist in me wants to know if there are ways to use their model to define formal complexity classes? Is this possible?

1. May 17, 2010 2:47 pm

Hi Dick, there are many models of “semi-random” instances that are intermediate between worst case and average case, most notably the model of smoothed complexity.

What I find most appealing in their paper (and this gets a bit lost in the level of abstraction you propose) is that, in a clustering problem, we are not really interested in optimizing a cost function; we are interested in finding the clusters. (The cost function is usually sort of reverse-engineered to make the clusters emerge in the optimal and near-optimal solutions.)

So if no good clustering exists, this means that either we are working with the wrong cost function or the data itself has no useful clustering. In such a case, it is not so useful to be able to find near-optimal solutions, and so an algorithm that has a good approximation ratio only on instances that have good solutions is exactly as useful as an algorithm that has a good approximation ratio on all inputs. In many other problems this is not really true, because the cost function represents a “real” cost, and we may be interested in approximate solutions even when we are working with instances that have no “good” solution.

May 17, 2010 4:17 pm

You said it very well. The measures that are introduced to help find the clusters are not as basic as measures in other problems.

Thanks

May 17, 2010 5:20 pm

When you feature a multi-author paper like this, how do you choose which author to picture?

May 18, 2010 8:02 am

Tom,

I picked Nina for two reasons: I just spoke to her about the work, and I have featured Blum before. It is usually a bit random, but I will feature her co-authors another time.

Good question.

May 17, 2010 11:08 pm

What do you think of Existential Complexity as a name? I think it captures the features, has a nice ring to it, is descriptive, and won’t be confused with other jargon.

May 18, 2010 3:16 am

“…sometimes problem instances are not worst case. Sometimes, it “feels” that instances have some additional structure or some additional properties—but it is hard to make this “feeling” precise….”

I think that this is the motivation of the parametrized approach of Downey and Fellows.

May 18, 2010 5:56 am

I think the moral of that paper is that you shouldn’t think too hard about approximation algorithms. They show roughly “if you have a problem for which each and every alpha approximation to the optimum is goodish, then here is a 1-line algorithm that will solve that problem perfectly.”

And I’d guess that the 1-line algorithm doesn’t do that too often. So there’s no point trying to get any old alpha approximation. Much less trying to show hardness of alpha approximation.

For the sake of much of modern theory, we shouldn’t let the paper get any publicity.

May 19, 2010 3:28 pm

I wouldn’t say it’s a 1-line algorithm – more like a 2-stage version of k-means where the first stage gives a somewhat greedy initial clustering and the second stage reclusters by median distance. In any event, it does suggest that for problems where the objective being optimized is really a “measurable proxy” for some other underlying goal, it makes sense to consider the implications of the implicit assumption that the two are related. It also suggests perhaps expanding beyond approximation-ratio as the sole measure of quality for problems of this type. E.g., suppose your algorithm achieves a good objective score *and* produces solutions with some other desirable criterion as well: in that case you are only assuming that solutions of *that* kind are high quality. We’ve been recently trying to look at some extensions of this form. On a different note, I think it might be interesting to consider something of this style for problems of evolutionary tree reconstruction, where again the measurable objectives are typically a proxy for what you really want, namely a correct tree.
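For concreteness, here is a very rough sketch of the 2-stage flavor described above. This is my own illustrative code, not the algorithm from the BBG paper; the greedy ball-seeding heuristic, the threshold tau, and all names are mine:

```python
import statistics

def two_stage_cluster(pts, k, dist, tau):
    """Stage 1: greedily seed k clusters from dense balls of radius tau.
    Stage 2: re-assign every point to the cluster whose seed members are
    closest in median distance."""
    unassigned = set(range(len(pts)))
    seeds = []
    for _ in range(k):
        # Seed on the unassigned point with the most tau-close neighbors.
        best = max(unassigned,
                   key=lambda i: sum(dist(pts[i], pts[j]) <= tau
                                     for j in unassigned))
        ball = {j for j in unassigned if dist(pts[best], pts[j]) <= tau}
        seeds.append(ball)
        unassigned -= ball
    # Stage 2: recluster all points by median distance to each seed set.
    return [min(range(k),
                key=lambda c: statistics.median(dist(p, pts[j])
                                                for j in seeds[c]))
            for p in pts]

pts = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
labels = two_stage_cluster(pts, 2, lambda a, b: abs(a - b), tau=0.5)
print(labels)  # the two groups come out with two distinct labels
```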

Thanks, Dick, for the nice post!

6. May 18, 2010 7:12 am

Hi Dick.

Two quick comments, one technical and one subjective:

Technical: For the key lemma there are actually two parts. The part you listed is the one that holds without assumptions, just by definition of w_avg (by Markov’s inequality). The part that uses the Shadow Hypothesis is the companion statement that says that at most 6*epsilon*n points can have distance to their second-closest cluster center less than (alpha*w_avg)/(2*epsilon). The combination of these two is what then allows one to find the clusters.

Subjective: I’m not sure we really argue that necessarily in practice instances are selected to satisfy the condition. It’s just that the approximation algorithms paradigm for these problems is implicitly assuming that something of this form is true. (Since if it’s false, then why work so hard on approximating the objective!)

Thanks!

May 18, 2010 11:13 am

This reminds me of recent empirical work in protein folding. Rea proteins have a deep well around their minimum-energy configuration; the lowest local minimum has a much higher energy, and configurations whose energy is close to the minimum are in fact close to the configuration in some sense. (Indeed, protein sequences have evolved to have these properties, so that they are easy to fold, and don’t get stuck in local optima.) In a setting like this, any approximation algorithm which gets close to the optimum energy is actually getting close to the optimum configuration.

May 18, 2010 12:15 pm

typo: I meant “real” proteins.

8. May 19, 2010 6:34 am

This great topic unites two of my own (and many people’s) main interests. In line with Christopher Moore’s post on protein folding, for many months my quantum systems engineering colleagues and I have been attending the regular weekly meeting of a synthetic biology group; the idea being to learn (by direct observation) better answers to the Ferengi question: “How does synthetic biology work, exactly?”

The kind of answers we engineers are looking to find have everything to do with Dick’s fine blog in general, and the BBG Shadow Case Model in particular: What kinds of proof technologies—whether explicit or implicit—are synthetic biologists exploiting nowadays? What kinds of dynamical simulation algorithms do they build upon those proofs? And particularly: what new notions of naturality in computational complexity theory can we abstract from this work?

What makes the BBG Shadow Case Model so interesting (to me) is that it points toward new notions of abstractable naturality in complexity theory. Everyone is familiar with the notion of reduction in complexity theory … essentially a certificate that your problem instance might (in the worst case) be encoding an NP-complete problem. The BBG Shadow Hypothesis seems to be a solid step in the opposite direction … a new kind of certificate that the problem instance is not encoding an NP-complete problem.

In practice, “I am not a computation” certificates are ubiquitous in synthetic biology … even though biologists usually do not (as yet) formulate them in a rigorous or abstractably natural way. Typically the certification is associated with a (classical) thermostat process or a (quantum) Lindblad process; in both cases these processes inject enough entropy that the desired “not a computation” dynamics are achieved. Even though synthetic biologists do not think of these noise processes as (effectively) inducing a Shadow Hypothesis, this might IMHO be a very good way to think about it.

It would be terrific for engineers in general, and synthetic biologists in particular, if the methods already in-use were better understood in terms of the elements that complexity theorists hold dear: proof technologies and dynamical simulation algorithms that are solidly grounded in (new?) notions of complexity-theoretic naturality.

My experience has been that it’s best not to approach these problems with too many preconceptions … older textbooks contain plenty of mumpsimus that (for students) can solidify into cognitive obstruction to perceiving what’s going on. That’s why there’s real excitement at frontiers like DisProt: the Database of Protein DISorder … and in the recognition that, even taking the rapidly accelerating progress of synthetic biology into account, we are very far from approaching the (still unknown) limits to observing and simulating the classical and quantum dynamics of complex systems.

We systems engineers urgently need, in particular, a better mathematical toolset for describing and appreciating the “naturality” of these challenges. So thanks for a blog, and a topic, and a BBG model, that is “naturally” of immense interest to us engineers!

9. May 19, 2010 11:30 am

PS: Oh yeah … one other thing we’ve been learning from the synthetic biologists, which connects to Dick’s topic challenge to “start using the (BBG) model on other areas of theory”.

The name synthetic biologists have for this strategy is repurposing … a term that synthetic biologists love with much the same ardor that mathematicians have for naturality. Acting over billions of years, Nature has ubiquitously repurposed almost every genetic and structural motif (often many times over), at every scale of dynamical and informatic complexity, from genes to biomes.

In effect, repurposing is a strategy that Nature *wants* synthetic biologists to adopt, in the same way that Nature (in effect) *wants* mathematical frameworks to be natural.

Although they don’t mean quite the same thing, “naturality” and “repurposing” *do* dovetail nicely, in the sense that (for example) a mathematical measure of complexity is natural if it can be repurposed to serve many practical applications. These are two concrete directions in which (hopefully) the ideas of the BBG article can be developed, and more broadly, they are two concrete reasons why the topics in this blog are (usually) so very fruitful from a systems engineering point-of-view.

So, thanks again for a *wonderful* post, about a wonderful article!

May 20, 2010 4:33 am

It strikes me that this has similarities to the idea of adaptive algorithms, where (usually) the time complexity is a function of some “difficulty measure” of the instance. Most of the work on adaptive algorithms has been in sorting and searching, but maybe it is time to do similar things for other types of problems. Another term that has been used is “opportunistic algorithms” that take advantage of features of the input. Not sure if any of these terms really capture the new idea here, though.

12. May 25, 2010 11:24 am

There was a paper in ICS 2010 that seems to do something similar for MaxCut: “Are Stable Instances Easy?” by Yonatan Bilu and Nathan Linial. It argues that if one MaxCut (which can be thought of as clustering a graph into 2 clusters) stands out as the best even when the graph is perturbed a little, then it can be found. Paper available here: http://www.cs.huji.ac.il/~nati/PAPERS/stable_instance.pdf

April 7, 2012 11:58 pm

Hi, Dick. I read this paper recently and have been considering other problems related to this model. For example, the paper ‘A Randomized Rounding Approach to the Traveling Salesman Problem’ from FOCS 2011, which won the Best Paper award, breaks the 1.5 ratio for TSP by sampling a random spanning tree. I think that if we assume any c-approximate solution for TSP is \epsilon-close to the optimal solution (the ground truth), a smaller ratio may be achievable. Thanks.