A Negative Impossibility Theorem
A positive result on clustering
Ravi Kannan is a long-time friend, a brilliant theorist, and a wonderful speaker. He has won numerous awards, including the Fulkerson Prize and the Knuth Prize, for his ground-breaking research.
Today I wish to talk about a recent presentation that Ravi gave at Tech on clustering, which contained a negative impossibility theorem.
The theorem is from his book with John Hopcroft on algorithms for this century. This is a novel work that focuses on data not algorithms, is a labor of over five years, and may become the algorithm book. I have only had a short chance to look at the book, which is available on-line, but it looks exciting. Here is a quote from their own introduction that may help explain their unique view:
[W]e have written this book to cover the theory likely to be useful in the next 40 years, just as automata theory and related topics gave students an advantage in the last 40 years. One of the major changes is the switch from discrete mathematics to more of an emphasis on probability and statistics. Some topics that have become important are high-dimensional data, large random graphs, singular-value decomposition along with other topics covered in this book.
See their actual book for details.
I cannot resist using a double negative in the title of this discussion. I should avoid that, but oh well. Of course a double negative is when we use two negations within the same sentence. In English they cancel out, but such sentences are considered poor style. Our friends at Wikipedia name Persian, Portuguese, Russian, Spanish, and Ukrainian as languages in which the double negative amplifies the negative content.
French is an intermediate case, and English has flipped from negation being additive to multiplicative. Wikipedia gives these examples from Geoffrey Chaucer’s Canterbury Tales:
About the Friar, he writes “Ther nas no man no wher so vertuous” (“There never was no man nowhere so virtuous”).
About the Knight, “He nevere yet no vileynye ne sayde / In all his lyf unto no maner wight” (“He never yet no vileness didn’t say / In all his life to no manner of man”).
Since we are having some fun let me quote an old joke:
A linguistics professor was lecturing to his class one day. In English, he said, a double negative forms a positive. In some languages though, such as Russian, a double negative is still a negative.
However, he pointed out, there is no language wherein a double positive can form a negative.
A voice from the back of the room piped up, “Yeah, right.”
Languages are complex. So let’s move back to clustering and the safety of more precise mathematical issues.
Just to be clear, leaving our jokes behind, I would like to define what we mean by clustering. Given a set of objects, clustering divides them into a set of disjoint groups so that elements in each group, called a cluster, are more similar to each other than to those in other clusters. Here is a simple example:
Clustering is used wherever there is data: we find clusters used in areas from science and engineering to policy making and politics. Unfortunately clustering itself is a hard area, because there does not seem to be one precise general definition of what a good clustering is. A consequence of this is there are many algorithms for clustering. Many. Wikipedia points out that there are over 100 clustering algorithms that have been published. This could be a low estimate.
An interesting idea might be to try and cluster the clustering algorithms to see You get the idea.
Clustering Is Impossible
Jon Kleinberg is famous for many things, but one is his pretty result on clustering. In the spirit of Kenneth Arrow’s impossibility theorem on good voting systems, Jon has proved that a reasonable clustering system is impossible.
Following Arrow, Jon states three simple axioms, and then shows that there is no clustering method that satisfies the axioms. Of course people run clustering algorithms every day, just as people make decisions via voting all the time, so one must take such negative results with some care. Indeed one of my recurrent points is that we must be very careful how we interpret negative results. See these discussions for some further thoughts.
Let’s turn to look at Jon’s theorem. He assumes that there are a set of objects with a metric defined on pairs of objects. Thus is the “distance” from to for any in . The metric satisfies the usual assumptions: (i) for any the value of is non-negative and is zero only if ; (ii) the distance is symmetric, ; and (iii) the distance satisfies the triangle inequality
Nothing surprising, nothing dangerous, nothing up his sleeve.
Then a clustering method is a function that assigns to any such set with its distance function a partition into clusters. Of course so far this is all easy: just assign everything to the same cluster. That works. But we must have some properties that we insist that the clustering method satisfy. The hope is that the properties, we will call axioms, restrict the method to be “reasonable.”
Scale Invariance: This says that replacing the distance by for any non-zero does not change the resulting clustering.
This seems very reasonable: put another way, you can measure distance in feet or in meters—the resulting clustering will not change. Who can argue with this?
Richness: This says that any partition of can be created by some metric on the points.
Again this seems reasonable. Given the points and a partition, since there is no structure on the points, there should be a metric that yields any partition we like. No? Seems safe, and again who can argue with this?
The final axiom is:
Monotonicity: This says when we shrink distances between points inside a cluster and expand distances between points in different clusters, we get the same result.
See Jon’s paper for a formal definition, and do note that he calls it “consistency.” I think it is more a monotone type property. Again it seems unassailable. Yet.
There is no clustering method that satisfies scale invariance, richness, and monotonicity.
Clustering Is Possible
Enter John and Ravi. They show that Kleinberg’s pretty negative theorem on clustering can be dodged, provided one is willing to change things a bit.
Let’s look at their theorem. For starters they assume that the points are a subset of some finite Euclidean space. The distance function is then, of course, the usual Euclidean distance.
They use the same two axioms that were used before: scale invariance and monotonicity. They replace Jon’s axiom of richness by the following:
Richness II: For any set of distinct points in the given Euclidean space, there is an and a set of of points such that the algorithm on input produces clusters, whose centers are the respective points in .
I by the way do not like using and , but I am following their notation. Again see their book for a more details—page 272.
With this modest alteration, they can now prove:
There is a clustering method that satisfies scale invariance, richness II, and monotonicity.
Thus after a change of no little reasonableness, good clustering is no longer impossible. In particular, this constitutes a proof that it is not possible to prove an impossibility result for clustering under this definition. Well to put it more positively, their proof is an algorithm. I will let the details serve as an advertisement for their book.
I hope you did not dislike this discussion too much. What other algorithms satisfy Richness II?