
Making Learning Less Shattering

January 19, 2014


COLT deadline is Fri. Feb. 7

[Photo of Vladimir Vapnik. Columbia University source.]

Vladimir Vapnik is one of the founding visionaries of Computational Learning Theory. His papers in 1963 with his advisor Aleksandr Lerner and in 1964 with Aleksey Chervonenkis are considered foundational for the Support Vector Machine model, which Vapnik himself ushered into its modern form in 1995 in joint work with Corinna Cortes. Vapnik and Chervonenkis got their initials into the theory walk of fame with the concept of VC-dimension in statistical classification. A 2008 interview with him titled “Learning Has Just Started” is still featured on the permanent page of the Computational Learning Theory (COLT) conferences and association.

Today Ken and I wish to note that my colleague Nina Balcan is chairing the COLT 2014 program committee, and talk about ways we might simplify learning—at least learning our field.

COLT is being held in Barcelona, Spain, on June 13-15, 2014. This year is the {27^{th}} in the series of this wonderful conference on computational aspects of learning theory. I have always liked this area, though I have only published a few papers on learning theory. I have a strange—unique?—view that this area is in many ways the inverse of cryptography. In crypto we try to make it hard to determine information from “experimental data,” while learning is all about making it easy to determine information, at least approximately, from data. Indeed the main paper I have had in COLT, years ago, was on using the hardness of a learning problem to construct a crypto-system.

Ken and I have been thinking about learning a lot recently, not just about MOOCs or in our classes, but also while actually writing a textbook on quantum computation. We started the text with the idea that we were going to “non-teach” over half of its traditional subject matter, and keep it under 100 pages, to avoid what we hear are obstacles on the way to its beautiful core. Well, some of the traditional elements have crept back in, and it is pushing 140 pages, but we believe it amplifies by how it simplifies. We non-teach some of complexity theory as well, for instance by not writing the name of any complexity class until the last chapter. Across the board, of course, there is the problem of compacting all the rapidly expanding fields of computer science to keep the undergraduate curriculum manageable.

Shattering and VC

The VC dimension was, of course, the brilliant notion due to Vapnik and Chervonenkis. It is usually defined via the notion of shattering sets. Perhaps I am just not an expert, not a regular user of this technology, or just slow, but I have always found this way of defining the VC dimension confusing.

Here is one form of it, which we actually take from Wikipedia’s page on the independent and simultaneous ideas of Norbert Sauer and Saharon Shelah published just a year after VC’s paper:

A set {T} is shattered by a family {\cal F} of sets if for every subset {T'} of {T}, there is a set {S} in {\cal F} such that {T' = S \cap T}. The VC-dimension of {\cal F} is the size of the largest {T} that it shatters.
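To make the definition concrete, here is a minimal brute-force sketch in Python; the function names are our own, and it is only sensible for tiny finite families.

from itertools import chain, combinations

def subsets(iterable):
    # All subsets of a finite iterable, as frozensets.
    s = list(iterable)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

def is_shattered(T, family):
    # T is shattered by family if every subset T' of T arises as S & T for some S in family.
    traces = {frozenset(S) & frozenset(T) for S in family}
    return all(Tp in traces for Tp in subsets(T))

def vc_dimension(ground_set, family):
    # Size of the largest subset of ground_set that family shatters.
    return max(len(T) for T in subsets(ground_set) if is_shattered(T, family))

# Example: intervals over {0,...,4} shatter any two points but no three,
# so their VC-dimension is 2.
intervals = [set(range(i, j)) for i in range(6) for j in range(i, 6)]
print(vc_dimension(range(5), intervals))  # -> 2

The enumeration is exponential in the size of the ground set, which is fine for an illustration but nothing more.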

The word “shatter” was not used originally by VC themselves, at least not in English. John Michael Steele, Professor of Statistics at Wharton Business School, believes that he himself introduced the term. It certainly sticks in the memory. Some terms don’t. Ken completely forgot that he coined the term “span program” when Uwe Schöning visited as his external examiner at Oxford in 1986, until Schöning reminded him over a decade later.

I am probably in the minority in being confused by “shattering,” so you are welcome to skip the rest of this discussion. However, I will try to explain my view of VC dimension, and hope that it makes the notion clearer to some, and at least not more confusing to any.

Let’s try.

A Little Dimension Theory

What is the dimension of a space? Forget about VC-dimension for a moment, and ask: in general, how do you define the correct dimension of a space? We expect it to be a whole number—we will not talk about ideas of fractional or negative dimension here.

One property we expect is that the notion be monotone: if {S} has dimension {d} and {S \subseteq T}, then {T} has dimension at least {d}.

The key point is that we can define our notion first for simple spaces, ones where the “right” dimension value is clear, and extend it by the monotone rule to all other spaces. This is what we plan to do with VC-dimension.

A Little Learning Learning Theory Theory

For us let the spaces to which we wish to assign dimension comprise sets of functions of the form

\displaystyle  f: X \rightarrow \{0,1\}.

Let’s define {\mathsf{MAP}(X)} to be the class of such maps from {X} to {\{0,1\}}. Our goal is to define a notion of dimension for subsets of {\mathsf{MAP}(X)}. How might we do this?

Well for {\mathsf{MAP}(X)} itself the idea is simple: Define the dimension of {\mathsf{MAP}(X)} to be the cardinality of {X}. If {X} is infinite define it to be {\infty}. This is a pretty simple, pretty natural definition. Then suppose

\displaystyle  {\cal F} \subseteq \mathsf{MAP}(X).

To define the dimension of {{\cal F}}, look for the biggest simple subset that it has. That is, look for the biggest {Y} such that

\displaystyle  \mathsf{MAP}(Y) \subseteq {\cal F}.

Then call {|Y|} the dimension of {\cal F}. Clearly this definition satisfies our basic criteria of giving whole-number values and being monotone. Is it a good definition for the dimension of mappings?

Well I claim that it is just the VC-dimension in disguise. Granting this claim for a moment, the definition now seems completely clear to me. The set of all maps from {X} to {\{0,1\}} should have dimension equal to the size of the set {X}. Seems pretty simple. Then extending it by saying that an arbitrary set of maps gets the dimension of the largest such subset it contains seems natural.
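Here is a small sketch of this view, again with names of our own choosing. To make {\mathsf{MAP}(Y) \subseteq {\cal F}} formally precise we compare restrictions of the maps in {\cal F} to {Y}, and one can see that “every map on {Y} is a restriction of a member of {\cal F}” is exactly the statement that {\cal F} shatters {Y}.

from itertools import combinations, product

def all_maps(Y):
    # MAP(Y): every function from the finite set Y to {0,1}, as a dict.
    Y = list(Y)
    return [dict(zip(Y, bits)) for bits in product((0, 1), repeat=len(Y))]

def dim_via_maps(X, family):
    # Largest |Y| with MAP(Y) contained in the restrictions of members of family to Y.
    for k in range(len(X), -1, -1):
        for Y in combinations(X, k):
            restrictions = {tuple(f[y] for y in Y) for f in family}
            if all(tuple(g[y] for y in Y) in restrictions for g in all_maps(Y)):
                return k
    return 0

# Example: threshold maps f_t(x) = 1 iff x >= t on X = {0,1,2,3}.
# They realize every map on a single point, but no map sending a smaller point
# to 1 and a larger one to 0, so the dimension (equivalently the VC-dimension) is 1.
X = [0, 1, 2, 3]
thresholds = [{x: int(x >= t) for x in X} for t in range(5)]
print(dim_via_maps(X, thresholds))  # -> 1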

So is this really VC-dimension? Where has the notion of shattering gone?

Example and Reinforcement

Let’s look at an example first. Suppose that {\cal F} is the set of maps from {\mathbb{R}^{2}}, the plane, to {\{0,1\}} that are defined by a line. Given a line we assign points on or above the line {1} and those below {0}.

We claim that the dimension of {\cal F} is 3. Let {Y} be any subset of three points in general position in the plane. Then

\displaystyle  \mathsf{MAP}(Y) \subset {\cal F}.

This follows by observing that for any map {f: Y \rightarrow \{0,1\}} that assigns two points one value and the third point the other value, the third point can be separated by a line from the other two, while the two constant maps are given by lines that run above or below all three points. So {\mathsf{MAP}(Y) \subset {\cal F}}.

But for any set {Z} of four points, if one lies in the convex hull of the other three, then the map sending it to {1} and the others to {0} is not determined by a line. Otherwise the four points are in convex position, forming a convex quadrilateral, and then the maps giving {1} to one diagonally opposite pair and {0} to the other cannot be defined by one line. So {\mathsf{MAP}(Z)} cannot be contained in {{\cal F}}. As Wikipedia pictures this:

[Wikipedia’s illustration: three points shattered by lines, and four points with a diagonal labeling that no single line can realize.]
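If you wish to check such claims mechanically, here is a sketch that tests every labeling of a point set for strict linear separability via a small feasibility LP; the use of scipy is our own choice, and for points in general position strict separation agrees with the on-or-above convention used above.

import numpy as np
from itertools import product
from scipy.optimize import linprog

def separable(points, labels):
    # Feasibility LP over (w1, w2, b): ask for w.x + b >= 1 on label-1 points
    # and <= -1 on label-0 points; scaling the margin to 1 loses no generality.
    signs = np.array([1 if y == 1 else -1 for y in labels])
    A_ub = -signs[:, None] * np.c_[np.array(points, dtype=float), np.ones(len(points))]
    res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=-np.ones(len(points)),
                  bounds=[(None, None)] * 3)
    return res.status == 0  # status 0 means a feasible separating line was found

def shattered_by_lines(points):
    return all(separable(points, labels)
               for labels in product((0, 1), repeat=len(points)))

print(shattered_by_lines([(0, 0), (1, 0), (0, 1)]))          # True: three points in general position
print(shattered_by_lines([(0, 0), (1, 0), (0, 1), (1, 1)]))  # False: the two diagonal labelings fail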


The answer to our question is that we have hidden the notion of “shattering” under the implicit use of the power-set notion when considering all maps from a set {X}. This is what defining {\mathsf{MAP}(X)} does. We prefer this mathematically because it makes the simple cases {\mathsf{MAP}(X)} the fundamental objects.

Admittedly it does not say as much about the power of {{\cal F}} to classify, which is the subject’s motivation. However we still believe it is good to get the basic mathematical objects down first, and then give the motivation for building on them. At least it is good to have a second perspective, and if the proof of its equivalence to VC-dimension is now clear to you, that acts as a useful reinforcement. As Vapnik himself said in the COLT interview:

For example, when you have a technical description x of the object and have some impression x* about this object you have two forms of description: a formal description and a holistic description or Gestalt description. Using both descriptions during training can help to find a better decision function.

Open Problems

Does this definition and discussion help? Can you see where shattering went?

12 Comments
  1. January 21, 2014 3:09 pm

    Dear Drs. Lipton and Regan,

    Making Mathematics more shattering, with the size of the Hilbert Hotel’s Computer and the inconsistency of the Zermelo-Fraenkel Set Theory (ZFC): http://www.andrebarbosa.eti.br/The Size of the Hilbert Hotel Computer.pdf

  2. January 21, 2014 5:14 pm

    Nice post. What is then especially remarkable is that having a low dimension is *sufficient* to get positive results for learning. At least in a worst-case sense, it is clear that if you have fewer points than the VC-dimension, you will be in trouble (e.g., if your data points are uniformly chosen from some Y s.t. MAP(Y) \subseteq F and the target f is uniformly chosen from MAP(Y), then you have no hope to predict f better than guessing on any new point you haven’t seen yet). What the VC-dimension upper bounds say is that in a sense this is the only barrier: for any data distribution, and any unknown target in F, once you get sufficiently more than VC-dimension examples, you *can* predict well on new points.

  3. January 21, 2014 7:05 pm

    Ken [Regan] and I [Dick Lipton] have been actually writing a textbook on quantum computation.

    This is wonderful news! A great subject can raise mortal writers to divine heights, and great writers can raise ordinary subjects to divine heights … and therefore (as it seems to me) this book promises to be doubly terrific. Now my gift-giving problems are solved!

  4. January 22, 2014 12:16 am

    That’s an interesting and elegant view of VC-dimension! I wonder if there is a similar interpretation of the Rademacher Complexity measuring the dimension of functions with real values.

    • rjlipton permalink*
      January 22, 2014 9:37 am

      yingyuliang

      I am not sure, but nice question. We will think about it.

  5. January 22, 2014 11:54 am

    I had already understood VC-dimension the usual way, but I think this was very helpful. In particular, I have been asked in conversation, “What’s that VC dimension thing all about?” and found myself stumbling over my words. This version is nice in that you can explain it without paper and pencil, and it has all the intuitive value of such things.

  6. shai permalink
    January 28, 2014 12:22 am

    It sounds nice. However, one should note that this definition of the VCDim is problematic from a formalist point of view. MAP(X) is defined as a set of functions over the domain set X. MAP(Y) should therefore consist of different objects (formally) – functions over the domain Y. Therefore, for any Y different than X, a subset of MAP(X) will never contain the set MAP(Y). To fix it, one has to talk about restrictions of functions to subdomains, and that may spoil the attractive simplicity of the proposed view of the VCdim.

    • January 28, 2014 8:59 pm

      Interesting point. In object-oriented type theory, extending the domain of the arguments specializes a function/method.

  7. February 5, 2014 6:46 pm

    Dear Prof. Lipton,

    Nice post. Just a few observations.

    Prof. Steele goes by his middle name. In fact until I read your post I did not know that the J in his name stood for “John.” He is definitely the one who invented the phrase “shattering.”

    The VC-dimension is for binary-valued mappings, as you have observed. For real-valued functions there are other dimensions, including (I kid you not) the “fat-shattering dimension.” Sounds like a diet, doesn’t it? (Melt those pounds away with our low fat-shattering dimension!)

    To me the key point about VC-dimension is not the dimension itself, but what happens when one tries to pick out subsets of some set S whose cardinality is larger than d, the VC-dimension. This is where Sauer’s lemma tells us that the number cannot be larger than (en/d)^d, where n = |S| and e is the base of the natural logarithm. Since there are 2^n subsets of S, the fraction of subsets of S that can be picked out by intersecting with the collection of sets of VC-dimension d asymptotically goes to zero as n gets larger. This is very useful in trying to determine whether a large set of points actually has some structure.

    To illustrate, the set of half-planes in k-dimensional space has VC-dimension k+1. Suppose you have a set of n points in k-dimensional space, with labels of +1 or -1 assigned to them. Then, out of the 2^n possible labellings, at most (en/d)^d where d = k+1 are “linearly separable”, that is, have all the +1 points on one side and the -1 points on the other side. So if you do manage to find that your set of labelled data is linearly separable, you can assert that the likelihood of this happening by pure chance is no more than (en/d)^d, or if n is large enough, the data really does have some “structure.” The argument can be extended easily to the situation where “most” of the n points are correctly classified.

    Sorry for the “lecture” and I hope I have contributed something useful. And I really do enjoy your blog.

  8. February 5, 2014 6:53 pm

    Apologies! In the next to last paragraph, the probability is (en/d)^d 2^-n, but people would have figured that out.
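A quick numerical illustration of the corrected bound (the choice k = 2 and the sample sizes below are arbitrary):

import math

def chance_separable(n, k):
    # Sauer-lemma style bound on the fraction of labelings of n points in
    # k-dimensional space realizable by half-spaces (d = k + 1): (en/d)^d * 2^(-n).
    d = k + 1
    return (math.e * n / d) ** d * 2 ** (-n)

for n in (10, 20, 50, 100):
    print(n, chance_separable(n, 2))
# roughly 0.73, 0.006, 8e-11, 6e-25: the chance of accidental separability
# collapses quickly once n exceeds the VC-dimension.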

  9. Nina Balcan permalink
    February 14, 2014 12:21 am

    Thanks Dick and Ken for your post! We got a record number of submissions on a wide variety of interesting topics.
