# Making Learning Less Shattering

*COLT deadline is Fri. Feb. 7*

*Columbia University source*

Vladimir Vapnik is one of the founding visionaries of Computational Learning Theory. His papers in 1963 with his advisor Aleksandr Lerner and in 1964 with Aleksey Chervonenkis are considered foundational for the Support Vector Machine model, which Vapnik himself ushered into its modern form in 1995 in joint work with Corinna Cortes. Vapnik and Chervonenkis got their initials into the theory walk of fame with the concept of VC-dimension in statistical classification. A 2008 interview with him titled “Learning Has Just Started” is still featured on the permanent page of the Computational Learning Theory (COLT) conferences and association.

Today Ken and I wish to note that my colleague Nina Balcan is chairing the COLT 2014 program committee, and talk about ways we might simplify learning—at least learning our field.

COLT is being held in Barcelona, Spain, on June 13-15, 2014. This year is the 27th in the series of this wonderful conference on computational aspects of learning theory. I have always liked this area, though I have only published a few papers on learning theory. I have a strange—unique?—view that this area is in many ways the inverse of cryptography. In crypto we try to make it hard to determine information from “experimental data,” while learning is all about making it easy to determine information—at least approximately—from data. Indeed the main paper I have in a COLT, years ago, was on using the hardness of a learning problem to construct a crypto-system.

Ken and I have been thinking about learning a lot recently, not just about MOOCs or in our classes but actually writing a textbook on quantum computation. We started the text with the idea that we were going to “non-teach” over half of its traditional subject matter, and keep it under 100 pages, to avoid what we hear of as obstacles on the way to its beautiful core. Well some of the traditional elements have crept back in, and it’s pushing 140 pages, but we believe it *amplifies* by how it *simplifies*. We non-teach some of complexity theory as well, for instance by not writing the name of any complexity class until the last chapter. Across the board, of course, there is the problem of compacting all the rapidly expanding fields of computer science to keep the undergraduate curriculum manageable.

## Shattering and VC

The VC dimension was, of course, the brilliant notion due to Vapnik and Chervonenkis. It is usually defined via the notion of shattering sets, and I find that way of defining it confusing. Perhaps I am just not an expert, not a regular user of this technology, or just slow, but the usual definition always trips me up.

Here is one form of it, which we actually take from Wikipedia’s page on the independent and simultaneous ideas of Norbert Sauer and Saharon Shelah published just a year after VC’s paper:

> A set $C$ is *shattered* by a family $H$ of sets if for every subset $c$ of $C$, there is a set $h$ in $H$ such that $c = h \cap C$. The VC-dimension of $H$ is the size of the largest set $C$ that it shatters.
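To make the definition concrete, here is a small brute-force sketch (illustrative choices not from the post: Python, and a toy family of “intervals” over a six-point universe) that checks shattering directly and computes the VC-dimension of a finite family:

```python
from itertools import combinations

def shatters(family, C):
    """Check that every subset of C arises as h & C for some h in family."""
    achieved = {frozenset(h & C) for h in family}
    return all(frozenset(s) in achieved
               for r in range(len(C) + 1)
               for s in combinations(C, r))

def vc_dimension(family, universe):
    """Size of the largest shattered subset of the universe (brute force)."""
    best = 0
    for r in range(len(universe) + 1):
        for C in combinations(universe, r):
            if shatters(family, set(C)):
                best = max(best, r)
    return best

universe = set(range(6))
# "Intervals" {a, a+1, ..., b-1} over the universe, including the empty set.
intervals = [set(range(a, b)) for a in range(7) for b in range(a, 7)]
print(vc_dimension(intervals, universe))  # intervals shatter any 2 points
                                          # but never 3, so this prints 2
```

Intervals can pick out any single point and any pair, but can never pick out the outer two of three points without the middle one, which is why the brute force stops at 2.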

The word “shatter” was not used originally by VC themselves, at least not in English. John Michael Steele, Professor of Statistics at Wharton Business School, believes that he himself introduced the term. It certainly sticks in the memory. Some terms don’t. Ken completely forgot that he coined the term “span program” when Uwe Schöning visited as his external examiner at Oxford in 1986, until Schöning reminded him over a decade later.

I am probably in the minority being confused by “shattering,” so you are welcome to skip the rest of this discussion. However, I will try to explain my view of VC dimension and hope that it makes it clearer to some, and at least not more confusing to any.

Let’s try.

## A Little Dimension Theory

What is the dimension of a space? Forget about VC-dimension for a moment. Let’s ask, what is the dimension of a space? In general how do you define the correct dimension of a space? We expect it to be a whole number—we will not talk about ideas of fractional or negative dimension here.

One property we expect is that the notion be **monotone**: if $B$ has dimension $d$ and $B \subseteq A$, then $A$ has dimension at least $d$.

The key point is that we can define our notion first for simple spaces, ones where the “right” dimension value is clear, and extend it by the monotone rule to all other spaces. This is what we plan to do with VC-dimension.

## A Little Learning Theory

For us let the spaces to which we wish to assign dimension comprise sets of functions of the form

$$f : X \longrightarrow \{0,1\}.$$

Let’s define $\mathrm{MAP}(X)$ to be the class of all such maps from $X$ to $\{0,1\}$. Our goal is to define a notion of dimension for subsets of $\mathrm{MAP}(X)$. How might we do this?

Well for $\mathrm{MAP}(X)$ itself the idea is *simple*: Define the dimension of $\mathrm{MAP}(X)$ to be the cardinality of $X$. If $X$ is infinite, define it to be $\infty$. This is a pretty simple, pretty natural definition. Then suppose

$$F \subseteq \mathrm{MAP}(X).$$

To define the dimension of $F$, look for the biggest *simple* subset that it has. That is, look for the biggest $Y \subseteq X$ such that

$$\mathrm{MAP}(Y) \subseteq F.$$

Then call $|Y|$ the dimension of $F$. Clearly this definition satisfies our basic criteria of giving whole-number values and being monotone. Is it a good definition for the dimension of sets of mappings?
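As a sanity check on this definition, the brute-force sketch below computes the dimension of a family $F$ of maps (illustrative choices not from the post: Python dicts for the maps, and a family of threshold functions; as one must formally, $\mathrm{MAP}(Y) \subseteq F$ is tested via the restrictions of $F$’s members to $Y$):

```python
from itertools import combinations

def dim_via_maps(F, X):
    """Largest |Y|, Y subset of X, such that the restrictions of the maps
    in F to Y give every one of the 2^|Y| maps from Y to {0,1}."""
    X = sorted(X)
    best = 0
    for r in range(len(X) + 1):
        for Y in combinations(X, r):
            restrictions = {tuple(f[x] for x in Y) for f in F}
            if len(restrictions) == 2 ** r:  # F restricted to Y is all of MAP(Y)
                best = max(best, r)
    return best

X = range(4)
# Threshold maps f_t(x) = 1 iff x >= t: a simple monotone family.
F = [{x: int(x >= t) for x in X} for t in range(5)]
print(dim_via_maps(F, X))  # one point can get both labels, but no pair can
                           # get the pattern (1, 0), so this prints 1
```

For the full class $\mathrm{MAP}(X)$ the same function returns $|X|$, matching the base case of the definition.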

Well I claim that it is just the VC-dimension in disguise. Assuming this claim for a moment, I believe that the definition is completely clear to me now. The set of all maps from $X$ to $\{0,1\}$ should have dimension equal to the size of the set $X$. Seems pretty simple. Then extending it by saying that an arbitrary set of maps gets the dimension of the largest such subset seems natural.

So is this really VC-dimension? Where has the notion of shattering gone?

## Example and Reinforcement

Let’s look at an example first. Suppose that $F$ is the set of maps from $\mathbb{R}^2$, the plane, to $\{0,1\}$ that are defined by a line: given a line $\ell$, we assign $1$ to points on or above $\ell$ and $0$ to those below it.

We claim that the dimension of $F$ is $3$. Let $Y$ be any subset of three points in general position in the plane. Then

$$\mathrm{MAP}(Y) \subseteq F.$$

This follows by observing that in case of any map that assigns two points one value and the third point the other value, the third point can be separated by a line from the other two, while the two constant maps are given by lines that run above or below all three points. So $\dim(F) \geq 3$.

But for any set $Y$ of four points, if one is interior to the triangle formed by the others, then the map sending it to $1$ and the others to $0$ is not determined by a line. Otherwise the points form a convex quadrilateral, and then the maps giving $1$ to one diagonal pair and $0$ to the other cannot be defined by one line. So $\mathrm{MAP}(Y)$ cannot be contained in $F$. Wikipedia’s page on VC dimension pictures this.
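This claim can be checked mechanically. The sketch below (Python; the coordinates and the nudge parameter are illustrative choices) enumerates the labelings of a small point set realizable by closed halfplanes, using candidate lines through pairs of points nudged slightly along their normals, which suffices for points in general position:

```python
from itertools import combinations, product

def halfplane_labelings(points, eps=1e-9):
    """All 0/1 labelings of `points` realizable by a closed halfplane
    a*x + b*y + c >= 0.  Candidate lines pass through pairs of points,
    nudged by +/-eps along the normal; for small sets in general
    position a realizing line can always be moved into this form."""
    n = len(points)
    labelings = {tuple([0] * n), tuple([1] * n)}   # constant maps
    for (px, py), (qx, qy) in combinations(points, 2):
        a, b = py - qy, qx - px                    # normal to the line through p, q
        c = -(a * px + b * py)
        for sgn, shift in product((1, -1), (-eps, 0.0, eps)):
            labelings.add(tuple(1 if sgn * (a * x + b * y + c) + shift >= 0 else 0
                                for (x, y) in points))
    return labelings

def shattered(points):
    return len(halfplane_labelings(points)) == 2 ** len(points)

print(shattered([(0, 0), (1, 0), (0, 1)]))            # general position -> True
print(shattered([(0, 0), (1, 1), (2, 0), (1, -1)]))   # rhombus -> False
```

Every labeling the enumeration produces is realized by an actual halfplane, so the four-point answer `False` is reliable; the three-point case reaches all $2^3 = 8$ labelings.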

The answer to our question is that we have hidden the notion of “shattering” under the implicit use of the power-set notion when considering all maps from a set $Y$. This is what defining $\mathrm{MAP}(Y)$ does. We prefer this mathematically because it makes the simple cases the fundamental objects.

Admittedly it does not say as much about the power of $F$ to *classify*, which is the subject’s motivation. However we still believe it is good to get the basic mathematical *objects* down first, and then give the motivation for building on them. At least it is good to have a second perspective, and if the proof of its equivalence to VC-dimension is now clear to you, that acts as a useful *reinforcement*. As Vapnik himself said in the COLT interview:

> For example, when you have a technical description of the object and have some impression about this object, you have two forms of description: a formal description and a holistic description, or *Gestalt* description. Using both descriptions during training can help to find a better decision function.

## Open Problems

Does this definition and discussion help? Can you see where shattering went?

Dear Drs. Lipton and Regan,

Making Mathematics more shattering, with the size of the Hilbert Hotel’s Computer and the inconsistency of the Zermelo-Fraenkel Set Theory (ZFC): http://www.andrebarbosa.eti.br/The_Size_of_the_Hilbert_Hotel_Computer.pdf

Nice post. What is then especially remarkable is that having a low dimension is *sufficient* to get positive results for learning. At least in a worst-case sense, it is clear that if you have fewer points than the VC-dimension, you will be in trouble (e.g., if your data points are uniformly chosen from some Y s.t. MAP(Y) \subseteq F and the target f is uniformly chosen from MAP(Y), then you have no hope to predict f better than guessing on any new point you haven’t seen yet). What the VC-dimension upper bounds say is that in a sense this is the only barrier: for any data distribution, and any unknown target in F, once you get sufficiently more than VC-dimension examples, you *can* predict well on new points.

This is wonderful news! A great subject can raise mortal writers to divine heights, and great writers can raise ordinary subjects to divine heights … and therefore (as it seems to me) this book promises to be *doubly* terrific. Now my gift-giving problems are solved!

That’s an interesting and elegant view of VC-dimension! I wonder if there is a similar interpretation of the Rademacher Complexity measuring the dimension of functions with real values.

yingyuliang

I am not sure, but nice question. We will think about it.

I had already understood VC-dimension the usual way, but I think this was very helpful. In particular, I have been asked in conversation, “What’s that VC dimension thing all about?” and found myself stumbling over my words. This version is nice in that you can explain it without paper and pencil, and it has all the intuitive value of such things.

It sounds nice. However, one should note that this definition of the VCDim is problematic from a formalist point of view. MAP(X) is defined as a set of functions over the domain set X. MAP(Y) should therefore consist of different objects (formally) – functions over the domain Y. Therefore, for any Y different than X, a subset of MAP(X) will never contain the set MAP(Y). To fix it, one has to talk about restrictions of functions to subdomains, and that may spoil the attractive simplicity of the proposed view of the VCdim.

Interesting point. In object-oriented type theory, extending the domain of the arguments *specializes* a function/method.

Dear Prof. Lipton,

Nice post. Just a few observations.

Prof. Steele goes by his middle name. In fact until I read your post I did not know that the J in his name stood for “John.” He is definitely the one who invented the phrase “shattering.”

The VC-dimension is for binary-valued mappings, as you have observed. For real-valued functions there are other dimensions, including (I kid you not) the “fat-shattering dimension.” Sounds like a diet, doesn’t it? (Melt those pounds away with our low fat-shattering dimension!)

To me the key point about VC-dimension is not the dimension itself, but what happens when one tries to pick out subsets of some set S whose cardinality is larger than d, the VC-dimension. This is where Sauer’s lemma tells us that the number cannot be larger than (en/d)^d, where n = |S| and e is the base of the natural logarithm. Since there are 2^n subsets of S, the fraction of subsets of S that can be picked out by intersecting with the collection of sets of VC-dimension d asymptotically goes to zero as n gets larger. This is very useful in trying to determine whether a large set of points actually has some structure.

To illustrate, the set of half-spaces in k-dimensional space has VC-dimension k+1. Suppose you have a set of n points in k-dimensional space, with labels of +1 or -1 assigned to them. Then, out of the 2^n possible labellings, at most (en/d)^d where d = k+1 are “linearly separable”, that is, have all the +1 points on one side and the -1 points on the other side. So if you do manage to find that your set of labelled data is linearly separable, you can assert that the likelihood of this happening by pure chance is no more than (en/d)^d, or if n is large enough, the data really does have some “structure.” The argument can be extended easily to the situation where “most” of the n points are correctly classified.

Sorry for the “lecture” and I hope I have contributed something useful. And I really do enjoy your blog.

Apologies! In the next to last paragraph, the probability is (en/d)^d 2^-n, but people would have figured that out.
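As a numeric illustration of the bounds in this thread (a sketch; n = 100 and k = 2 are arbitrary sample values, and `sauer_bound` is the exact sum form of Sauer’s lemma rather than the (en/d)^d estimate):

```python
from math import comb, e

def sauer_bound(n, d):
    """Exact Sauer-Shelah count: a family of VC-dimension d picks out
    at most sum_{i <= d} C(n, i) of the 2^n subsets of an n-point set."""
    return sum(comb(n, i) for i in range(d + 1))

n, k = 100, 2       # n labelled points in k dimensions (arbitrary sample values)
d = k + 1           # VC-dimension of half-spaces in k dimensions
exact = sauer_bound(n, d)
loose = (e * n / d) ** d        # the (en/d)^d estimate
prob = loose / 2 ** n           # chance a random labelling is linearly separable
print(exact, loose, prob)
```

Even the loose (en/d)^d bound divided by 2^100 is astronomically small, which is the sense in which accidental linear separability is ruled out.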

Thanks Dick and Ken for your post! We got a record number of submissions on a wide variety of interesting topics.