Making Learning Less Shattering
COLT deadline is Fri. Feb. 7
Columbia University source.
Vladimir Vapnik is one of the founding visionaries of Computational Learning Theory. His papers in 1963 with his advisor Aleksandr Lerner and in 1964 with Aleksey Chervonenkis are considered foundational for the Support Vector Machine model, which Vapnik himself ushered into its modern form in 1995 in joint work with Corinna Cortes. Vapnik and Chervonenkis got their initials into the theory walk of fame with the concept of VC-dimension in statistical classification. A 2008 interview with him titled “Learning Has Just Started” is still featured on the permanent page of the Computational Learning Theory (COLT) conferences and association.
Today Ken and I wish to note that my colleague Nina Balcan is chairing the COLT 2014 program committee, and talk about ways we might simplify learning—at least learning our field.
COLT is being held in Barcelona, Spain, on June 13-15, 2014. This year is the in the series of this wonderful conference on computational aspects of learning theory. I have always liked this area, though I have only published a few papers on learning theory. I have a strange—unique?—view that this area is in many ways the inverse of cryptography. In crypto we try to make it hard to determine information from “experimental data,” while learning is all about making it easy to determine, at least approximately—information again from data. Indeed the main paper I have in a COLT, years ago, was on using the hardness of a learning problem to construct a crypto-system.
Ken and I have been thinking about learning a lot recently, not just about MOOCs or in our classes but actually writing a textbook on quantum computation. We started the text with the idea that we were going to “non-teach” over half of its traditional subject matter, and keep it under 100 pages, to avoid what we hear of as obstacles on the way to its beautiful core. Well some of the traditional elements have crept back in, and it’s pushing 140 pages, but we believe it amplifies by how it simplifies. We non-teach some of complexity theory as well, for instance by not writing the name of any complexity class until the last chapter. Across the board, of course, there is the problem of compacting all the rapidly expanding fields of computer science to keep the undergraduate curriculum manageable.
Shattering and VC
The VC dimension was, of course, the brilliant notion due to Vapnik and Chervonenkis. It is usually defined via the notion of shattering sets. I find this way of defining the VC dimension confusing. Perhaps I am just not an expert, not a regular user of this technology, or just slow. But I always find the usual definition confusing.
Here is one form of it, which we actually take from Wikipedia’s page on the independent and simultaneous ideas of Norbert Sauer and Saharon Shelah published just a year after VC’s paper:
A set is shattered by a family of sets if for every subset of , there is a set in such that . The VC-dimension of is the size of the largest that it shatters.
The word “shatter” was not used originally by VC themselves, at least not in English. John Michael Steele, Professor of Statistics at Wharton Business School, believes that he himself introduced the term. It certainly sticks in the memory. Some terms don’t. Ken completely forgot that he coined the term “span program” when Uwe Schöning visited as his external examiner at Oxford in 1986, until Schöning reminded him over a decade later.
I am probably in the minority being confused by “shattering,” so you are welcome to skip the rest of this discussion. However, I will try to explain my view of VC dimension and hope that it makes it clearer to some, and at least not more confusing to any.
A Little Dimension Theory
What is the dimension of a space? Forgot about VC-dimension for a moment. Let’s ask, what is the dimension of a space? In general how do you define the correct dimension of a space? We expect it to be a whole number—we will not talk about ideas of fractional or negative dimension here.
One property we expect is that the notion be monotone: if if has dimension and , then has dimension at least .
The key point is that we can define our notion first for simple spaces, ones where the “right” dimension value is clear, and extend it bu the monotone rule to all other spaces. This is what we plan to do with VC-dimension.
A Little Learning Learning Theory Theory
For us let the spaces to which we wish to assign dimension comprise sets of functions of the form
Let’s define to be the class of such maps from to . Our goal is to define a notion of dimension for subsets of . How might we do this?
Well for itself the idea is simple: Define the dimension of to be the cardinality of . If is infinite define it to be . This is a pretty simple, pretty natural definition. Then suppose
To define the dimension of , look for the biggest simple subset that it has. That is, look for the biggest such that
Then call the dimension of . Clearly this definition satisfies our basic criteria of giving whole-number values and being monotone. Is it a good definition for the dimension of mappings?
Well I claim that it is just the VC-dimension in disguise. Assuming this claim for a moment, I believe that the definition is completely clear to me now. The set of all maps from to should have the dimension of the size of the set . Seems pretty simple. Then extending it by saying that an arbitrary set of maps gets the dimension of the largest such subset seems natural.
So is this really VC-dimension? Where has the notion of shattering gone?
Example and Reinforcement
Let’s look at an example first. Suppose that is the set of maps from , the plane, to that are defined by a line. Given a line we assign points on or above the line and those below .
We claim that the dimension of is 3. Let be any subset of three points in general position in the plane. Then
This follows by observing that in case of any map that assigns two points one value and the third point the other value, the third point can be separated by a line from the other two, while the two constant maps are given by lines that run above or below. So .
But for any set of four points, if one is interior then the map sending it to and the others to is not determined by a line. Otherwise they form a rhombus, and then the maps giving to one diagonal pair and to the other cannot be defined by one line. So cannot be contained in . As Wikipedia pictures this:
The answer to our question is that we have hidden the notion of “shattering” under the implicit use of the power-set notion when considering all maps from a set . This is what defining does. We prefer this mathematically because it makes the simple cases the fundamental objects.
Admittedly it does not say as much about the power of to classify, which is the subject’s motivation. However we still believe it is good to get the basic mathematical objects down first, and then give the motivation for building on them. At least it is good to have a second perspective, and if the proof of its equivalence to VC-dimension is now clear to you, that acts as a useful reinforcement. As Vapnik himself said in the COLT interview:
For example, when you have a technical description x of the object and have some impression x* about this object you have two forms of description: a formal description and a holistic description or Gestalt description. Using both descriptions during training can help to find a better decision function.
Does this definition and discussion help? Can you see where shattering went?