High Dimensional Search and the NN Problem
The early days of computational geometry and the nearest neighbor problem
David Dobkin is one of the founders of Computational Geometry (CG). This initial work was done while we both were assistant professors at Yale, back in the mid 1970′s–a mere three decades ago. David also did seminal work on matrix multiplication for his PhD. under Roger Brockett at Harvard. Brockett has a great saying:
“It’s hard to argue with a counter-example.”
I have always liked this quote.
One of David’s coolest results, on matrix multiplication, was a proof that the product of an by matrix and a vector could be done in arithmetic operations. This was post Volker Strassen’s great result, but still is an amazing theorem, with an extremely intricate proof.
I have already discussed some lower bound work that we did together on the knapsack problem, today I plan on discussing some upper bound work that we also did. This work, on search in high dimensional spaces, introduced several new concepts to CG.
I have an uncountable number of David stories, stored in my finite number of neurons–is that possible? I could tell you about our periodic ritual of writing out “napkins” that predicted the future of the computer science department at Yale, our monthly poker game for penny stakes, or how Mary-Claire van Leunen worked with us on improving our writing skills.
Mary-Claire, was a member of the support staff at Yale, even though she was way overqualified–she had an M.A. in English. Later she wrote a well received book on technical writing, in which many samples of my prose were prominently displayed. Great. Unfortunately, they were all used as examples of “do not do this–ever.” I like to think that she helped me improve my writing.
David has always been a joy to work with, and back in the 1970′s working was different than it is today. Now I sit in an easy chair with my laptop, and type my blog post into an editor. I then click a button and through the magic of \TeX\, and \LaTeX\, it is typeset and displayed in seconds on the screen. If I want to add anything, or fix an error, I just start typing again. On and on. Writing is almost fun.
When David and I wrote our paper things were quite different. First we wrote the paper out by hand, then, we got into David’s VW bug. The original–I am a hippy from the ’60′s–VW bug, and drove across New Haven to a private home. There we rang the door bell, and were met by our typist, who was usually holding one of her young children in her arms. We handed her the manuscript, and she told us, “I can do this by Friday.” And so on.
Manuscripts were typed on typewriters, by typists. Special symbols came from inserting metal tools, one for each symbol, into the typewriter so that oh just forget it. You get the idea: it’s a miracle we could get anything done.
Our first joint paper at STOC was on searching in arbitrary dimension Euclidean spaces.
The way we approached searching was to think about the one-dimensional case, which is solved by the classic binary search method. At an abstract level binary search shows that given points in a one-dimensional space, there is a data structure that allows queries of the form, which is closest to the point ? Binary search does this by building a data structure, and then using the data structure to solve the query problem efficiently. The data structure is, of course, nothing more than the points arranged in sorted order, while the query is performed by using binary search to locate in the sorted list.
We divided the cost of this solution into two parts. The first was the pre-processing costs, which in this case is the time required to sort the points: . The second was the query time, which in this case is: .
Then, we tried to generalize this to higher dimensions: initially the plane, then three space, and finally the case of dimensions. Immediately, there was a fundamental question: what is the correct generalization? The right one, was not instantly clear to us, even for the simple case of the plane.
However, after a while we realized that the correct generalization to the plane was to replace points by lines. Since the lines divide the plane into bounded and unbounded regions, the search question became the following: find the region that contains the query point. More formally, the problem became: Given lines in the plane, is there an efficient data structure that allows queries of the form: which region of the plane, formed by the lines, contains the query point ?
We found that there was a data structure that solved this problem with pre-processing cost at most and query time . This was nice; although the pre-processing time increased quite a bit, the query time was still logarithmic. The method of proof became known later on as the slab method, more on that later.
I would like to say that we immediately stated and then proved the general theorem:
Theorem: Given any hyperplanes in dimensional Euclidean space, there is a data structure that can locate where a point is in time. Moreover, the pre-processing time of this data structure is
The truth is we struggled with even the three dimensional case. Finally, we decided to try to use advanced visualization aids. We took pieces of cardboard and some empty cardboard boxes and played around in my office, until we figured out what was going on. I still recall David holding a box and me trying to hold three pieces of cardboard, all in an effort to “see” what was happening. I have to add that David’s geometric intuition is terrific, while mine was and is very poor. But, eventually, we made progress.
Slab Method and Improvements
There has been a huge amount of work on refining our theorem–see this. There were two reasons: first our pre-processing cost was huge, and second while the query time was logarithmic the constant was also huge: the query time was actually order . Many improvements were made starting with the work of Kenneth Clarkson, and later Stefan Meiser. Eventually, the pre-processing came down to a still huge order and a query time of . In special cases, such as the plane, even better results are known.
One concept that we introduced to CG, but we did not name, was the slab method. This idea continues to see applications in a variety of geometry problems. The key insight to this method is simple.
In general lines cannot be totally ordered, but when they are confined to a “slab” so that there are no crossings, they can be easily ordered. It is such a simple idea, but a very powerful one.
Today improvements continue. A beautiful example, is the recent work of Mihai Pătraşcu, in which he proves that query time can even be sub-logarithmic time. In order to prove this he needs to work on a grid and in the RAM model. However, the result is quite neat and not at all obvious, in my opinion. His proof uses a kind of slab argument, but of course to get sub-logarithmic time, he must be more clever and cannot simply use binary search on the line segments in the slab. See his paper for details.
Exact Nearest Neighbor
One of my favorite problems, that is still open, is the nearest neighbor (NN) problem. In this problem a set of points are given in dimensional Euclidean space. The problem is to find the pair of points that are closest in the usual Euclidean distance.
The obvious algorithm is to compare all pairs of points, which clearly takes time . Since this time can be huge, the usual approaches are to change the problem. One approach is to consider that the points come from some random distribution.
Another, more popular approach is to allow an approximate answer. The goal, thus, becomes to find a pair of points whose distance is within an of the best pair’s distance, where the error is multiplicative. There is a vast literature of heuristics, theorems, algorithms, and implementations of this approach. I will not say more about it today.
Note, NN is related to the previous work on hyperplanes. If we can pre-process the points, then we can tell for any new point which point it is closest to in logarithmic time. Thus, one might expect that similar ideas from the previous work could be used to attack NN.
However, to my knowledge, the exact NN problem remains unsolved. I have asked one of the top experts, Piotr Indyk, who was kind enough to point out that one can do slightly better than brute force, . For large dimensions , the best known upper bound is of the form where . This is obtained by using fast matrix multiplication. The key insight is that for the Euclidean distance, it suffices to compute the dot product of vectors. This relies on the fact that the square of the distance between two vectors and is,
Then, the dot products are computed by squaring an by matrix.
Solve the exact nearest neighbor problem. Please. Approximations, may be fine for practical situations, but this is a fundamental problem of CG. It is hard to believe that after decades of work, we still are in the dark on how to solve NN exactly.
Another approach could be to use the matrix product connection. I have often wondered if there could be a connection to FFT technology because of the dot formula for distance.
Finally an approach, perhaps, is to use the known approximate methods to act as a filter, that would then allow an efficient search of the remaining pairs. Perhaps.