
High Dimensional Search and the NN Problem

June 18, 2009


The early days of computational geometry and the nearest neighbor problem


David Dobkin is one of the founders of Computational Geometry (CG). This initial work was done while we were both assistant professors at Yale, back in the mid-1970's, a mere three decades ago. David also did seminal work on matrix multiplication for his Ph.D. under Roger Brockett at Harvard. Brockett has a great saying:

“It’s hard to argue with a counter-example.”

I have always liked this quote.

One of David's coolest results, on matrix multiplication, was a proof that the product of an {n} by {\log n} matrix and a vector could be computed in {O(n)} arithmetic operations. This came after Volker Strassen's great result, but it is still an amazing theorem, with an extremely intricate proof.

I have already discussed some lower bound work that we did together on the knapsack problem; today I plan to discuss some upper bound work that we also did. This work, on search in high dimensional spaces, introduced several new concepts to CG.

I have an uncountable number of David stories, stored in my finite number of neurons–is that possible? I could tell you about our periodic ritual of writing out “napkins” that predicted the future of the computer science department at Yale, our monthly poker game for penny stakes, or how Mary-Claire van Leunen worked with us on improving our writing skills.

Mary-Claire was a member of the support staff at Yale, even though she was way overqualified–she had an M.A. in English. Later she wrote a well-received book on technical writing, in which many samples of my prose were prominently displayed. Great. Unfortunately, they were all used as examples of “do not do this–ever.” I like to think that she helped me improve my writing.

David has always been a joy to work with, and back in the 1970's, working was quite different from how it is today. Now I sit in an easy chair with my laptop, and type my blog post into an editor. I then click a button and, through the magic of TeX and LaTeX, it is typeset and displayed in seconds on the screen. If I want to add anything, or fix an error, I just start typing again. On and on. Writing is almost fun.

When David and I wrote our paper things were quite different. First we wrote the paper out by hand; then we got into David's VW bug (the original VW bug; I am a hippy from the '60's) and drove across New Haven to a private home. There we rang the doorbell, and were met by our typist, who was usually holding one of her young children in her arms. We handed her the manuscript, and she told us, “I can do this by Friday.” And so on.

Manuscripts were typed on typewriters, by typists. Special symbols came from inserting metal tools, one for each symbol, into the typewriter so that {\dots} oh just forget it. You get the idea: it’s a miracle we could get anything done.

Multi-Dimensional Search

Our first joint STOC paper was on searching in Euclidean spaces of arbitrary dimension.

The way we approached searching was to think about the one-dimensional case, which is solved by the classic binary search method. At an abstract level, binary search shows that given {n} points {x_{1}, \dots, x_{n}} in a one-dimensional space, there is a data structure that allows queries of the form: which {x_{i}} is closest to the point {q}? Binary search does this by building a data structure, and then using the data structure to answer the query efficiently. The data structure is, of course, nothing more than the points {x_{1}, \dots, x_{n}} arranged in sorted order, while the query is performed by using binary search to locate {q} in the sorted list.

We divided the cost of this solution into two parts. The first was the pre-processing cost, which in this case is the time required to sort the points: {O(n\log n)}. The second was the query time, which in this case is {O(\log n)}.
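
For concreteness, here is a minimal Python sketch of this one-dimensional scheme. The function names and the use of Python's standard bisect module are illustrative choices only, not anything from our paper.

    import bisect

    def preprocess(points):
        # Pre-processing: sort the points once, O(n log n).
        return sorted(points)

    def nearest(sorted_points, q):
        # Query: binary search for q, then compare the (at most two)
        # candidates that straddle it.  O(log n) per query.
        i = bisect.bisect_left(sorted_points, q)
        candidates = sorted_points[max(0, i - 1): i + 1]
        return min(candidates, key=lambda x: abs(x - q))

    pts = preprocess([7.0, 2.0, 9.5, 4.2, 11.0])
    print(nearest(pts, 5.0))   # -> 4.2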

Then, we tried to generalize this to higher dimensions: initially the plane, then three dimensions, and finally the case of {d} dimensions. Immediately, there was a fundamental question: what is the correct generalization? The right one was not instantly clear to us, even for the simple case of the plane.

However, after a while we realized that the correct generalization to the plane was to replace points by lines. Since the lines divide the plane into bounded and unbounded regions, the search question became the following: find the region that contains the query point. More formally, the problem became: Given {n} lines {L_{1}, \dots, L_{n}} in the plane, is there an efficient data structure that allows queries of the form: which region of the plane, formed by the lines, contains the query point {q}?

We found that there was a data structure that solved this problem with pre-processing cost at most {O(n^{2})} and query time {O(\log n)}. This was nice; although the pre-processing time increased quite a bit, the query time was still logarithmic. The method of proof became known later on as the slab method; more on that below.

I would like to say that we immediately stated and then proved the general theorem:

Theorem: Given any {n} hyperplanes in {d} dimensional Euclidean space, there is a data structure that can locate which region contains a point {q} in {O(\log n)} time. Moreover, the pre-processing time of this data structure is {O(n^{2^{d+1}})}.

The truth is we struggled with even the three dimensional case. Finally, we decided to try to use advanced visualization aids. We took pieces of cardboard and some empty cardboard boxes and played around in my office, until we figured out what was going on. I still recall David holding a box and me trying to hold three pieces of cardboard, all in an effort to “see” what was happening. I have to add that David’s geometric intuition is terrific, while mine was and is very poor. But, eventually, we made progress.

Slab Method and Improvements

There has been a huge amount of work on refining our theorem–see this. There were two reasons: first, our pre-processing cost was huge, and second, while the query time was logarithmic, the constant was also huge: the query time was actually of order {2^{d}\log n}. Many improvements were made, starting with the work of Kenneth Clarkson and later Stefan Meiser. Eventually, the pre-processing came down to a still huge order {n^{d/2}}, and the query time to {d^{5}\log n}. In special cases, such as the plane, even better results are known.

One concept that we introduced to CG, but we did not name, was the slab method. This idea continues to see applications in a variety of geometry problems. The key insight to this method is simple.

In general lines cannot be totally ordered, but when they are confined to a “slab” so that there are no crossings, they can be easily ordered. It is such a simple idea, but a very powerful one.
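
To make the picture concrete, here is a small Python sketch of the slab idea. It is purely illustrative: the names are mine, and the construction below naively re-sorts the lines in every slab, so it does not achieve the {O(n^{2})} pre-processing bound mentioned above, but it does give the logarithmic query.

    import bisect
    import itertools

    # A line is represented as (a, b), meaning y = a*x + b.
    # Vertical lines are ignored in this sketch to keep it short.

    def build_slabs(lines):
        # The x-coordinates of all pairwise intersections split the plane
        # into vertical slabs; inside a slab no two lines cross, so the
        # lines can be totally ordered by their height at the slab's midpoint.
        xs = []
        for (a1, b1), (a2, b2) in itertools.combinations(lines, 2):
            if a1 != a2:                       # parallel lines never cross
                xs.append((b2 - b1) / (a1 - a2))
        xs = sorted(set(xs))
        if xs:   # one representative x per slab, including the two unbounded slabs
            mids = [xs[0] - 1.0] + [(u + v) / 2.0 for u, v in zip(xs, xs[1:])] + [xs[-1] + 1.0]
        else:
            mids = [0.0]                       # all lines parallel: a single "slab"
        slabs = [sorted(lines, key=lambda ab: ab[0] * m + ab[1]) for m in mids]
        return xs, slabs

    def locate(xs, slabs, q):
        # Query: binary search on x to find the slab, then binary search on
        # the ordered lines to count how many lie below q.  The pair
        # (slab index, lines below) identifies the cell containing q.
        qx, qy = q
        s = bisect.bisect_right(xs, qx)
        heights = [a * qx + b for a, b in slabs[s]]   # sorted within the slab
        below = bisect.bisect_right(heights, qy)
        return s, below

    xs, slabs = build_slabs([(0.0, 0.0), (1.0, -1.0), (-1.0, 3.0)])
    print(locate(xs, slabs, (1.5, 0.2)))   # -> (1, 1): slab 1, one line below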

Today improvements continue. A beautiful example is the recent work of Mihai Pătraşcu, in which he proves that the query time can even be sub-logarithmic. In order to prove this he needs to work on a grid and in the RAM model. However, the result is quite neat and not at all obvious, in my opinion. His proof uses a kind of slab argument, but of course to get sub-logarithmic time he must be more clever and cannot simply use binary search on the line segments in the slab. See his paper for details.

Exact Nearest Neighbor

One of my favorite problems, which is still open, is the nearest neighbor (NN) problem. In this problem a set of {n} points is given in {d} dimensional Euclidean space. The problem is to find the pair of points that are closest in the usual Euclidean distance.

The obvious algorithm is to compare all pairs of points, which clearly takes time {O(dn^{2})}. Since this time can be huge, the usual approaches change the problem. One approach is to assume that the points come from some random distribution.
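
For reference, here is a baseline sketch in Python of the obvious all-pairs algorithm just mentioned; the code is my own, just to fix notation, and math.dist needs Python 3.8 or later.

    import math
    from itertools import combinations

    def closest_pair_brute_force(points):
        # Compare all pairs: O(d n^2) time for n points in d dimensions.
        best, best_dist = None, math.inf
        for p, q in combinations(points, 2):
            dist = math.dist(p, q)         # Euclidean distance, O(d) per pair
            if dist < best_dist:
                best, best_dist = (p, q), dist
        return best, best_dist

    print(closest_pair_brute_force([(0, 0, 0), (5, 5, 5), (1, 0, 1)]))
    # -> (((0, 0, 0), (1, 0, 1)), 1.4142...)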

Another, more popular, approach is to allow an approximate answer. The goal then becomes to find a pair of points whose distance is within a multiplicative factor of {1+\epsilon} of the best pair's distance, for some {\epsilon>0}. There is a vast literature of heuristics, theorems, algorithms, and implementations of this approach. I will not say more about it today.

Note that NN is related to the previous work on hyperplanes. If we can pre-process the points, then for any new point we can tell which point it is closest to in logarithmic time. Thus, one might expect that similar ideas from the previous work could be used to attack NN.

However, to my knowledge, the exact NN problem remains unsolved. I have asked one of the top experts, Piotr Indyk, who was kind enough to point out that one can do slightly better than brute force, {O(dn^{2})}. For large dimensions {d}, the best known upper bound is of the form {O(d^{a}n^{2})} where {0<a<1}. This is obtained by using fast matrix multiplication. The key insight is that for the Euclidean distance, it suffices to compute the dot products of the vectors. This relies on the fact that the square of the distance between two vectors {x} and {y} is

\displaystyle  \sum_{i} x_{i}^{2} + \sum_{i} y_{i}^{2} - 2\sum_{i} x_{i}y_{i}.

Then, all the dot products can be computed at once by multiplying the {n} by {d} matrix of points by its transpose.
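
Here is a short NumPy sketch of that reduction. It uses the ordinary matrix product; the asymptotic savings claimed above come from substituting a fast matrix multiplication algorithm, which NumPy does not do.

    import numpy as np

    def all_pairs_squared_distances(X):
        # X is an n-by-d array whose rows are the points.  All n^2 squared
        # distances follow from the pairwise dot products X @ X.T via
        #   ||x - y||^2 = ||x||^2 + ||y||^2 - 2 <x, y>.
        sq_norms = np.sum(X * X, axis=1)
        gram = X @ X.T
        d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
        return np.maximum(d2, 0.0)          # guard against tiny negative round-off

    def closest_pair(X):
        d2 = all_pairs_squared_distances(X)
        np.fill_diagonal(d2, np.inf)        # ignore the zero self-distances
        i, j = np.unravel_index(np.argmin(d2), d2.shape)
        return i, j, float(np.sqrt(d2[i, j]))

    X = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0], [1.0, 0.0, 1.0]])
    print(closest_pair(X))                  # -> (0, 2, 1.414...)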

Open Problems

Solve the exact nearest neighbor problem. Please. Approximations may be fine for practical situations, but this is a fundamental problem of CG. It is hard to believe that after decades of work we are still in the dark on how to solve NN exactly.

Another approach could be to use the matrix product connection. I have often wondered if there could be a connection to FFT technology, because of the dot-product formula for distance.

Finally, an approach, perhaps, is to use the known approximate methods as a filter, which would then allow an efficient search of the remaining pairs. Perhaps.

20 Comments
  1. Alex Nikolov
    June 18, 2009 8:41 pm

    Hi! I might be missing something. For d=2 there is a classical deterministic n*log n algorithm for the closest pair of points. There is also the famous (and beautiful) randomized O(n) algorithm by Rabin, which places a grid on the plane. Can’t Rabin’s algorithm be generalized to higher d, to get a better dependence on n, at least?

  2. rjlipton
    June 18, 2009 9:20 pm

    My comments on NN are about large dimensions. Rabin's algorithm, which I posted on before, is great for the plane, but would explode as d gets large. The issue is how many adjacent cells there are.

  3. Jon
    June 18, 2009 10:28 pm

    Very nice story. One tip: you might want to link to the definitions of any obscure terms you use, for example typewriter. TIA!

    • rjlipton
      June 19, 2009 7:48 am

      Love this comment.

      Should I explain what a VW bug is too? It was a car–was that clear?

  4. June 19, 2009 1:02 am

    Smirking again…

    “It’s hard to argue with a counter-example.”

    Only in maths NOT in politics or religion :-D

    Solve the exact nearest neighbor problem. Please.

    What do you mean?
    It IS solved, just not the way you would want (O(dn^2))
    This is another case of the fuzzy transition between natural language and maths (re: my disparaging of Lakatos “mathematical discovery”)
    For which value would you consider this “solved”?
    O(n)???

  5. June 19, 2009 4:12 am

    I seem to recall that after O(dn log n) preprocessing (inserting n nodes into a kd-tree), you can do nearest neighbour searches in O(c^d log n) time for some constant c >= 2 dependent on the distance measure (c = 2 for the infinity-norm). The exponentiality isn’t so great for high dimensions, sadly. It’s from an old paper by Friedman, Bentley, and Finkel, though a key page in their proof (page 10 or 11) seems to be missing in every copy I’ve looked at, so I wasn’t able to figure out exactly what all of their assumptions were.

  6. breakingkeyboards
    June 19, 2009 4:33 am

    Instead of hyperplanes, using a kd-tree structure (for example, a quadtree in 2D) it is possible to check only the neighboring cells in the structure, so in the worst case 3^d - 1 cells (for example, 8 cells in the quadtree) for each point, giving n*3^d in total; but probably it is possible to do something better.

    • rjlipton
      June 19, 2009 7:52 am

      What I mean is: can we get closer to linear? In this case the data has size nd, so can we do close to that? All the “fast” exact methods beat that, BUT they hide an exponential constant based on d. So for low dimensions they are good, but for high they are not. Since many interesting problems clearly have huge d, the problem is still open.

  7. Rafee Kamouna
    June 19, 2009 9:08 am

    Dear Prof Dick Lipton,

    “It’s hard to argue with a counter-example”. A wonderful assertion.

    Yet, it is absolutely easy to imagine that it does not exist. An infinite class of counter-examples to NP-completeness has been here for a complete year:

    http://arxiv.org/PS_cache/arxiv/pdf/0806/0806.2947v7.pdf

    Another class of counter-examples (without fuzzy logic) is here:

    http://kamouna.wordpress.com/

    The Liar’s Paradox implies SAT is (NOT) NP-complete & The Kleene-Rosser Paradox implies SAT is (NOT) NP-complete.

  8. Shahab
    July 13, 2009 2:41 pm

    Hi,

    I was wondering if you know of any negative result in this regard, something that shows that, using such and such tools, the problem needs time at least of the order n^k for some k.

    Shahab

    • rjlipton
      July 13, 2009 3:31 pm

      There are very few lower bounds on anything. With restrictions on the model of computation there are lower bounds. Do you have an application in mind?

      • Shahab
        July 14, 2009 1:38 am

        Not really,

        It was essentially that it came as a shock to me that such an easy-looking problem does not have a generalization (especially because I used to take it for granted that the method that works in the plane can be generalized to higher dimensions as well).

        So, I thought it would be appealing to know some negative results, as they usually point out what properties hard instances have.

        Shahab

  9. March 4, 2011 6:28 pm

    I’m not a theorist, nor even a professional mathematician. And I know this comment is years late. But has there been any progress on the nearest neighbor problem?

    My unschooled intuition suggests two possible approaches, the first using Voronoi diagrams: add the points one at a time, and maintain a data structure that represents the Voronoi diagram of the points that have been added so far. The hope here is that it might be less expensive to update the Voronoi diagram than it is to compare a point against every other point. I think that when you add a new point to a Voronoi diagram, it carves out a new domain from the real estate of the domain it lands in, and its immediately adjacent domains. This would confine the amount of comparison you would have to do for each point.

    The other possible approach is to sort the coordinates independently in each dimension, and look for theorems that say that the nearest neighbors must be almost adjacent in all of those sorted lists. I don’t know how to rigorize that, though.

    You often have reasons for believing that particular kinds of approaches to a problem are doomed from the outset, for example, the “bait and switch” argument about boolean functions. Do you have reasons to think that the approaches I have sketched here are doomed?

    • rjlipton
      March 5, 2011 10:21 am

      ACW

      A great question, and one of my favorites. The exact problem is still wide open.

    • March 5, 2011 4:18 pm

      Unfortunately, the equivalent of the Voronoi diagram in n dimensions can take exponential space to represent explicitly, like how the n-dimensional convex hull can take exponential space to represent explicitly. :(

      The hand-wavy reason is that in 2D, the boundaries are edges, in 3D, the boundaries are faces, in 4D, the boundaries are volumes, etc. Describing each boundary becomes more and more complicated with each dimension. I’d wondered a while back if there’s some way to avoid this by storing only some indirect description of the boundary for convex hull, but I didn’t look into it much. It’s nonetheless more difficult than it looks.

    • March 5, 2011 5:34 pm

      Sorry, I forgot to address your second suggested approach. Unfortunately, you can have both: 1) Points that are extremely far from the query but close to the query in many lists, and 2) Points that are the closest to the query, but far from the query in at least one list.

      Ironically, an example that can be either is when a point has all identical coordinates to the query (distance 0) except for one coordinate, for which it’s far. Supposing that the distance in that one coordinate is sqrt(n), if you have a bunch of points that are around 1 unit away from the query in all n dimensions, those will be around the same distance from the query. Varying whether those points are closer or farther, you can create either case.

      That said, there are many heuristics out there, and this is a nice simple heuristic. How well it works would depend on the details you choose to implement, and the details of what you’re using it for.

