
Explaining The Jaccard Metric

December 14, 2018


Why is it a metric?

Composite of source 1, source 2

Paul Jaccard was a botanist who worked at ETH in Zurich during much of the first half of the 20th century. He created, or discovered, the similarity notion that became the Jaccard metric. Very neat having a metric named after you.

Today we discuss proofs and explanations that the metric is indeed a metric.

The Jaccard index {J(A,B)} is the ratio of {|A \cap B|} to {|A \cup B|}. The metric is

\displaystyle \text{[*]} \qquad  J_{\delta}(A,B) = 1 - J(A,B) = 1 - \frac{|A \cap B|}{|A \cup B|}

provided {A} and {B} are not both empty sets. If {A} and {B} are both empty then {J(A,B)} by definition is {1} and so {J_\delta(A,B)} is {0}. Generally we can assume that all the sets are non-empty.
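The definitions translate directly into code. Here is a minimal sketch in Python (ours, not from the post) of the index and the metric, including the empty-set convention just described:

```python
def jaccard_index(a, b):
    # J(A,B) = |A ∩ B| / |A ∪ B|; by convention J = 1 when both sets are empty.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    # The Jaccard metric: J_delta(A,B) = 1 - J(A,B).
    return 1.0 - jaccard_index(a, b)

A = {1, 2, 3}
B = {2, 3, 4}
print(jaccard_index(A, B))     # 2 shared elements out of 4 total -> 0.5
print(jaccard_distance(A, B))  # 0.5
```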

The key question is to show that this satisfies the triangle inequality. That is, we must show that

\displaystyle  \text{[**]}\qquad J_\delta(A,C) \le J_\delta(A,B) + J_\delta(B,C).

Many proofs of this are known, and it has been remarked that some are fairly complicated. Some are short, but continued recent interest suggests they have not fully satisfied as explanations.

We think we can supply a quick explanation, if you are already familiar with the triangle inequality holding for Hamming distance on sets:

\displaystyle  |A \oplus C| \leq |A \oplus B| + |B \oplus C|,

where {A \oplus B = (A \cup B) \setminus (A \cap B)} is symmetric difference. By rewriting [*] as {J_\delta(A,B) = |A \oplus B|/|A \cup B|} we can see that [**] becomes similar but includes denominators. If [**] is false then

\displaystyle  \text{[***]}\qquad \frac{|A \oplus C|}{|A \cup C|} > \frac{|A \oplus B|}{|A \cup B|} + \frac{|B \oplus C|}{|B \cup C|}.

Now if we have {B \subseteq A \cup C} then replacing both right-hand denominators by {|A \cup C|} cannot make the right-hand side of [***] bigger. But then we have a common denominator, and we can see from Hamming distance that [**] must be true.

So suppose {B} includes {b > 0} elements not in {A \cup C}. The left-hand side of [***] is at most {1}, so under [***] each right-hand fraction is less than {1}: it has the form {\frac{p}{q}} with {p < q}. The extra elements are counted in both the numerator and the denominator of each right-hand fraction, and when {p < q} we have {\frac{p - b}{q - b} < \frac{p}{q}}, so removing those elements from {B} would also make the right-hand side of [***] smaller. That brings us back to the {B \subseteq A \cup C} case and the previous contradiction. So [**] must be true. That's our proof and explanation in brief.
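The conclusion [**] can also be checked by brute force. The following sketch (ours, for illustration) exhaustively verifies the triangle inequality over every triple of subsets of a small universe:

```python
from itertools import chain, combinations

def jaccard_distance(a, b):
    # J_delta(A,B) = |A ⊕ B| / |A ∪ B|, with distance 0 when both sets are empty.
    if not a and not b:
        return 0.0
    return len(a ^ b) / len(a | b)

def all_subsets(universe):
    s = list(universe)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

# Check J_delta(A,C) <= J_delta(A,B) + J_delta(B,C) over all 16^3 triples
# of subsets of a 4-element universe (small float tolerance for rounding).
sets = all_subsets(range(4))
ok = all(
    jaccard_distance(A, C) <= jaccard_distance(A, B) + jaccard_distance(B, C) + 1e-12
    for A in sets for B in sets for C in sets)
print(ok)  # True
```

Of course this only tests finitely many cases; the proof above is what covers all sets.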

A Careful Take

We will do the above proof more slowly and carefully, to ensure it is really clear. A convention: to make the formulas a bit more readable we use {AB} to denote the intersection of {A} and {B}. So the Jaccard metric is now

\displaystyle  J_\delta(A,B) = 1 - \frac{|AB|}{|A \cup B|}.

First we explain the main ideas:

{\bullet } We will argue that the intermediate set {B} in the triangle inequality can be constrained: we may assume that {B} is a subset of {A \cup C}. The intuition is that any extra elements in {B} can only make the triangle inequality weaker.

{\bullet } We will replace the definition of the Jaccard metric by an equivalent one. This new definition is much closer to a known metric.

{\bullet } We will reduce the triangle inequality finally to a known triangle inequality.

Proof:

Let’s assume that {A,B,C} are the sets, and we wish to prove the triangle inequality:

\displaystyle  1 - \frac{|A C|}{|A \cup C|} \le 1-\frac{|A B|}{|A \cup B|} + 1-\frac{|B C|}{|B \cup C|}.

Claim: We can assume that {B \subseteq A \cup C}. Suppose there was an element {x} in {B} but not in {A \cup C}. Then removing {x} from {B} would only tighten the triangle inequality. That is, the LHS

\displaystyle  1 - \frac{|A C|}{|A \cup C|}

stays the same and the RHS terms

\displaystyle  1-\frac{|A B|}{|A \cup B|} \text{ and } 1-\frac{|B C|}{|B \cup C|},

can only decrease.

Claim: We can re-write {J_\delta(X,Y)} as

\displaystyle  \frac{|X \oplus Y|}{|X \cup Y|}.

As noted above, {X \oplus Y} is the set of elements that are in {X} or {Y} but not both.
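This identity rests on the partition {|X \cup Y| = |XY| + |X \oplus Y|}: every element of the union is in both sets or in exactly one. A quick numerical sanity check of the claim, in Python (ours):

```python
import random

random.seed(0)
for _ in range(1000):
    X = set(random.sample(range(20), random.randint(1, 10)))
    Y = set(random.sample(range(20), random.randint(1, 10)))
    union = len(X | Y)
    # The union splits into elements in both sets and elements in exactly one,
    # so 1 - |X ∩ Y|/|X ∪ Y| equals |X ⊕ Y|/|X ∪ Y|.
    assert union == len(X & Y) + len(X ^ Y)
    assert abs((1 - len(X & Y) / union) - len(X ^ Y) / union) < 1e-12
print("identity holds on all samples")
```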

Claim: We note that after applying the last claim three times, [**] becomes:

\displaystyle  \frac{|A \oplus C|}{|A \cup C|} \le \frac{|A \oplus B|}{|A \cup B|} + \frac{|B \oplus C|}{|B \cup C|}.

Claim: Since {B \subseteq A \cup C} we can multiply by {|A \cup C|} and get that the triangle inequality is implied by

\displaystyle  |A \oplus C| \le |A \oplus B| + |B \oplus C|.

Note this uses that

\displaystyle  |A \oplus B| + |B \oplus C| \le |A \oplus B| \frac{|A \cup C|}{|A \cup B|} + |B \oplus C|\frac{|A \cup C|}{|B \cup C|},

which follows from

\displaystyle  1 \le \frac{|A \cup C|}{|A \cup B|} \text{ and } 1 \le \frac{|A \cup C|}{|B \cup C|}.

Claim: The last step is that

\displaystyle  |A \oplus C| \le |A \oplus B| + |B \oplus C|

is just the triangle inequality for the Hamming distance. It uses that

\displaystyle  (x \oplus y) \le (x \oplus z) + (z \oplus y)

holds for any single bits {x,y,z}.

\Box
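The single-bit inequality used at the end sums coordinate-by-coordinate over the characteristic vectors to give the Hamming triangle inequality. Since there are only eight bit triples, it can be checked exhaustively; a one-liner sketch (ours):

```python
# Verify x ⊕ y <= (x ⊕ z) + (z ⊕ y) for every choice of single bits x, y, z.
ok = all((x ^ y) <= (x ^ z) + (z ^ y)
         for x in (0, 1) for y in (0, 1) for z in (0, 1))
print(ok)  # True
```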

How We Found The Proof

If you’re curious how we found the above, we were trying to check a different kind of proof. We started with the statement of the triangle inequality:

\displaystyle  \text{[**]}\qquad 1 - \frac{|A C|}{|A \cup C|} \le 1-\frac{|A B|}{|A \cup B|} + 1-\frac{|B C|}{|B \cup C|}.

This looks a bit scary, with its multiple ratios and addition and subtraction. But the following feature jumps out:

The right side depends on {B} but the left side does not.

This suggests the idea of asking what happens if we change {B} one element at a time? Can we “walk” it to an extreme point at which the truth of [**] is obvious? A promising start was that we could remove any elements from {B} that are not in {A \cup C}, as shown above. So we can assume {B \subseteq A \cup C}. Can we move {B} toward {A \cup C} or at least {A \oplus C} while preserving the implication of truth for [**]?

Seen in this light, the step in our proof above that replaces the denominators by {|A \cup C|} is an "illegal move" (not a change to {B}), so less interesting. But we happened to notice it worked. Let's follow the train of thought from where we got that {B \subseteq A \cup C} can be assumed. Okay, what is the next move to make with {B}? Look again at the key expression:

\displaystyle  1 - \frac{|A C|}{|A \cup C|} \le 1-\frac{|A B|}{|A \cup B|} + 1-\frac{|B C|}{|B \cup C|}.

Can we simplify this in some way? The answer is yes. The structure of {1} minus an expression suggested that perhaps we could combine the {1} and the ratios. Indeed it is not too hard to note that

\displaystyle  1-\frac{|XY|}{|X \cup Y|} = \frac{|X \oplus Y|}{|X \cup Y|}.

Okay perhaps this is not obvious. It is not trivial, but it is a standard idea that {1 - p/q} looks like the complementation of the "probability" ratio {p/q}. Once you think of this the exact formula follows. Thus we can re-write the required triangle inequality now as

\displaystyle  \frac{|A \oplus C|}{|A \cup C|} \le \frac{|A \oplus B|}{|A \cup B|} + \frac{|B \oplus C|}{|B \cup C|}.

We are almost done. The ratios are annoying, so can we get rid of them? We have assumed that {B \subseteq A \cup C}. So it seems like a good idea to assume—for the moment—that {B} is actually equal to {A \cup C}. But then it is easy to see that {A \cup B = A \cup C} and {B \cup C = A \cup C}. So the above becomes after multiplying by {|A \cup C|},

\displaystyle  |A \oplus C| \le |A \oplus B| + |B \oplus C|.

This looks really nice, no more ratios. Wait, what is this expression? The {\oplus} is the exclusive-or function and it is not hard to see that this is the classic Hamming distance. So this inequality is a fact. Recall the Hamming distance records the number of differences between two bit-vectors, and sets are really just bit vectors.
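The set/bit-vector correspondence is easy to make concrete. A small sketch (ours) showing that {|A \oplus C|} equals the Hamming distance between the characteristic vectors:

```python
def to_bits(s, n):
    # Encode a subset of {0, ..., n-1} as its characteristic bit vector.
    return [1 if i in s else 0 for i in range(n)]

def hamming(u, v):
    # Number of coordinates where the two vectors differ.
    return sum(a != b for a, b in zip(u, v))

A, C = {0, 1, 2}, {2, 3}
n = 5
# |A ⊕ C| = |{0, 1, 3}| = 3, matching the Hamming distance of 11100 and 00110.
print(len(A ^ C), hamming(to_bits(A, n), to_bits(C, n)))  # 3 3
```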

Are we there yet? Almost. We only need to argue that if {B} is a proper subset of {A \cup C} then the inequality we need is actually stronger: in that case {|A \cup B|} and {|B \cup C|} are at most {|A \cup C|}, so each fraction on the right is at least as large as it is with the common denominator {|A \cup C|}. So we are done.

Open Problems

Do you like our proof? Is it clear from the intro—or from the second section? Or are you still unsure why the Jaccard metric satisfies the triangle inequality? We may follow up with more about other proofs.

We don’t think the photo used by this online Alchetron bio of Paul Jaccard is the botanist. It looks too modern, for one. No other photo seems extant. Google Images guesses Alchetron’s image to be Adrian Herzog, but we think its highest Jaccard index is to Orepic user paul.jaccard, as shown at top. Can you solve the mystery of who it is?


[Switched to standard J-sub-delta notation to clarify and fix issues, fixed HTML conversion glitch with p-q fractions]

11 Comments
  1. December 15, 2018 1:49 pm

    If we identify sets with the uniform distribution on them, is there a way to extend the Jaccard metric to arbitrary probability distributions?

    • December 18, 2018 10:29 am

      This is partly the motivation I see in Sven Kosub’s paper. We intend a followup.

  2. December 15, 2018 2:22 pm

    I think the image on the left of your composite is of Jean-Paul Jaccard, who is the executive director of a german company: https://stramec.ch/portrait/about-us/

  3. December 15, 2018 3:36 pm

    If A and B are both empty, shouldn’t J(A, B) be 0? I think you are mixing up J and d_J, to use the notation of the Wikipedia article.

  4. T.S. Jayram permalink
    December 17, 2018 3:19 am

    My favorite proof is via minhash: for a random permutation \pi of the universe, let h_{\pi}(A) for each set A denote the smallest element in A w.r.t. \pi. Then it can be seen that:
    Pr_{\pi}[h_{\pi}(A) \ne h_{\pi}(B)] = J(A,B)
    This is an embedding of the Jaccard metric isometrically to the (scaled) discrete metric. Now the triangle inequality follows.

    This viewpoint of using hashing to understand similarity was perhaps first expounded by Moses Charikar in “Similarity estimation techniques from rounding algorithms”, STOC 2002.

    Another cute application that I like is that the Goemans-Williamson hyperplane rounding allows one to show that the angular distance between unit vectors in R^d is a metric. For d=3, this is apparently the “great circle” distance for which the standard proof, as far as I know, has more of a geometric flavor.
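The minhash argument in this comment is easy to check numerically. A sketch (ours, not the commenter's) that estimates {Pr_{\pi}[h_{\pi}(A) \ne h_{\pi}(B)]} by sampling random permutations of a small universe and compares it with {J(A,B)}:

```python
import random

def minhash_agreement(A, B, universe, trials=20000, seed=1):
    # Estimate Pr[h_pi(A) == h_pi(B)] over random permutations pi;
    # by the minhash property this probability equals J(A, B).
    random.seed(seed)
    u = list(universe)
    agree = 0
    for _ in range(trials):
        random.shuffle(u)                        # a random permutation pi
        rank = {x: i for i, x in enumerate(u)}
        h = lambda s: min(s, key=rank.__getitem__)  # smallest element w.r.t. pi
        agree += (h(A) == h(B))
    return agree / trials

A, B = {1, 2, 3}, {2, 3, 4}
est = minhash_agreement(A, B, range(10))
print(est)  # should be close to J(A,B) = 2/4 = 0.5
```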

    • December 18, 2018 10:41 am

      Thanks. This is the most concise explication of the minhash proof we’ve found. A discussion of Jeffrey Ullman’s version (section 3.3.3 here) was originally intended for this post and may be in a followup.

  5. December 17, 2018 6:59 am

    Should J(A,B) be 0 if A and B are both empty?

    • December 18, 2018 10:33 am

      Thanks, Peter and Aaron above. I went with {J_\delta} for the metric.

  6. abc permalink
    December 17, 2018 2:52 pm

    I think that the Jaccard metric can be derived from the (more general) fact that the following function, defined on pairs of finite random variables, is a metric. The function is d(X,Y)=1-\frac{I(X;Y)}{H(XY)}, where I(X;Y) denotes the mutual information and H(XY) the joint entropy. Horibe (1973) proved that d(X,Y) is indeed a metric and the proof is very, very short. Now, it is well known that any valid formula with entropies and mutual information can be transformed into an equally valid formula (involving finite sets) in which H(XY) is substituted with \mu(A\cup B) and I(X;Y) with \mu(A\cap B), where \mu is an arbitrary set-additive function (see Csiszar and Korner Information theory book, p. 37). One gets the Jaccard metric by choosing \mu equal to set cardinality |\cdot|. This should also answer Chris Moore's question, I believe.

Trackbacks

  1. Explanations and Explorations | Gödel's Lost Letter and P=NP
