Comparing proofs for the Jaccard metric

[Image source: BetterExplained]

Kalid Azad is the founder of the website Better Explained, which is devoted to explaining mathematical concepts. He has also written two books.

Today we discuss how some proofs provide a concise explanation whereas others promote exploration of related concepts.

Azad’s site has a rich page titled “Math Proofs vs. Explanations (aka Nutrition vs. Taste).” It argues that the best explanations start with an analogy to a relationship that readers already understand. Even if the connection is not sharp, it can be refined once the reader’s attention is solid. This is opposed to a formal proof, in which every step is sharp and correct but intuition is wanting.

To this we add the role proofs can play in exploration. If you have one proof of a theorem that you understand, there is value in seeking other proofs that use other ideas. Usually we think of ideas as coming first—as thoughts we refine into a proof. The advantage of starting with a proof is already having certitude and sharpness—you know a recipe that works and now can try varying and augmenting it.

## Jaccard Distance as Example

Azad’s page gives examples of proofs for the Pythagorean Theorem and for ${e^{i\theta} = \cos(\theta) + i\sin(\theta)}$. It then quotes from William Thurston’s essay “On Proofs and Progress in Mathematics,” which we once mentioned. We will use the example of Jaccard distance ${J_\delta}$ from our previous post. We start with this definition:

$\displaystyle J_\delta(A,B) = \frac{|A \;\Delta\; B|}{|A \cup B|},$

now using ${\Delta}$ for the symmetric difference ${(A \cup B) \setminus (A \cap B)}$. So the triangle inequality becomes, for any finite sets ${A,B,C}$:

$\displaystyle \frac{|A \;\Delta\; C|}{|A \cup C|} \leq \frac{|A \;\Delta\; B|}{|A \cup B|} + \frac{|B \;\Delta\; C|}{|B \cup C|}. \ \ \ \ \ (1)$
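Before comparing proofs, here is a quick sanity check of ours (not from any of the sources discussed): enumerate small sizes for the seven Venn-diagram regions of ${A,B,C}$ and test (1) exactly with rational arithmetic.

```python
from fractions import Fraction
from itertools import product

def jaccard(sym_diff, union):
    """Jaccard distance from region counts; J(emptyset, emptyset) taken as 0."""
    return Fraction(sym_diff, union) if union else Fraction(0)

# Region sizes: a = |A only|, b = |B only|, c = |C only|,
# ab = |(A∩B)\C|, ac = |(A∩C)\B|, bc = |(B∩C)\A|, abc = |A∩B∩C|.
for a, b, c, ab, ac, bc, abc in product(range(3), repeat=7):
    d_AC = jaccard(a + ab + c + bc, a + ab + ac + abc + c + bc)
    d_AB = jaccard(a + ac + b + bc, a + ab + ac + abc + b + bc)
    d_BC = jaccard(b + ab + c + ac, b + ab + bc + abc + c + ac)
    assert d_AC <= d_AB + d_BC, (a, b, c, ab, ac, bc, abc)
print("triangle inequality holds for all", 3**7, "configurations")
```

Using `Fraction` avoids any floating-point doubt about borderline cases where the two sides are equal.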

We think the proof we gave in the last post is simple and direct and intuitive but maybe not explorative. It first connects to the solid understanding that without the denominators this would be the well-known triangle inequality for Hamming distance. To reprise, it considers ${A,C}$ fixed and varies ${B}$ to arrive at that simpler fact in three steps:

1. If ${B}$ contains ${b}$ elements not in ${A \cup C}$ then removing them subtracts ${b}$ from both right-hand numerators and both right-hand denominators. Since those fractions are each ${< 1}$ (else (1) would be immediately true), the right-hand side goes down.

2. Then we have ${B \subseteq A \cup C}$ and can replace the denominators by ${|A \cup C|}$ without increasing the right-hand side.

3. Now we have a common denominator and a statement equivalent to the known truth about Hamming distance. Since undoing the first two steps to restore the original ${B}$ can only increase the right-hand side, (1) is proved in all cases.

This reasoning readily extends to nonnegative measures ${f}$ on ${A,B,C}$ besides simple counting, provided the removal of elements from ${B}$ makes the same additive-or-proportional change to ${f(A \;\Delta\; B)}$ as it does to ${f(A \cup B)}$, and likewise for the other fraction.
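For the simplest such measure, an additive weight function, the extended claim can be checked exhaustively over a small universe. This sketch is ours; the weights are arbitrary and subsets are encoded as bitmasks.

```python
from fractions import Fraction
from itertools import product

# Arbitrary nonnegative weights for a 4-element universe (our choice).
weights = [3, 1, 4, 1]
n = len(weights)

def f(s):
    """Additive measure: total weight of a subset given as a bitmask."""
    return sum(w for i, w in enumerate(weights) if s >> i & 1)

def jd(x, y):
    """Weighted Jaccard distance: f(x Δ y) / f(x ∪ y), with jd(∅,∅) = 0."""
    union = f(x | y)
    return Fraction(f(x ^ y), union) if union else Fraction(0)

for A, B, C in product(range(1 << n), repeat=3):
    assert jd(A, C) <= jd(A, B) + jd(B, C), (A, B, C)
print("weighted triangle inequality holds for all", (1 << n) ** 3, "triples")
```

Bitmask operations make the set algebra one-liners: `x ^ y` is the symmetric difference and `x | y` the union.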

## Three Snapshot Proofs

The first short proof should join the pantheon of half-page journal papers. Under fair use, here it is in one screenshot:

Perhaps this is too short. We think this proof would have been more satisfying if a few more lines of calculation had been added. Let us divide the region ${T_1}$ into its inner part ${T_{1i}}$ and outer part ${T_{1o}}$ and do likewise for ${T_2,T_3}$. Then it seems the intent was:

$\displaystyle \begin{array}{rcl} d(S_1,S_3) &=& 1 - \frac{|S_1 \cap S_3|}{|S_1 \cup S_3|} = 1 - \frac{|T_{2i}| + |V|}{|U| - |T_{2o}|}\\ &\leq& 1 - \frac{|V|}{|U|} = \frac{|U| - |V|}{|U|} = \frac{|T_1| + |T_2| + |T_3|}{|U|}\\ &\leq& \frac{|T_1| + |T_2|}{|U|} + \frac{|T_2| + |T_3|}{|U|}\\ &\leq& \frac{|T_1| + |T_2|}{|U| - |T_{3o}|} + \frac{|T_2| + |T_3|}{|U| - |T_{1o}|} = d(S_1,S_2) + d(S_2,S_3). \end{array}$

The end uses the symmetric-difference definition of ${J_{\delta}}$, so fully expanding the paper’s intent would perhaps have run longer. One can also begin with that definition to get a shorter calculation, but it skips over the ${1 - \frac{|V|}{|U|}}$ step. Indeed, the paper does not mention ${V}$ at all, so that route was evidently not intended. The proof by Artur Grygorian and Ionut Iacob, in a short paper in last October’s College Mathematics Journal, strikes us as similar in style.

The second proof comes from a MathOverflow thread. It assigns a variable to each region of the Venn diagram, forms the fractions, and cross-multiplies to obtain “the following monstrosity”:

That no coefficient is negative completes the proof. A computer algebra system makes this clear, but what explains why no negative term appears?
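One way to probe that question without a full computer algebra system is to expand the cross-multiplied inequality with hand-rolled polynomial arithmetic over the seven Venn-region variables. This is our own sketch of the verification, not the thread’s code; polynomials are dictionaries mapping exponent tuples to integer coefficients.

```python
def lin(indices):
    """Linear form: the sum of the variables at the given indices (of 7)."""
    mono = lambda i: tuple(1 if j == i else 0 for j in range(7))
    return {mono(i): 1 for i in indices}

def pmul(p, q):
    """Multiply two polynomials represented as {exponent-tuple: coeff}."""
    r = {}
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            m = tuple(a + b for a, b in zip(m1, m2))
            r[m] = r.get(m, 0) + c1 * c2
    return r

def padd(p, q, sign=1):
    """Add (or with sign=-1, subtract) polynomial q to/from p."""
    r = dict(p)
    for m, c in q.items():
        r[m] = r.get(m, 0) + sign * c
    return r

# Variable order: 0 = A only, 1 = B only, 2 = C only,
# 3 = (A∩B)\C, 4 = (A∩C)\B, 5 = (B∩C)\A, 6 = A∩B∩C.
dAC, uAC = lin([0, 3, 2, 5]), lin([0, 3, 4, 6, 2, 5])  # |A Δ C|, |A ∪ C|
dAB, uAB = lin([0, 4, 1, 5]), lin([0, 3, 4, 6, 1, 5])  # |A Δ B|, |A ∪ B|
dBC, uBC = lin([1, 3, 2, 4]), lin([1, 3, 5, 6, 2, 4])  # |B Δ C|, |B ∪ C|

# (1) cross-multiplied: dAC*uAB*uBC <= dAB*uAC*uBC + dBC*uAC*uAB.
lhs = pmul(dAC, pmul(uAB, uBC))
rhs = padd(pmul(dAB, pmul(uAC, uBC)), pmul(dBC, pmul(uAC, uAB)))
diff = padd(rhs, lhs, sign=-1)
assert all(c >= 0 for c in diff.values())
print("all monomial coefficients of (rhs - lhs) are nonnegative")
```

Since every region variable is a nonnegative count, a difference polynomial with no negative coefficient is nonnegative everywhere, which is exactly what the cross-multiplied inequality needs.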

We have realized since the last post that the second of two proofs given in the 2016 paper by Sven Kosub, which we linked in that post, is really equivalent to ours. This is easier to see if one just presumes ${f(\emptyset) = 0}$ in the following:

Here sub-modularity is a standard property for which Kosub cites the equivalent condition that whenever ${B \subseteq D}$ and ${x \notin D}$,

$\displaystyle f(B \cup\{x\}) - f(B) \geq f(D \cup \{x\}) - f(D).$

This suffices for step 1 of our earlier proof, first taking ${D = A \cup B}$ then ${D = B \cup C}$; the rest of that proof needs only that ${f}$ is monotone (and implicitly ${f(\emptyset) = 0}$).

## A Magical Proof

Now we look at proofs that add ideas. The first one still strikes us as clean and magical. We are computer scientists so it is natural to think of finite sets as binary-valued vectors of length ${n}$. They have a ${1}$ in position ${i}$ precisely when ${i}$ is in the set. Of course ${n}$ is the size of the “universe.”

Now let ${X}$ be such a non-zero vector. The key is to use a probabilistic proof: we will relate the Jaccard distance to the outcome of a simple random experiment. Once the experiment is chosen, the proof is simple; it requires only the union bound. Recall this is the fact that

$\displaystyle P[E_{1} \vee E_{2}] \le P[E_{1}] +P[E_{2}],$

for any two events ${E_{1}}$ and ${E_{2}}$.

The cool idea is to look at the permutations of the vector ${X}$. For a permutation ${\pi}$ let us define ${\pi(X)}$ to be

$\displaystyle x_{\pi(1)},\cdots,x_{\pi(n)}.$

Let ${\mathsf{first}(X)=i}$ provided ${x_{i}}$ is the first entry that equals ${1}$. Since ${X}$ is non-zero, this is well defined.

Note ${\mathsf{first}(\pi(X))}$ is a random variable that depends on the choice of the permutation ${\pi}$. The key is to see that the probability that ${\mathsf{first}(\pi(X))=\mathsf{first}(\pi(Y))}$ when we average over all permutations uniformly is equal to

$\displaystyle \frac{|X \cap Y|}{|X \cup Y|}.$

This follows by noting that each element of ${X \cup Y}$ is equally likely to come first under ${\pi}$, and the two values agree exactly when that element lies in ${X \cap Y}$: there are ${|X \cap Y|}$ ways to select the same ${i}$ out of ${|X \cup Y|}$ total ways to select an ${i}$. Complementing gives us that the probability of ${\mathsf{first}(\pi(X)) \neq \mathsf{first}(\pi(Y))}$ equals ${J_\delta(X,Y)}$.
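For a small universe one can confirm this probability exactly by averaging over all ${n!}$ permutations. The sketch below is ours; the sets ${X,Y}$ are arbitrary nonempty examples.

```python
from fractions import Fraction
from itertools import permutations

def first(order, s):
    """First element of set s in the given permutation order."""
    return next(i for i in order if i in s)

n = 6
X, Y = {0, 2, 3}, {2, 3, 5}   # arbitrary nonempty example sets
agree = sum(first(pi, X) == first(pi, Y) for pi in permutations(range(n)))
p_agree = Fraction(agree, 720)  # 6! = 720 permutations in total

# P[agreement] = |X ∩ Y| / |X ∪ Y|; the complement is the Jaccard distance.
assert p_agree == Fraction(len(X & Y), len(X | Y))
j_delta = 1 - p_agree
assert j_delta == Fraction(len(X ^ Y), len(X | Y))
print("J_delta(X, Y) =", j_delta)  # prints 1/2 for these sets
```

Here ${|X \cap Y| = 2}$ and ${|X \cup Y| = 4}$, so exactly half of the 720 permutations make the two min-hash values agree.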

Now hark back to our sets ${A,B,C}$. The event

$\displaystyle \mathsf{first}(\pi(A)) \neq \mathsf{first}(\pi(C))$

is subsumed by the disjunction of events

$\displaystyle \mathsf{first}(\pi(A)) \neq \mathsf{first}(\pi(B)) \vee \mathsf{first}(\pi(B)) \neq \mathsf{first}(\pi(C))$

regardless of what ${B}$ is. By the simple union bound, the probability of the first event is at most the sum of the probabilities of the latter two events. We have thus proved

$\displaystyle J_\delta(A,C) \leq J_\delta(A,B) + J_\delta(B,C).$

The last step is the same as in the proof that Hamming distance is a metric. What does the randomized view gain us? It gains a nice interpretation of ${J_\delta(A,B)}$ as the probability that ${A}$ and ${B}$ hash to different values under the min-hash function ${\mathsf{first}\circ\pi}$ for random ${\pi}$. Min-hashing is used all the time—see this book chapter by Jure Leskovec, Anand Rajaraman, and Jeffrey Ullman, with this proof in section 3.3.3.

Atri Rudra suggested to us the “game” of adjusting ${B}$ one element at a time to walk it toward an extreme value. The sets ${A}$ and ${C}$ can be adjusted too. We start by assuming the triangle inequality (1) is false and make moves that can only keep it that way, until we reach a case where it is obviously true.

Step 1 of our proof already plays this game by removing from ${B}$ any elements not in ${A \cup C}$. So we really start the game with ${B \subseteq A \cup C}$ and we want to walk it to ${B = A \cup C}$. Simply replacing the denominators ${|A \cup B|}$ and ${|B \cup C|}$ in (1) by ${|A \cup C|}$ was good in step 2 of the proof but is not a legal move in this game.

What we can do legally is add elements from ${A \cap C}$ to ${B}$: those leave the denominators unchanged but lower the numerators ${|A \;\Delta\; B|}$ and ${|B \;\Delta\; C|}$. The interesting case is when we want to add to ${B}$ an element from ${A \setminus C}$ or from ${C \setminus A}$. The former add decreases the numerator ${|A \;\Delta\; B|}$ and increases the denominator ${|B \cup C|}$ while leaving ${A \cup B}$ unchanged, but it increases the numerator ${|B \;\Delta\; C|}$. Let us abstract the right-hand side of (1) to ${\frac{p}{q} + \frac{r}{s}}$. Then the former add converts it to

$\displaystyle \frac{p-1}{q} + \frac{r+1}{s+1} \qquad\text{and the latter add to}\qquad \frac{p+1}{q+1} + \frac{r-1}{s}.$

If both moves increase the right-hand side, then we must have

$\displaystyle \frac{p}{q} + \frac{r}{s} < \frac{p-1}{q} + \frac{r+1}{s+1}, \quad\text{so}\quad \frac{1}{q} < \frac{r+1}{s+1} - \frac{r}{s} = \frac{s-r}{s(s+1)} \leq \frac{1}{s+1}.$

And from

$\displaystyle \frac{p}{q} + \frac{r}{s} < \frac{p+1}{q+1} + \frac{r-1}{s}, \quad\text{we get}\quad \frac{1}{s} < \frac{1}{q+1}\;.$

But cross-multiplying gives the contradiction ${q+ 1 < s < s+1 < q}$. So one or both moves must always be possible. This grows ${B}$ to include either all of ${A}$ or all of ${C}$. The rest of the argument to gobble up all of ${A \cup C}$ we leave to you, dear readers.
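The contradiction can also be confirmed by brute force over small values of ${p,q,r,s}$ with ${p \le q}$ and ${r \le s}$ (our own check, not part of Rudra’s game):

```python
from fractions import Fraction

# Verify: no (p, q, r, s) lets both candidate moves strictly increase p/q + r/s.
for q in range(1, 12):
    for s in range(1, 12):
        for p in range(q + 1):          # numerators never exceed denominators
            for r in range(s + 1):
                base = Fraction(p, q) + Fraction(r, s)
                move1 = Fraction(p - 1, q) + Fraction(r + 1, s + 1)
                move2 = Fraction(p + 1, q + 1) + Fraction(r - 1, s)
                assert not (move1 > base and move2 > base), (p, q, r, s)
print("no (p, q, r, s) lets both moves increase the sum")
```

The ranges are small, but the algebraic argument above shows the conclusion holds for all nonnegative integers with ${q, s \ge 1}$.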

Compared to the above proofs, this is tedious. But it captures some tensions among the sizes of ${A}$, ${B}$, and ${C}$ that may inform intuitions about Jaccard similarity under changes in the sets.

## Open Problems

Which proof do you like best for explanation and which for creative impulse?

This is our ${801^{st}}$ post. We intended this discussion as number 800 but were surprised to find the simple proof by reduction to triangle for Hamming distance (steps numbered 1-2-3 above). Are we really the first to write it down, with acknowledgment also to Kosub?

