
Rankings Versus Ratings

October 22, 2015

Handling both with the amazing generalized Kendall tau distance


Michelle Kwan was one of the last great figure skaters to compete under the historic “6.0” ranking system. She won the world championship five times under that system but was a squeaker second in the 1998 Winter Olympics and a more-distant third in 2002. An injury on the eve of the 2006 Winter Olympics prevented her from competing under the new and current system, which involves a complex numerical rating formula.

Today we discuss rankings versus ratings with an eye to complexity and managing them by common tools.

A ranking of {r} items {u_i} from a set of {n} items is a list {(u_1,u_2,\dots,u_r)} in non-ascending order of preference; it is complete if {r = n.} A rating of those items is a function {f: \{u_1,\dots,u_r\} \rightarrow \mathbb{R}.} A rating always induces a ranking—maybe allowing ties—but not the other way around. When a ranking has no ties we associate to it the ordinal function {ord(u_1) = 1,\dots,ord(u_r) = r}; when there are ties we may average the ordinals or use some integer near the average. Given {m} different rankings or ratings—of subsets of the items that may only partly overlap and perhaps span multiple categories—the object is to aggregate them into the single ranking or rating that best represents the sample.
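To make the tie-handling concrete, here is a minimal Python sketch (the function name and the tie-as-sublist convention are my own) that assigns ordinals to a ranking, averaging the ordinals within each tie group:

```python
def ordinals(ranking):
    """Map each item of a ranking (best first; a tie group given as a
    sub-list) to its ordinal, averaging ordinals within each tie group."""
    ord_of = {}
    position = 1
    for group in ranking:
        group = group if isinstance(group, list) else [group]
        # Items tied at positions p, ..., p+k-1 all get the average ordinal.
        ord_of.update({item: position + (len(group) - 1) / 2 for item in group})
        position += len(group)
    return ord_of

print(ordinals(["a", ["b", "c"], "d"]))  # {'a': 1.0, 'b': 2.5, 'c': 2.5, 'd': 4.0}
```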

The old 6.0 system used points on a 0.0–6.0 scale (increments of 0.1) in the two categories technical merit and artistic presentation to make a number up to 12.0 for each skater. Each judge’s scores were then treated as a ranking, no matter how far ahead one skater was of the next. There were rules for avoiding and breaking ties, but in extreme cases like those in our post on Kenneth Arrow’s paradox, ties could be left to stand.

The new rating system nearly always produces a total order upon simply adding the judges’ scores for each skater. After averaging, the highest score achieved in international competition to date for a man is 295.27 by Patrick Chan of Canada, and for a woman, 228.56 by Kim Yuna of Korea. It is not clear what these scores are “out of”—apparently not 300 and 250. They do not connote perfection the way “6.0” did. An appraisal by The Economist during last year’s Winter Olympics noted that the new system pushes skaters to technical extremes—while leaving

…little time either during routines or in training sessions for optional acrobatic or artistic showstoppers, like Michelle Kwan’s notorious spirals, in which she flashed a huge smile while speeding down the ice and audiences routinely jumped to their feet.

One aspect of the 2015 Applied Decision Theory (ADT) conference, which we recently mentioned, that surprised me was the theme from several speakers that rankings and preferences are often more “human” than ratings, both for obtaining data and for representing it. So let us look more closely at ratings and rankings—and also at preferences that may have cycles and hence not yield rankings.

Some Decision Data Structures

The first fact is obvious but central: a rating of {n} items needs size only {\tilde{O}(n)}, where the {\sim} absorbs {O(\log n)}-sized labels of items and log-sized numbers, whereas a preference structure may involve specifying up to {\binom{n}{2} = \Omega(n^2)} pairs even if it does define a total order. I’ve appreciated this difference when poring over tables of hundreds of ranked players before my fantasy baseball and football drafts. How do they decide to rank Peyton Manning ahead of Drew Brees, then Brees above Tom Brady—and really, Manning over Brady? The answer is that they didn’t weigh any pairwise preferences; they ran a model to generate expected “fantasy league points” for each player and transcribed the total order from these ratings. (I do not know how much Brady’s expected points were deflated by his possible multi-game suspension.)

Preference structures are more combinatorial. When all pairs are compared they define a tournament. They also can be staggered as nets of conditional preferences (CP-nets). A CP-net allows saying to prefer {u_i} over {u_j} if the aggregate chooses {u_k} over {u_\ell}, but reverse the preference if {u_\ell} is chosen. Such a CP-net might be expanded out to a preference structure on sets of items, but the CP-net form can be exponentially more succinct. The keynote for the first day of ADT 2015 was by Kristen Brent Venable on her paper with Cristina Cornelio, Umberto Grandi, Judy Goldsmith, Nick Mattei, and Francesca Rossi on a compact way to represent probabilistic uncertainty about preferences by distributions over CP-nets.

A ranking without ties is also a familiar combinatorial object: a permutation. Thus algebraic concepts from permutation groups are relevant to analyzing rankings. It seems less natural to apply them to ratings—valuations apply to elements of fields in connection with algebraic geometry, but permutations of those elements seem not to be involved. Preferences that don’t yield rankings can have algebraic structure in a different way: they might decompose into cycles. Acyclic preference structures have well-defined transitive closures and so inherit all the theory of partial orders and lattices. All this seems a lot of bother compared to using ratings, but one surprise alluded to by Venable and by Michel Regenwetter presenting the first paper of ADT 2015 is that it is much harder to get human respondents to give ratings than preferences, and the rating numbers are often unreliable when they do.

Aggregation Problems

An aggregation problem can be specified by giving a distance measure {\Delta(R,{\cal S})} between a ranking or rating {R} and a set {{\cal S}} of such rankings/ratings. The set {{\cal S}} can simply be the set of {R^1,\dots,R^m} being aggregated or might be derived from it. The problem in either case is given {R^1,\dots,R^m} to find {R^*} minimizing {\Delta(R^*,{\cal S})}, or to decide whether the distance can be made lower than some given value {v.} In the simple case we can use a binary measure {\delta(R,R')} of dissimilarity and apply (e.g.) “least squares” to ask for {R^*} minimizing

\displaystyle  \sum_{k=1}^m \delta(R^*,R^k)^2.
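For small {n} and {m} this least-squares aggregation can be solved by brute force over all permutations. Here is a hedged Python sketch (all names are my own), using the classic count of disagreeing pairs as {\delta}:

```python
from itertools import permutations

def kendall_delta(r1, r2):
    """Number of item pairs on which two tie-less rankings disagree."""
    pos1 = {u: i for i, u in enumerate(r1)}
    pos2 = {u: i for i, u in enumerate(r2)}
    items = list(pos1)
    return sum((pos1[u] < pos1[v]) != (pos2[u] < pos2[v])
               for i, u in enumerate(items) for v in items[i + 1:])

def aggregate(rankings):
    """Brute-force R* minimizing the sum of squared Kendall distances."""
    return min(permutations(rankings[0]),
               key=lambda r: sum(kendall_delta(r, rk) ** 2 for rk in rankings))

votes = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
print(aggregate(votes))  # ('a', 'b', 'c')
```

Of course, as noted below, these problems are NP-hard in general, so enumeration is only feasible for toy sizes.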

The complexity parameters {m} and {n} govern these problems. An example with high {m} and low {n} is determining one or more winners of an election. An example with low {m} and high {n} is comparing and aggregating results from a few search engines. A much-cited reference for the latter is the 2001 paper “Rank Aggregation methods for the Web” by Cynthia Dwork, Ravi Kumar, Moni Naor, and D. Sivakumar (who graduated from Buffalo under me in 1996).

It is not surprising that many of these aggregation problems are {\mathsf{NP}}-hard, and remain hard even when {m} or {n} is a fixed small number like 3 or 4 (see e.g. these slides by David Williamson). What was surprising at ADT 2015 was that several speakers didn’t care so long as the problems could be transcribed efficiently into {\mathsf{SAT}}: “we can just use our {\mathsf{SAT}}-solvers on them.” Welcome to a world where {\mathsf{NP}} may as well equal {\mathsf{P}}, but we care about the difference between {\tilde{O}(n)} and quadratic time, or quadratic versus cubic. With that in mind let’s consider those {\delta} functions.

The Generalized Kendall Tau Distance

Let {\mu(u_i,u_j)} stand for any function on pairs of items that is antisymmetric:

\displaystyle  \mu(u_j,u_i) = -\mu(u_i,u_j).

If we have a rating function {f}, then there is the simple rating difference function

  • {\mu_f(u_i,u_j) = f(u_i) - f(u_j).}

If we only have a ranking, then we can define

  • {\mu_1(u_i,u_j) = ord(u_j) - ord(u_i)}, or

  • {\mu_0(u_i,u_j) = \text{\it sign}(ord(u_j) - ord(u_i)).}

Note that these depend only on the function values {f(u_i)}, or {ord(u_i)} in the case of rankings. We can sacrifice some generality for clarity by assuming this of {\mu.} Then we can represent two ratings or rankings {X} and {Y} having respective functions {f,g} by vectors of function values {x_i = f(u_i)} and {y_i = g(u_i).} This further allows us to pretend that “the same” {\mu} function is being applied to pairs of values for each ranking, though per above there are really different functions {\mu_f(u_i,u_j) = \mu(f(u_i),f(u_j))} and {\mu_g(u_i,u_j) = \mu(g(u_i),g(u_j)).} There might be some useful generality to gain by using separate {\mu_f,\mu_g} and noting that no other information about {X} and {Y} is used, but the form with {x,y} values is nicest. Define:

\displaystyle  \tau(X,Y) = \frac{\sum_{i,j = 1}^n \mu(x_i,x_j)\mu(y_i,y_j)} {\left(\sum_{i,j=1}^n \mu(x_i,x_j)^2\right)^{1/2}\left(\sum_{i,j=1}^n \mu(y_i,y_j)^2\right)^{1/2}}.

The British statistician Maurice Kendall originally defined this for tie-less rankings and {\mu = \mu_0.} Then the denominator is just a sum of {1^2 = (-1)^2 = 1} over all {i \neq j} so it equals {\binom{n}{2}}, the number of pairs. The numerator has {+1} for every pair {(u_i,u_j)} of items on which {X} and {Y} agree, and {-1} whenever one prefers {u_i} and the other prefers {u_j.} Hence the numerator is the number of agreements on pairs in the rankings minus the number of disagreements. Thus

  • {\tau(X,Y)} achieves its maximum of {+1} when {X = Y};

  • {\tau(X,Y)} achieves its minimum of {-1} when {Y} equals {X} reversed, so {\mu_0(y_i,y_j) = -\mu_0(x_i,x_j).}

These properties in fact hold for general {X,Y} and {\mu}: {\tau(X,Y)} is always a number between {-1} and {+1.} Kendall also gave this general form. Often {\tau} is reserved for the case with {\mu_0} where the denominator is fixed at {\binom{n}{2}} even when there are ties, while {\Gamma} is used for the general form above, but let us stick with {\tau.}
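As a quick check of these properties, here is a direct (quadratic-time) Python transcription of the general formula; the function names are mine, and {\mu_0} is taken as the sign of the value difference:

```python
import math

def gen_tau(x, y, mu):
    """Generalized Kendall tau of value vectors x, y under an
    antisymmetric pair function mu."""
    n = len(x)
    num = sum(mu(x[i], x[j]) * mu(y[i], y[j]) for i in range(n) for j in range(n))
    dx = sum(mu(x[i], x[j]) ** 2 for i in range(n) for j in range(n))
    dy = sum(mu(y[i], y[j]) ** 2 for i in range(n) for j in range(n))
    return num / math.sqrt(dx * dy)

sign = lambda a, b: (a > b) - (a < b)  # mu_0 applied to values

x = [1, 2, 3, 4]
print(gen_tau(x, x, sign))        # 1.0: identical rankings
print(gen_tau(x, x[::-1], sign))  # -1.0: one ranking reversed
```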

For {\mu_1} the Kendall tau simplifies neatly to the {\rho} correlation measure of Charles Spearman:

\displaystyle  \rho(X,Y) = 1 - \frac{6\sum_{i=1}^n (x_i - y_i)^2}{n(n^2 - 1)}.

Showing this equals {\sum_{i,j = 1}^n (x_i - x_j)(y_i - y_j)} divided by the square root of {\sum_{i,j=1}^n (x_i-x_j)^2} times {\sum_{i,j=1}^n (y_i-y_j)^2} is a nice exercise.
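The identity can at least be verified numerically; here is a Python sketch (names mine) comparing the closed form with the generalized tau under {\mu_1} on tie-less rank vectors:

```python
import math

def rho(x, y):
    """Spearman's rho via the closed form, for tie-less rank vectors."""
    n = len(x)
    return 1 - 6 * sum((a - b) ** 2 for a, b in zip(x, y)) / (n * (n * n - 1))

def tau_mu1(x, y):
    """Generalized tau with mu(a, b) = a - b."""
    n = len(x)
    num = sum((x[i] - x[j]) * (y[i] - y[j]) for i in range(n) for j in range(n))
    dx = sum((x[i] - x[j]) ** 2 for i in range(n) for j in range(n))
    dy = sum((y[i] - y[j]) ** 2 for i in range(n) for j in range(n))
    return num / math.sqrt(dx * dy)

x = [1, 2, 3, 4, 5]
y = [3, 1, 4, 5, 2]  # a permutation of the same ranks
print(abs(rho(x, y) - tau_mu1(x, y)) < 1e-12)  # True
```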

The above formula also shows that {\rho(X,Y)} is computable in linear time. The {\mu_0} case looks quadratic but is in {O(n\log n)} time by an algorithm discovered by William Knight in 1966. The algorithm first sorts pairs {(x_i,y_i)} by {x_i.} Then while merge-sorting them by {y_i} it can quickly count the number of {x_i} in pieces being merged that a {y_j} is greater than, and use these to count all pairs {(u_i,u_j)} on which {X,Y} agree during the sort. This suggests the question, echoed on page 1031 of this 2009 paper:

For which other {f,g,\mu} is {\tau(X,Y)} computable in {\tilde{O}(n)} time?
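For the original {\mu_0} case, Knight's merge-sort idea can be sketched in Python as follows (names mine); after sorting by {x}, the disagreeing pairs are exactly the inversions in the order of the {y}-values:

```python
def count_inversions(a):
    """Count inversions in a list via merge sort -- O(n log n)."""
    if len(a) <= 1:
        return a, 0
    mid = len(a) // 2
    left, inv_l = count_inversions(a[:mid])
    right, inv_r = count_inversions(a[mid:])
    merged, inv = [], inv_l + inv_r
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            # right[j] jumps ahead of all remaining left elements:
            # that is len(left) - i inversions at once.
            merged.append(right[j]); j += 1
            inv += len(left) - i
    merged += left[i:] + right[j:]
    return merged, inv

def fast_tau(x, y):
    """Kendall tau (mu_0, no ties): sort pairs by x, count inversions in y."""
    n = len(x)
    ys = [yv for _, yv in sorted(zip(x, y))]
    _, d = count_inversions(ys)       # d = number of disagreeing pairs
    pairs = n * (n - 1) // 2
    return (pairs - 2 * d) / pairs    # agreements minus disagreements, normalized

print(fast_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0
print(fast_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0
```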

What I find even more attractive about {\tau} are properties of scaling that can be tuned for other purposes by flexible choice of {\mu.}

Chess Engines and Search Engines

In a recent post I tabulated how the values and rankings of chess moves given by computer chess engines change as their depth of search increases, and the strong evident influence these changes have on human choices. We want to compare both the rankings and the values across depths. Noting the common generalization between Spearman’s {\rho} and the original {\tau}, we don’t have to make a black-and-white choice between them but instead ask:

What’s the best {\mu} function for this application?

First we note that both shift invariance and scale invariance can be conferred by certain choices of {\mu.} You want the former when {(x_1 + c,\dots,x_n + c)} should be considered the same as {(x_1,\dots,x_n)} for any constant {c.} This is automatic for rankings but is desired for ratings when only differences matter. Ipso facto, if {\mu(x_i,x_j)} depends only on {x_i - x_j} then the resulting {\tau} is shift-invariant.

Scale invariance comes in when a 0-to-100 scale is treated the same as a 0-to-10 scale. This is conferred when {\mu} is linear but not only thus. Consider

\displaystyle  \mu_2(x,x') = \frac{x - x'}{x^2 + x'^2} \qquad\text{or, say,}\qquad \mu_3(x,x') = \frac{1}{(x - x')^3}.

These {\mu} are pseudolinear in the sense that for every {c} there is a {c'} depending only on {c} such that for all {x,x'}, {\mu(cx,cx') = c'\mu(x,x').} For all such {\mu} the tau measure is invariant in both the {X} and {Y} scales separately: for all constants {c,d > 0},

\displaystyle  \tau(cX,dY) = \tau(X,Y).

This is nice because chess programs have their own scale factors. We can compare results from two engines on the same position without having to worry about normalizing these factors.
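This scale invariance is easy to confirm numerically. Here is a hedged Python sketch (names mine) that rescales the two value vectors by different constants under the pseudolinear {\mu_2}:

```python
import math

def tau_mu(x, y, mu):
    """Generalized Kendall tau of value vectors x, y under mu."""
    n = len(x)
    num = sum(mu(x[i], x[j]) * mu(y[i], y[j]) for i in range(n) for j in range(n))
    dx = sum(mu(x[i], x[j]) ** 2 for i in range(n) for j in range(n))
    dy = sum(mu(y[i], y[j]) ** 2 for i in range(n) for j in range(n))
    return num / math.sqrt(dx * dy)

def mu2(a, b):
    # Pseudolinear: mu2(c*a, c*b) = mu2(a, b) / c for c > 0.
    return (a - b) / (a * a + b * b)

x = [0.3, 1.1, 2.0, 4.5]
y = [0.1, 0.9, 2.5, 3.0]
t = tau_mu(x, y, mu2)
t_scaled = tau_mu([7 * v for v in x], [3 * v for v in y], mu2)
print(abs(t - t_scaled) < 1e-9)  # True: separate rescalings change nothing
```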

The most important desire, however, is common to the search-engine application: We want only the figures on good moves to matter. Here finally is where we cannot be satisfied with rankings and need ratings. If all but 3 moves in a position are bad, it doesn’t matter much which move an engine ranks 4th. Likewise, when only a few search hits matter we can ignore the rest—and ignore items ranked outside the first few dozen on both search engines anyway.

If we compare not the raw engine values but rather their differences from the optimal value, then high values are bad (moreover, this takes care of shift invariance before choosing {\mu}). A function like {\mu_2} makes {\mu(x_i,x_j)} small when either {x_i} or {x_j} is big—so any comparison with a bad move is dampened. This is OK if the engines are expected to agree on moves that are bad, and we want to emphasize their selectivity among the moves that they all recognize as reasonably good. For further purposes one might choose a higher power in the denominator or implement other considerations, such as how {\mu_3} magnifies close numerical agreements.
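A tiny Python illustration of the dampening, using hypothetical values expressed as differences from the optimal move:

```python
def mu2(a, b):
    return (a - b) / (a * a + b * b)

# Hypothetical engine values as differences from the optimal move:
# 0.1 and 0.4 are both reasonable moves, 5.0 is clearly bad.
good_pair = abs(mu2(0.1, 0.4))  # comparison between two reasonable moves
bad_pair = abs(mu2(0.1, 5.0))   # comparison involving the bad move
print(good_pair > bad_pair)     # True: pairs touching a bad move contribute less
```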

In my paper with Tamal Biswas we used our {\tau} measure to define the complexity of a chess position by how much the ratings of moves jump about at successive depths. From this we derive measures of the difficulty and discrimination of chess positions viewed as aptitude-test items.

Open Problems

We have given a flyover of issues and mathematical tools and problems involving rankings and ratings. We’ve noted some remarkable properties of the Kendall tau distance—there are more we could discuss in regard to independence and correlation—and have suggested some further ways to apply it. Can a more unified treatment of rankings and ratings help in progressing from the former to the latter?

Here is a wild question. Scale invariance effectively makes every vector a unit vector. Inner products of complex unit vectors range from {+1} to {-1}, though they take complex values in-“between.” Can Kendall’s {\tau} be related to an inner product or even be made to “work like” one? It is interpreted as a difference of probabilities, whereas in quantum mechanics the absolute values of certain inner products become square roots of probabilities.


