# Rankings Versus Ratings

*Handling both with the amazing generalized Kendall tau distance*

source |

Michelle Kwan was one of the last great figure skaters to compete under the historic “6.0” ranking system. She won the world championship five times under that system but was a squeaker second in the 1998 Winter Olympics and a more-distant third in 2002. An injury on the eve of the 2006 Winter Olympics prevented her from competing under the new and current system, which involves a complex numerical rating formula.

Today we discuss rankings versus ratings with an eye to complexity and managing them by common tools.

A *ranking* of items from a set of items is a list in non-ascending order of preference; it is complete if A *rating* of those items is a function A rating always induces a ranking—maybe allowing ties—but not the other way around. When a ranking has no ties we associate to it the ordinal function ; when there are ties we may average the ordinals or use some integer near the average. Given different rankings or ratings—of subsets of the items that may only partly overlap and perhaps span multiple categories—the object is to *aggregate* them into the single ranking or rating that best represents the sample.

The old 6.0 system used points on a 0.0–6.0 scale (increments of 0.1) in the two categories *technical merit* and *artistic presentation* to make a number up to 12.0 for each skater. Thenceforth the judge’s scores were treated as a ranking no matter how far ahead one skater was over the next. There were rules for avoiding and breaking ties, but in extreme cases like those in our post on Kenneth Arrow’s paradox ties could be left to stand.

The new rating system nearly always produces a total order upon simply adding the judges’ scores for each skater. After averaging, the highest score achieved in international competition to date for a man is 295.27 by Patrick Chan of Canada, and for a woman, 228.56 by Kim Yuna of Korea. It is not clear what these scores are “out of”—apparently not 300 and 250. They do not connote perfection the way “6.0” did. An appraisal by The Economist during last year’s Winter Olympics noted that the new system pushes skaters to technical extremes—while leaving

…little time either during routines or in training sessions for optional acrobatic or artistic showstoppers, like Michelle Kwan’s notorious spirals, in which she flashed a huge smile while speeding down the ice and audiences routinely jumped to their feet.

One aspect at the 2015 Applied Decision Theory conference, which we recently mentioned, that surprised me was the theme from several speakers that rankings and preferences are often more “human” than ratings, both for obtaining data and for representing it. So let us look more closely at ratings and rankings—and also preferences that may have cycles and so not yield rankings.

## Some Decision Data Structures

The first fact is obvious but central: a rating of items needs size only , where the absorbs -sized labels of items and log-sized numbers. Whereas a preference structure may involve specifying up to pairs even if it does define a total order. I’ve appreciated this difference when poring over tables of hundreds of ranked players before my fantasy baseball and football drafts. How do they decide to rank Peyton Manning ahead of Drew Brees, then Brees above Tom Brady—and really, Manning over Brady? The answer is they didn’t *think* any preferences; they ran a model to generate expected “fantasy league points” for each player and transcribed the total order from these ratings. (I do not know how much Brady’s expected points were deflated by his possible multi-game suspension.)

Preference structures are more combinatorial. When all pairs are compared they define a tournament. They also can be staggered as nets of *conditional preferences* (CP-nets). A CP-net allows saying to prefer over if the aggregate chooses over , but reverse the preference if is chosen. Such a CP-net might be expanded out to a preference structure on *sets* of items, but the CP-net form can be exponentially more succinct. The keynote for the first day of ADT 2015 was by Kristen Brent Venable on her paper with Cristina Cornelio, Umberto Grandi, Judy Goldsmith, Nick Mattei, and Francesca Rossi on a compact way to represent probabilistic uncertainty about preferences by distributions over CP-nets.

A ranking without ties is also a familiar combinatorial object: a permutation. Thus algebraic concepts from permutation groups are relevant to analyzing rankings. It seems less natural to apply them to ratings—valuations apply to elements of fields in connection with algebraic geometry, but permutations of those elements seem not to be involved. Preferences that don’t yield rankings can have algebraic structure in a different way: they might decompose into cycles. Acyclic preference structures have well-defined transitive closures and so inherit all the theory of partial orders and lattices. All this seems a lot of bother compared to using ratings, but one surprise alluded to by Venable and by Michel Regenwetter presenting the first paper of ADT 2015 is that it is much harder to get human respondents to give ratings than preferences, and the rating numbers are often unreliable when they do.

## Aggregation Problems

An aggregation problem can be specified by giving a distance measure between a ranking or rating and a set of such rankings/ratings. The set can simply be the set of being aggregated or might be derived from it. The problem in either case is given to find minimizing , or to decide whether the distance can be made lower than some given value In the simple case we can use a binary measure of *dissimilarity* and apply (e.g.) “least squares” to ask for minimizing

The complexity parameters and govern these problems. An example with high and low is determining one or more winners of an *election*. An example with low and high is comparing and aggregating results from a few *search engines*. A much-cited reference for the latter is the 2001 paper “Rank Aggregation methods for the Web” by Cynthia Dwork, Ravi Kumar, Moni Naor, and D. Sivakumar (who graduated from Buffalo under me in 1996).

It is not surprising that many of these aggregation problems are -hard, and remain hard even when or is a fixed small number like 3 or 4 (see e.g. these slides by David Williamson). What was surprising at ADT 2015 was that several speakers didn’t care so long as the problems could be transcribed efficiently into : “we can just use our -solvers on them.” Welcome to a world where may as well equal , but we care about the difference between and quadratic time, or quadratic versus cubic. With that in mind let’s consider those functions.

## The Generalized Kendall Tau Distance

Let stand for any function on pairs of items that is *antisymmetric*:

If we have a rating function , then there is the simple rating difference function

If we only have a ranking, then we can define

- , or

Note that these depend only on the function values , or in the case of rankings. We can sacrifice some generality for clarity by assuming this of Then we can represent two ratings or rankings and having respective functions by vectors of function values and This further allows us to pretend that “the same” function is being applied to pairs of values for each ranking, though per above there are really different functions and There might be some useful generality to gain by using separate and noting that no other information about and is used, but the form with values is nicest. Define:

The British statistician Maurice Kendall originally defined this for tie-less rankings and Then the denominator is just a sum of over all so it equals , the number of pairs. The numerator has for every pair of items on which and agree, and whenever one prefers and the other prefers Hence the numerator is the number of agreements on pairs in the rankings minus the number of disagreements. Thus

- achieves its maximum of when ;
- achieves its minimum of when equals reversed, so

These properties in fact hold for general and : is always a number between and Kendall also gave this general form. Often is reserved for the case with where the denominator is fixed at even when there are ties and is used as above, but let us stick with

For the Kendall tau simplifies neatly to the correlation measure of Charles Spearman:

Showing this equals divided by the square root of times is a nice exercise.

The above formula also shows that is computable in linear time. The case looks quadratic but is in time by an algorithm discovered by William Knight in 1966. The algorithm first sorts pairs by Then while *merge-sorting* them by it can quickly count the number of in pieces being merged that a is greater than, and use these to count all pairs on which agree during the sort. This suggests the question, echoed on page 1031 of this 2009 paper:

For which other is computable in time?

What I find even more attractive about are properties of *scaling* that can be tuned for other purposes by flexible choice of

## Chess Engines and Search Engines

In a recent post I tabulated how the values and rankings of chess moves given by computer chess *engines* change as their *depth* of search increases, and the strong evident influence these changes have on human choices. We want to compare both the rankings and the values across depths. Noting the common generalization between Spearman’s and the original , we don’t have to make a black-and-white choice between them but instead ask:

What’s the best function for this application?

First we note that both shift invariance and scale invariance can be conferred by certain choices of You want the former when should be considered the same as for any constant This is automatic for rankings but is desired for ratings when only differences matter. *Ipso-facto*, if depends only on then the resulting is shift-invariant.

Scale invariance comes in when a 0-to-100 scale is treated the same as a 0-to-10 scale. This is conferred when is linear but not only thus. Consider

These are *pseudolinear* in the sense that for every there is a depending only on such that for all , For all such the tau measure is invariant in both the and scales separately: for all constants ,

This is nice because chess programs have their own scale factors. We can compare results from two engines on the same position without having to worry about normalizing these factors.

The most important desire, however, is common to the search-engine application: *We want only the figures on good moves to matter*. Here finally is where we cannot be satisfied with rankings and need ratings. If all but 3 moves in a position are bad, it doesn’t matter much which move an engine ranks 4th. Likewise, when only a few search hits matter we can ignore the rest—and ignore items ranked outside the first few dozen on both search engines anyway.

If we compare not the raw engine values but rather their differences from the optimal value, then high values are bad (moreover, this takes care of shift invariance before choosing ). A function like makes small when either or is big—so any comparison with a bad move is dampened. This is OK if the engines are expected to agree on moves that are bad, and we want to emphasize their selectivity among the moves that they all recognize as reasonably good. For further purposes one might choose a higher power in the denominator or implement other considerations, such as how magnifies close numerical agreements.

In my paper with Tamal Biswas we used our measure to define the *complexity* of a chess positions by how much the ratings of moves jump about at successive depths. From this we derive measures of *difficulty* and *discrimination* of chess positions viewed as aptitude-test items.

## Open Problems

We have given a flyover of issues and mathematical tools and problems involving rankings and ratings. We’ve noted some remarkable properties of the Kendall tau distance—there are more we could discuss in regard to independence and correlation—and have suggested some further ways to apply it. Can a more unified treatment of rankings and ratings help in progressing from the former to the latter?

Here is a wild question. Scale invariance effectively makes every vector a unit vector. Inner products of complex unit vectors range from to , though they take complex values in-“between.” Can Kendall’s be related to an inner product or even be made to “work like” one? It is interpreted as a difference of probabilities, whereas in quantum mechanics the absolute values of certain inner products become square roots of probabilities.

[slight word changes in last main section]

Is there some correction you’d need if your values $f(/mu)$ were noisy–for example, you’re doing a statistical sample to determine your rankings? Or do you simply default doing some sort of R^2 or chi^2 like fit?

This blog has been lately drifting into too much of algebraic details. Who would be interested in reading such articles?