Rankings Versus Ratings
Handling both with the amazing generalized Kendall tau distance
Michelle Kwan was one of the last great figure skaters to compete under the historic “6.0” ranking system. She won the world championship five times under that system but was a squeaker second in the 1998 Winter Olympics and a more-distant third in 2002. An injury on the eve of the 2006 Winter Olympics prevented her from competing under the new and current system, which involves a complex numerical rating formula.
Today we discuss rankings versus ratings with an eye to their complexity and to managing them with common tools.
A ranking of items from a set $U$ of $n$ items is a list of some of them in non-ascending order of preference; it is complete if it lists all of $U$. A rating of those items is a function $r: U \rightarrow \mathbb{R}$. A rating always induces a ranking—maybe allowing ties—but not the other way around. When a ranking has no ties we associate to it the ordinal function $b(i)$ giving the position of item $i$; when there are ties we may average the ordinals or use some integer near the average. Given $m$ different rankings or ratings—of subsets of the items that may only partly overlap and perhaps span multiple categories—the object is to aggregate them into the single ranking or rating that best represents the sample.
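To make the induced ranking concrete, here is a small sketch (the function name and the averaging convention for ties are my own choices) that turns a rating into ranks, giving tied items the average of the ordinals they span:

```python
def rating_to_ranks(ratings):
    """Map item -> rating to item -> rank (1 = best), averaging ranks over ties."""
    # Sort items by rating, best (highest) first.
    ordered = sorted(ratings, key=ratings.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Find the block of items tied at this rating value.
        j = i
        while j < len(ordered) and ratings[ordered[j]] == ratings[ordered[i]]:
            j += 1
        # Ordinals i+1 .. j occupy this block; assign their average to each item.
        avg = (i + 1 + j) / 2
        for item in ordered[i:j]:
            ranks[item] = avg
        i = j
    return ranks

print(rating_to_ranks({"A": 9.5, "B": 8.0, "C": 8.0, "D": 7.1}))
# A gets rank 1; B and C tie for ordinals 2 and 3, so both get 2.5; D gets 4.
```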
The old 6.0 system used points on a 0.0–6.0 scale (increments of 0.1) in the two categories technical merit and artistic presentation to make a number up to 12.0 for each skater. Thereafter each judge’s scores were treated as a ranking no matter how far ahead one skater was of the next. There were rules for avoiding and breaking ties, but in extreme cases like those in our post on Kenneth Arrow’s paradox ties could be left to stand.
The new rating system nearly always produces a total order upon simply adding the judges’ scores for each skater. After averaging, the highest score achieved in international competition to date for a man is 295.27 by Patrick Chan of Canada, and for a woman, 228.56 by Kim Yuna of Korea. It is not clear what these scores are “out of”—apparently not 300 and 250. They do not connote perfection the way “6.0” did. An appraisal by The Economist during last year’s Winter Olympics noted that the new system pushes skaters to technical extremes—while leaving
…little time either during routines or in training sessions for optional acrobatic or artistic showstoppers, like Michelle Kwan’s notorious spirals, in which she flashed a huge smile while speeding down the ice and audiences routinely jumped to their feet.
One aspect of the 2015 Algorithmic Decision Theory (ADT 2015) conference, which we recently mentioned, that surprised me was the theme from several speakers that rankings and preferences are often more “human” than ratings, both for obtaining data and for representing it. So let us look more closely at ratings and rankings—and also preferences that may have cycles and so not yield rankings.
Some Decision Data Structures
The first fact is obvious but central: a rating of $n$ items needs size only $\tilde{O}(n)$, where the tilde absorbs $O(\log n)$-sized labels of items and log-sized numbers. A preference structure, by contrast, may involve specifying up to $\binom{n}{2}$ pairs even if it does define a total order. I’ve appreciated this difference when poring over tables of hundreds of ranked players before my fantasy baseball and football drafts. How do they decide to rank Peyton Manning ahead of Drew Brees, then Brees above Tom Brady—and really, Manning over Brady? The answer is they didn’t think through any pairwise preferences; they ran a model to generate expected “fantasy league points” for each player and transcribed the total order from these ratings. (I do not know how much Brady’s expected points were deflated by his possible multi-game suspension.)
Preference structures are more combinatorial. When all $\binom{n}{2}$ pairs are compared they define a tournament. They also can be staggered as nets of conditional preferences (CP-nets). A CP-net allows saying to prefer $a$ over $b$ if the aggregate chooses $c$ over $d$, but to reverse the preference if $d$ is chosen. Such a CP-net might be expanded out to a preference structure on sets of items, but the CP-net form can be exponentially more succinct. The keynote for the first day of ADT 2015 was by Kristen Brent Venable on her paper with Cristina Cornelio, Umberto Grandi, Judy Goldsmith, Nick Mattei, and Francesca Rossi on a compact way to represent probabilistic uncertainty about preferences by distributions over CP-nets.
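A toy illustration of the conditional flavor (the representation below is my own simplification, not the one in the cited paper): a CP-net stores, for each variable, a preference order that may flip with the choices already made for its parents.

```python
# A tiny CP-net over two variables: main dish and wine.
# The wine preference is conditioned on the dish chosen.
cpnet = {
    "dish": {(): ["fish", "meat"]},            # unconditional: prefer fish
    "wine": {("fish",): ["white", "red"],      # given fish, prefer white
             ("meat",): ["red", "white"]},     # given meat, the preference flips
}

def preferred(var, context=()):
    """Most-preferred value of `var` given the tuple of parent choices."""
    return cpnet[var][context][0]

print(preferred("dish"))             # fish
print(preferred("wine", ("fish",)))  # white
print(preferred("wine", ("meat",)))  # red
```

Expanding this net into an explicit preference relation over all dish-wine combinations already doubles the table; with many conditioned variables the blowup is exponential, which is the succinctness point above.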
A ranking without ties is also a familiar combinatorial object: a permutation. Thus algebraic concepts from permutation groups are relevant to analyzing rankings. It seems less natural to apply them to ratings—valuations apply to elements of fields in connection with algebraic geometry, but permutations of those elements seem not to be involved. Preferences that don’t yield rankings can have algebraic structure in a different way: they might decompose into cycles. Acyclic preference structures have well-defined transitive closures and so inherit all the theory of partial orders and lattices. All this seems a lot of bother compared to using ratings, but one surprise, alluded to by Venable and by Michel Regenwetter in presenting the first paper of ADT 2015, is that it is much harder to get human respondents to give ratings than preferences, and the rating numbers are often unreliable when they do.
An aggregation problem can be specified by giving a distance measure $d(x, S)$ between a ranking or rating $x$ and a set $S$ of such rankings/ratings. The set $S$ can simply be the set of $m$ rankings or ratings being aggregated or might be derived from it. The problem in either case is, given $S$, to find $x$ minimizing $d(x, S)$, or to decide whether the distance can be made lower than some given value $\epsilon$. In the simple case we can use a binary measure $d(x, y)$ of dissimilarity and apply (e.g.) “least squares” to ask for the $x$ minimizing $\sum_{y \in S} d(x, y)^2$.
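For tiny instances this least-squares aggregation can be brute-forced over all candidate rankings. A sketch (my own illustration), using the count of pairwise disagreements as the dissimilarity $d$:

```python
from itertools import permutations

def kendall_disagreements(x, y):
    """Number of item pairs on which rank vectors x and y disagree."""
    n = len(x)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if (x[i] - x[j]) * (y[i] - y[j]) < 0)

def aggregate(sample):
    """Rank vector minimizing the sum of squared distances to the sample.

    Brute force over all n! candidates, so only for very small n."""
    n = len(sample[0])
    return min(permutations(range(1, n + 1)),
               key=lambda x: sum(kendall_disagreements(x, y) ** 2 for y in sample))

sample = [(1, 2, 3), (1, 2, 3), (2, 1, 3)]
print(aggregate(sample))  # (1, 2, 3): total squared distance 0 + 0 + 1 = 1
```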
The complexity parameters $n$ and $m$ govern these problems. An example with high $m$ and low $n$ is determining one or more winners of an election. An example with low $m$ and high $n$ is comparing and aggregating results from a few search engines. A much-cited reference for the latter is the 2001 paper “Rank Aggregation Methods for the Web” by Cynthia Dwork, Ravi Kumar, Moni Naor, and D. Sivakumar (who graduated from Buffalo under me in 1996).
It is not surprising that many of these aggregation problems are $\mathsf{NP}$-hard, and remain hard even when $m$ or $n$ is a fixed small number like 3 or 4 (see e.g. these slides by David Williamson). What was surprising at ADT 2015 was that several speakers didn’t care so long as the problems could be transcribed efficiently into $\mathsf{SAT}$: “we can just use our $\mathsf{SAT}$-solvers on them.” Welcome to a world where $\mathsf{P}$ may as well equal $\mathsf{NP}$, but we care about the difference between $n \log n$ and quadratic time, or quadratic versus cubic. With that in mind let’s consider those distance functions.
The Generalized Kendall Tau Distance
Let $f(i, j)$ stand for any function on pairs of items that is antisymmetric: $f(j, i) = -f(i, j)$.
If we have a rating function $r$, then there is the simple rating difference function $f(i, j) = r(i) - r(j)$.
If we only have a ranking, then we can define
- $f(i, j) = b(j) - b(i)$, or
- $f(i, j) = \mathrm{sgn}(b(j) - b(i))$.
Note that these depend only on the function values $r(i), r(j)$, or $b(i), b(j)$ in the case of rankings. We can sacrifice some generality for clarity by assuming this of $f$. Then we can represent two ratings or rankings $R$ and $S$ having respective functions $r$ and $s$ by vectors of function values $x = (r(1), \dots, r(n))$ and $y = (s(1), \dots, s(n))$. This further allows us to pretend that “the same” function $f$ is being applied to pairs of values for each ranking, though per above there are really different functions $f$ and $g$. There might be some useful generality to gain by using separate $f$ and $g$ and noting that no other information about $R$ and $S$ is used, but the form with values is nicest. Define:

$$\tau(x, y) = \frac{\sum_{i < j} f(x_i, x_j)\, f(y_i, y_j)}{\sqrt{\sum_{i < j} f(x_i, x_j)^2 \cdot \sum_{i < j} f(y_i, y_j)^2}}.$$
The British statistician Maurice Kendall originally defined this for tie-less rankings and $f(u, v) = \mathrm{sgn}(u - v)$. Then the denominator is just a sum of $1$ over all pairs $i < j$, so it equals $\binom{n}{2} = \frac{n(n-1)}{2}$, the number of pairs. The numerator has $+1$ for every pair of items on which $x$ and $y$ agree, and $-1$ whenever one prefers $i$ and the other prefers $j$. Hence the numerator is the number of agreements on pairs in the rankings minus the number of disagreements. Thus
- $\tau$ achieves its maximum of $+1$ when $x$ and $y$ agree on every pair;
- $\tau$ achieves its minimum of $-1$ when $y$ equals $x$ reversed, so that they disagree on every pair.
These properties in fact hold for general $f$ and general $x, y$: $\tau(x, y)$ is always a number between $-1$ and $+1$. Kendall also gave this general form. Often $\tau_a$ is reserved for the case with $f = \mathrm{sgn}$ where the denominator is fixed at $\binom{n}{2}$ even when there are ties and $\tau_b$ is used as above, but let us stick with $\tau$.
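The generalized formula is direct to compute for small $n$. A minimal sketch (function names are my own) with a pluggable antisymmetric $f$, using the sign function to recover Kendall’s original tau:

```python
from math import sqrt

def gen_tau(x, y, f):
    """Generalized Kendall tau of value vectors x, y under antisymmetric f."""
    pairs = [(i, j) for i in range(len(x)) for j in range(i + 1, len(x))]
    num = sum(f(x[i], x[j]) * f(y[i], y[j]) for i, j in pairs)
    den = sqrt(sum(f(x[i], x[j]) ** 2 for i, j in pairs) *
               sum(f(y[i], y[j]) ** 2 for i, j in pairs))
    return num / den

sgn = lambda u, v: (u > v) - (u < v)              # Kendall's original choice of f
print(gen_tau([1, 2, 3, 4], [1, 2, 3, 4], sgn))   # 1.0 on full agreement
print(gen_tau([1, 2, 3, 4], [4, 3, 2, 1], sgn))   # -1.0 on full reversal
```

The pair loop is quadratic, which is fine at this scale; faster computation is taken up below.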
For $f(u, v) = u - v$ the Kendall tau simplifies neatly to the correlation measure of Charles Spearman, which for tie-free rank vectors reads:

$$\rho(x, y) = 1 - \frac{6 \sum_i (x_i - y_i)^2}{n(n^2 - 1)}.$$
Showing this equals $\sum_{i<j} (x_i - x_j)(y_i - y_j)$ divided by the square root of $\sum_{i<j} (x_i - x_j)^2$ times $\sum_{i<j} (y_i - y_j)^2$ is a nice exercise.
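The exercise can at least be checked numerically: with $f(u, v) = u - v$ the generalized tau is Pearson correlation applied to the rank vectors, which for tie-free ranks matches Spearman’s formula. A quick sketch:

```python
from math import sqrt

def tau_diff(x, y):
    """Generalized tau with f(u, v) = u - v."""
    pairs = [(i, j) for i in range(len(x)) for j in range(i + 1, len(x))]
    num = sum((x[i] - x[j]) * (y[i] - y[j]) for i, j in pairs)
    den = sqrt(sum((x[i] - x[j]) ** 2 for i, j in pairs) *
               sum((y[i] - y[j]) ** 2 for i, j in pairs))
    return num / den

def spearman(x, y):
    """Spearman's rho for tie-free rank vectors: 1 - 6*sum d_i^2 / (n(n^2-1))."""
    n = len(x)
    return 1 - 6 * sum((a - b) ** 2 for a, b in zip(x, y)) / (n * (n * n - 1))

x, y = (1, 2, 3, 4, 5), (2, 1, 4, 3, 5)
print(abs(tau_diff(x, y) - spearman(x, y)) < 1e-12)  # True: both give 0.8 here
```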
The above formula also shows that $\rho$ is computable in linear time. The case $f = \mathrm{sgn}$ looks quadratic but is in $O(n \log n)$ time by an algorithm discovered by William Knight in 1966. The algorithm first sorts the pairs $(x_i, y_i)$ by $x_i$. Then while merge-sorting them by $y_i$ it can quickly count the number of $y_j$ in the pieces being merged that a given $y_i$ is greater than, and use these counts to count all pairs on which $x$ and $y$ agree during the sort. This suggests the question, echoed on page 1031 of this 2009 paper:
For which other $f$ is $\tau(x, y)$ computable in $O(n \log n)$ time?
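Knight’s idea can be sketched as a merge sort that counts inversions, i.e. the discordant pairs (a minimal version assuming tie-free data; function names are my own):

```python
def count_inversions(a):
    """Return (sorted copy of a, number of inversions) via merge sort."""
    if len(a) <= 1:
        return list(a), 0
    mid = len(a) // 2
    left, inv_l = count_inversions(a[:mid])
    right, inv_r = count_inversions(a[mid:])
    merged, inv = [], inv_l + inv_r
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            # right[j] jumps ahead of all remaining left elements:
            # that is len(left) - i inversions at once.
            merged.append(right[j]); j += 1
            inv += len(left) - i
    merged += left[i:] + right[j:]
    return merged, inv

def fast_kendall(x, y):
    """Kendall tau for tie-free sequences in O(n log n)."""
    n = len(x)
    # Sort by x; each inversion in the induced order of y is one discordant pair.
    ys = [b for _, b in sorted(zip(x, y))]
    _, discordant = count_inversions(ys)
    total = n * (n - 1) // 2
    return (total - 2 * discordant) / total

print(fast_kendall([1, 2, 3, 4], [1, 3, 2, 4]))  # one discordant pair of six: 2/3
```

Since concordant minus discordant equals total minus twice discordant, the single inversion count suffices.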
What I find even more attractive about $\tau$ are properties of scaling that can be tuned for other purposes by flexible choice of $f$.
Chess Engines and Search Engines
In a recent post I tabulated how the values and rankings of chess moves given by computer chess engines change as their depth of search increases, and the strong evident influence these changes have on human choices. We want to compare both the rankings and the values across depths. Noting the common generalization between Spearman’s $\rho$ and the original $\tau$, we don’t have to make a black-and-white choice between them but instead ask:
What’s the best function $f$ for this application?
First we note that both shift invariance and scale invariance can be conferred by certain choices of $f$. You want the former when $r'(i) = r(i) + c$ should be considered the same as $r$ for any constant $c$. This is automatic for rankings but is desired for ratings when only differences matter. Ipso facto, if $f(u, v)$ depends only on $u - v$ then the resulting $\tau$ is shift-invariant.
Scale invariance comes in when a 0-to-100 scale is treated the same as a 0-to-10 scale. This is conferred when $f$ is linear but not only thus. Consider

$$f(u, v) = \frac{u - v}{u + v} \qquad\text{or}\qquad f(u, v) = \frac{u - v}{\max(u, v)}.$$
These are pseudolinear in the sense that for every $c$ there is a $c'$ depending only on $c$ such that for all $u, v$, $f(cu, cv) = c' f(u, v)$. For all such $f$ the tau measure is invariant in both the $x$ and $y$ scales separately: for all constants $c, d > 0$,

$$\tau(cx, dy) = \tau(x, y).$$
This is nice because chess programs have their own scale factors. We can compare results from two engines on the same position without having to worry about normalizing these factors.
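This separate-scale invariance is easy to test numerically. A minimal sketch (the engine values here are invented for illustration) using the pseudolinear choice $f(u, v) = (u - v)/(u + v)$ on positive values:

```python
from math import sqrt

def gen_tau(x, y, f):
    """Generalized Kendall tau of value vectors x, y under antisymmetric f."""
    pairs = [(i, j) for i in range(len(x)) for j in range(i + 1, len(x))]
    num = sum(f(x[i], x[j]) * f(y[i], y[j]) for i, j in pairs)
    den = sqrt(sum(f(x[i], x[j]) ** 2 for i, j in pairs) *
               sum(f(y[i], y[j]) ** 2 for i, j in pairs))
    return num / den

f = lambda u, v: (u - v) / (u + v)   # pseudolinear: f(cu, cv) = f(u, v) for c > 0

x = [3.0, 1.5, 0.8, 0.2]             # hypothetical values from one engine
y = [2.8, 1.9, 0.5, 0.3]             # same position on another engine's scale
t = gen_tau(x, y, f)
t_scaled = gen_tau([10 * u for u in x], [3 * v for v in y], f)
print(abs(t - t_scaled) < 1e-9)      # True: rescaling either side leaves tau alone
```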
The most important desideratum, however, is common to the search-engine application: we want only the figures on good moves to matter. Here finally is where we cannot be satisfied with rankings and need ratings. If all but 3 moves in a position are bad, it doesn’t matter much which move an engine ranks 4th. Likewise, when only a few search hits matter we can ignore the rest—and ignore items ranked outside the first few dozen on both search engines anyway.
If we compare not the raw engine values but rather their differences from the optimal value, then high values are bad (moreover, this takes care of shift invariance before choosing $f$). A function like

$$f(u, v) = \frac{u - v}{(1 + u)^2 (1 + v)^2}$$

makes $f$ small when either $u$ or $v$ is big—so any comparison with a bad move is dampened. This is OK if the engines are expected to agree on moves that are bad, and we want to emphasize their selectivity among the moves that they all recognize as reasonably good. For further purposes one might choose a higher power in the denominator or implement other considerations, such as how $f$ magnifies close numerical agreements.
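A small demonstration of the dampening idea (the specific function and numbers are my own illustration): with values measured as deltas from the optimal move, comparisons involving bad moves contribute little.

```python
# Deltas from the optimal move's value: 0 = best move, large = bad move.
# Hypothetical dampening function: antisymmetric, and small when either delta is big.
f = lambda u, v: (u - v) / ((1 + u) ** 2 * (1 + v) ** 2)

print(abs(f(0.0, 0.5)))   # two good moves: sizeable contribution
print(abs(f(0.0, 9.0)))   # good vs. terrible: dampened
print(abs(f(8.0, 9.0)))   # two bad moves: nearly zero
```

Raising the powers in the denominator pushes the second and third contributions down further, sharpening the focus on the moves every engine considers reasonable.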
In my paper with Tamal Biswas we used our $\tau$ measure to define the complexity of a chess position by how much the ratings of moves jump about at successive depths. From this we derive measures of difficulty and discrimination of chess positions viewed as aptitude-test items.
We have given a flyover of issues and mathematical tools and problems involving rankings and ratings. We’ve noted some remarkable properties of the Kendall tau distance—there are more we could discuss in regard to independence and correlation—and have suggested some further ways to apply it. Can a more unified treatment of rankings and ratings help in progressing from the former to the latter?
Here is a wild question. Scale invariance effectively makes every vector a unit vector. Inner products of complex unit vectors range from $-1$ to $+1$, though they take complex values in-“between.” Can Kendall’s $\tau$ be related to an inner product or even be made to “work like” one? It is interpreted as a difference of probabilities, whereas in quantum mechanics the absolute values of certain inner products become square roots of probabilities.