Turning rankings into distributions

OK, I finally cleared up what was bothering me about those long tail graphs, prompted by Phil’s comment and helped a lot by this article.

The issue is that long tail graphs have an x axis comprised of items ranked by their y axis value; e.g. for a social bookmarking site we can graph users ranked by how many taggings they’ve performed, and approximate it by an inverse power law:

Taggraph (ranked)

One nice thing about a ranked graph is that the “area” under the curve is equal to the total value associated with the items spanned on the ranked axis; e.g. for the above graph, the area to the left of the median line represents 50% of the total number of taggings.

However, it seems as if both axes here are carrying essentially the same information (as Phil said, “that bar’s the tallest *and* it’s the first!”). But it turns out that the ranking and the value are in fact separate pieces of information, and a power law shape to the data carries over to more familiar distributions. As I go through how that works, I’ll add some notes on terminology, since it can get a bit confusing.

Firstly, a ranked graph like the one above that follows an exact 1/u inverse power law (i.e. a = 1) is often said to adhere to “Zipf’s Law.” This law originally described how the frequency of use of the nth-most-frequently-used word in any natural language is approximately inversely proportional to n. Subsequently the term Zipf has been used to refer to ranked data that can be fit to several different kinds of curves, but one thing’s for sure: a ranked graph is by definition always decreasing, and is not a probability distribution! (although as we’ll see below, it’s closely related to one).

The key to transforming a ranked graph into a more familiar distribution is this fact: saying that the nth ranked item has a value of y is equivalent to saying that n items have a value of y or more. So for the above graph, saying that the u-th user performed t taggings is equivalent to saying that u users performed t or more taggings:

Taggraph (t or more)

Now if we invert this graph and turn the number of users performing t or more taggings into the percentage of such users, we arrive at a probability distribution, namely the probability that a randomly selected user performed t or more taggings:

Taggraph (CDF)

This kind of distribution is called a Cumulative Distribution Function (CDF), and if it follows an inverse power law it is also sometimes referred to as a “Pareto distribution.” A Pareto distribution originally described the percentage of people owning more than x amount of wealth, and led to what is sometimes called the “Pareto principle” or the “80-20 rule”, e.g. that 20% of the population owns 80% of the wealth (note that 80 + 20 = 100 is a coincidence here, another common confusion!).

Finally, we can note that the number of users performing t or more taggings minus the number performing t+1 or more taggings is the number of users performing exactly t taggings, i.e. N(t) - N(t+1) = n(t) = -N’(t); here n(t) is the number of users performing t taggings, and is equal to the negative derivative of the number of users N(t) performing t or more taggings:

Taggraph (PDF)

This is a familiar Probability Density Function (PDF), and is the more common setting to describing a distribution as fitting a “power law.” Note that if the original ranked graph follows an inverse power law of any kind, so does the corresponding CDF and PDF. The converse is not true, i.e. if a PDF follows an inverse power law, the other graphs only do so if the shape parameter b = 1 + 1/a is greater than 1. Otherwise, the integral of t^-b with b < = 1 does not give a power law.

Of course, after all this, the fact is that at least for me, the original ranked graph is the one that most directly and intuitively answers the question at hand: are most taggings by the most active users? I'd be interested in comments and thoughts, or other graphs that aren't covered above.

** Addendum: Some technical notes, mostly to refresh my own memory

- For a discrete set of outcomes, the function corresponding to non-zero probability values for each outcome is called a Probability Mass Function (PMF).
- A PMF is also sometimes called a Probability Function (PF).
- A histogram (AKA bar chart, step function) can be obtained from a PMF by binning the outcomes into ranges (usually equal or exponential) in a continuous sample space (set of outcomes).
- For a continuous sample space, a Probability Density Function (PDF) gives the probability of an outcome in a specified range by integrating the function over this range. A PMF or its histogram can be approximated by a PDF.
- For a PDF, the probability of any specific outcome is 0, non-zero probabilities only exist by integrating the PDF over an interval of outcomes.
- The "area" under a PMF or PDF must be 1; the area is the sum of all values for a PMF and the integral over the entire sample space for a PDF.
- A power law can only approximate a PDF and must be cut off at its extremeties, since its integral is divergent and so cannot integrate to 1.
- A Cumulative Distribution Function (CDF) F(x) is usually defined in terms of a PDF f(x) by F(b) - F(a) = integral from a to b of f(x), i.e. f(x) = F'(x). This means F(x) is the probability of any outcome less than or equal to x.
- A CDF is also sometimes called a Distribution Function (DF).
- Sometimes a CDF is defined with a different inequality; e.g. a Pareto distribution is a CDF F(x) which is the probability of any outcome *greater* than or equal to x, so that f(x) = -F'(x).
- A Pareto distribution is characterized by a shape parameter a and a minimum value parameter m, and takes the form F(x) = a m^a / x^(a+1) for x >= m, 0 otherwise.

8 Responses to “Turning rankings into distributions”

  1. Phil Says:

    You make a good case for the conventional ‘long tail’ graph. But:

    saying that the u-th user performed t taggings is equivalent to saying that u users performed t or more taggings … if we invert this graph and turn the number of users performing t or more taggings into the percentage of such users, we arrive at a probability distribution

    This only really works if you can assume unique ‘u’s - and by implication unique ‘t’s. I’ve been looking at some real figures on inbound links. If you order them by ranking, they do show something like the familiar downward curve (5389, 3849, 3309, 3211, 3183, 2542…) But here are ‘rankings’ 90-104:
    816, 816, 812, 811, 792, 789, 786, 779, 779, 778, 778, 767, 758, 753, 753
    And so it goes on. Further out, of course, there are many more ‘duplicates’; over 2,000 of them in the case of the lowest non-zero value (one link, in other words). The simple ‘long tail’ histogram could represent these readily enough - as a descending mountain range with a lot of plateaux - but the position of any ‘column’ within one of those plateaux would be meaningless. In effect the X axis would carry information at some points but not others.

    Or am I missing something?

  2. Adam Says:

    Well, I think it’s a good point that the detailed ordering of users with equal values is indeed arbitrary, and that this ordering will be lost when the x axis is transformed into “# users performing t or more taggings.” But the only difference under this transformation will be that such users will be “missing” in the second histogram.

    For example, if there are 5 ranked users who performed 3, 2, 2, 2, and 1 taggings, this corresponds to 1 user who performed 3 or more taggings, 4 who performed 2 or more, and 5 who performed 1 or more. So the histogram {3, 2, 2, 2, 1} turns into the histogram {3, 0, 0, 2, 1}. Thus, the power law curve that is fit to the data would be the same.

    So, as far as I can tell, nothing depends on unique ‘t’s (or unique ‘u’s, which must be unique to put them on the x axis to begin with — although I think you must have been referring to each u having a unique t value). Thanks much for the comment, and please pass on any further thoughts; the transformation issue above was a surprise, I’m sure there’s other issues here…

  3. Phil Says:

    I think my point about ‘plateaux’ was that if you ‘clean’ a ranked graph by plotting “number of samples with value greater than n” on the X axis, you lose the ‘area’ effect you referred to initially - which is precisely the effect that’s got ‘long tail’ advocates so excited. Graphing without the duplicates would give you a series of widely-spaced fenceposts (X=9124, Y=10; 9591,9; 10095,8; 10642,7…) rather than a ‘long tail’ of dwindling plateaux.

    Inverting the graph, on the other hand - mea culpa for missing this yesterday - gives you a long tail with zero on the left; the long tail there actually represents the ’spike’ of heavy taggers, while the spike on the left is the ‘long tail’ of low users. This is in line with the suggestion I made, in rather more naive terms, back here. I’ve since got hold of some numbers & will post further when I get round to it.

  4. Adam Says:

    Right! Well, wait a sec.

    When we first do the inversion to get the CDF graph, the spike on the left isn’t really low users, it’s *all* users, since the y axis represents how many users have performed t or more taggings, and of course all users have performed 0 or more taggings.

    But after we take the derivative to get the final PDF graph, then the spike on the left does indeed represent the large number of low-tagging users, and the “long tail” is then the few users with many taggings. However, the “area” is not meaningful, as it is with the original ranked graph.

    Another interesting thing to note is that since we take the derivative, the tail becomes “less long” in the PDF. For example, if the original ranked histogram fits a t ~ 1/u curve (the way most long tails are illustrated), then the final PDF graph is n ~ 1/t^2, which is “taller and shorter tailed” than the original graph. I guess I should have showed this in the figure…

  5. None of you stand so tall « The gaping silence Says:

    […] This may seem like a minor nitpick, but it’s actually very important. Back to Adam: […]

  6. Who took the money? « The gaping silence Says:

    […] area; the fact that Pietro also invokes the Long Tail (which, as you’ll recall, is not what it seems) makes it all the more compelling (to me at […]

  7. Joe Says:

    The cumulative distribution function shown here was confusing at first… typically, a CDF is the probability that a user performed t or fewer taggings, rather than more. See http://en.wikipedia.org/wiki/Cumulative_distribution_function, for instance.

  8. admin Says:

    @Joe: True enough, but a prominent example of a CDF giving a probability >= t is the Pareto distribution. You’re probably right, it should be called out in the article, but at least I did mention it in the Addendum:

    “A Cumulative Distribution Function (CDF) is the probability of any outcome less than or equal to x. Sometimes a CDF is defined with a different inequality; e.g. a Pareto distribution is a CDF F(x) which is the probability of any outcome *greater* than or equal to x.”

Leave a Reply

Moderation is on, so your comment may not show up right away.