Rankings: never a bell curve, not always a power law
I don’t mean to harp on this one, but I’m still seeing a lot of potentially misleading statements out there concerning “power laws” and “long tails.” One of the most prolific writers on this topic is Clay Shirky, so I’ll use a recent comment of his as an example (sorry to pick on Clay, but I guess pioneers take the arrows!).
In a post to the del.icio.us mailing list, Clay responds to someone asking what the average number of bookmarks per user is. The question is asked in light of another: are most bookmarks made by a small core of heavy users?
Clay responds that:
The deceptive thing about systems like this are that the average is meaningless, as the distribution is not a bell curve. There are, yes, a small core of highly active users, but the decision about who goes in that core is totally random, since the distribution of links per user is roughly a power law.
Both del and spurl get most of their activity from a small group of highly active users, but bigger systems benefit from both higher heads *and* longer tails.
I think that several parts of this exchange are representative of statements that really need some clarification to avoid potential confusion:
(1) The histogram of users ranked by number of bookmarks made (“the distribution of links per user”) is a ranked graph, which by definition is *always* non-increasing, and can *never* be a bell curve.
(2) I don’t know about the actual del.icio.us data, but it’s true that many Internet-related ranked histograms seem to fit a power law. However, many others do not, despite exhibiting a “long tail”; they may better fit a negative logarithm, inverse exponential, or more complicated function.
(3) The question of whether the average number of bookmarks per user is meaningful is best decided by considering a completely different, non-ranked distribution, namely the number of users having a given number of bookmarks. This is the PDF corresponding to the ranked graph, and *can* be a bell curve with a meaningful average.
(4) While it is true that “bigger systems benefit from both higher heads *and* longer tails,” this usually just makes the histogram fit the curve better; it is rather the shape of the curve that determines whether or not “most activity is from a small group of highly active users.”
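To make the distinction between the two histograms in points (1) and (3) concrete, here is a minimal sketch. The bookmark counts are made up for illustration and are not del.icio.us data; the function names are hypothetical.

```python
from collections import Counter

# Made-up bookmark counts for 11 hypothetical users (not real del.icio.us data).
bookmarks_per_user = [3, 5, 4, 5, 6, 4, 5, 7, 5, 4, 6]

def ranked_graph(counts):
    """Users sorted from most to fewest bookmarks: the ranked histogram."""
    return sorted(counts, reverse=True)

def users_per_count(counts):
    """Number of users having each bookmark count: the non-ranked histogram."""
    return dict(sorted(Counter(counts).items()))

ranked = ranked_graph(bookmarks_per_user)
histogram = users_per_count(bookmarks_per_user)
mean = sum(bookmarks_per_user) / len(bookmarks_per_user)

print(ranked)          # [7, 6, 6, 5, 5, 5, 5, 4, 4, 4, 3]
print(histogram)       # {3: 1, 4: 3, 5: 4, 6: 2, 7: 1}
print(round(mean, 2))  # 4.91
```

The ranked list can only go down, whatever the data, while the second histogram is free to peak in the middle, as it does here around the mean.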
Here I want to expand on the second and third points. In a previous post, I showed that if the ranked data fits a power law, then the corresponding PDF is also a power law. Once this is understood, it is clearer why Clay is correct in saying that if the histogram of users ranked by bookmarks follows a power law, then the histogram of number of users per number of bookmarks also does, and therefore it’s true that an average is pretty much meaningless.
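As a brief sketch of that argument, with symbols that are my own choice here (r for rank, b for number of bookmarks, and constants C and a): suppose the ranked graph is a power law. The user at rank r is exactly the one for whom r users have b or more bookmarks, so inverting gives the count of users at or above b, and differentiating gives the density.

$$
b(r) = C\,r^{-a}
\quad\Longrightarrow\quad
r(b) = C^{1/a}\,b^{-1/a}
\quad\Longrightarrow\quad
n(b) = -\frac{dr}{db} = \frac{C^{1/a}}{a}\,b^{-\left(1 + \frac{1}{a}\right)}
$$

The density n(b) is again a power law. Its mean involves the integral of b·n(b), which is proportional to b^(-1/a); for a ≥ 1 that integral keeps growing as the upper cutoff increases, which is one way to read the statement that the average is meaningless.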
However, if the ranked data has a “long tail” but doesn’t really fit a power law, the corresponding PDF *can* have a meaningful average; indeed, it can be an exact bell curve! So while a ranked histogram that fits a power law implies a meaningless average, a ranked histogram that just exhibits a “long tail” does not, and that’s why it’s better to look at the PDF when trying to answer this question.
To show this, we can analyze the above example by following the logic of my previous post in reverse. So, let’s assume that the distribution of the number of bookmarks made by users fits a perfect bell curve. This means that the PDF is a Gaussian or normal distribution:
![Gaussian: y = (1/sqrt(pi))Exp[-(4(x - 0.5))^2]](http://www.econometa.com/wp-images/post-images/rank-power-gaussian.gif)
This is the ideal case where an average is *most* meaningful and informative. Now, this means that we can integrate to form the corresponding CDF (here in its complementary form, which counts from the top down), showing the fraction of users who have made b or more bookmarks:
![Erfc (the complementary error function): y = (1/2)(Erfc[(4(x - 0.5))])](http://www.econometa.com/wp-images/post-images/rank-power-erfc.gif)
Finally, we can then invert this to get the ranked graph of users in order of the number of bookmarks made:
![Inverse Erfc: y = (1/4)InverseErfc[2x] + 0.5](http://www.econometa.com/wp-images/post-images/rank-power-InvErfc.gif)
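The three steps above (PDF, then CDF, then ranked graph) can be sketched numerically with Python's standard library. This is only an illustration under assumptions: the function names are hypothetical, and I use the normalized Gaussian with mean 0.5 and the width implied by the erfc formula, so its peak value is 4/sqrt(pi) rather than the prefactor shown in the first figure's caption.

```python
import math
from statistics import NormalDist

# exp(-(4(x - 0.5))^2) corresponds to a standard deviation of 1 / (4 * sqrt(2)).
SIGMA = 1 / (4 * math.sqrt(2))
bookmarks = NormalDist(mu=0.5, sigma=SIGMA)

def pdf(b):
    """Density of users at bookmark level b (normalized bell curve)."""
    return bookmarks.pdf(b)

def fraction_with_at_least(b):
    """Complementary CDF: fraction of users with b or more bookmarks."""
    return 0.5 * math.erfc(4 * (b - 0.5))

def ranked(u):
    """Ranked graph: bookmark level of the user at rank fraction u (0 < u < 1)."""
    # Inverting the complementary CDF is the same as inverting the CDF at 1 - u.
    return bookmarks.inv_cdf(1.0 - u)

for u in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(f"rank fraction {u:.2f} -> bookmark level {ranked(u):.3f}")
```

The printed levels fall steadily with rank and pass through 0.5 at the median user, which is the ranked curve in the last figure.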
Now, I’d certainly agree that most actual histograms I’ve seen on tagging data, etc. fit a power law much better than the above graph. But if you were faced with data that looked like the above, especially with a rescaled y axis, you might think something like “oh, it’s just the long tail of a power law but the top guys are lower,” and not realize that this changes the situation significantly. In particular, the underlying PDF might not even be a decreasing function.
The main point is this:
If you consider ranked histograms, it’s easy to see power laws everywhere.
But in many cases it may be that fitting another curve would be more informative, or that the ranked graph is not the right one to be looking at in the first place.
Returning to Clay’s post, the other original question that was asked was: Are most bookmarks made by a small core of heavy users? This can be answered by looking at the ranked histogram: if the median line dividing the area under the curve in half is far to the left, then most bookmarks are made by the top few users.
To get an even clearer answer to this question, it seems to me that the obvious thing to do would be to integrate the ranked graph, resulting in a curve from which you could easily read that the top u% of users were responsible for b% of all bookmarks made. For example, if the ranked histogram fit an exact power law b = 1/u, then the integrated graph would be a logarithm:
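As a rough worked version of this case, assume ranks run from u = 1 up to some total N. Both the lower limit and N are my assumptions, needed because 1/u cannot be integrated starting from 0.

$$
B(u) = \int_1^u \frac{dt}{t} = \ln u,
\qquad
\frac{B(u)}{B(N)} = \frac{\ln u}{\ln N}
$$

In this continuous approximation, with a hypothetical N of one million users, the top 1,000 would account for ln(1000)/ln(1000000) = 1/2 of all bookmarks.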

If we consider a ranked histogram that fits a higher order power law b = 1/u^2, the integrated graph shows an even higher dominance by top users:
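Worked out under the same kind of assumption (ranks running from u = 1 to a total N, both my own choices for illustration), the integral is:

$$
B(u) = \int_1^u \frac{dt}{t^2} = 1 - \frac{1}{u},
\qquad
\frac{B(u)}{B(N)} = \frac{1 - 1/u}{1 - 1/N}
$$

In this continuous approximation the ranks up to u = 10 already account for at least 90% of the total, whatever N is.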

In contrast, integrating the ranked graph resulting from a Gaussian PDF gives:
![Integral of Inverse Erfc: y = IntegralFrom0tox((1/4)InverseErfc[2t] + 0.5)dt](http://www.econometa.com/wp-images/post-images/rank-power-IntInvErfc.gif)
Here we can easily see that the heavy users do not have much of an outsized influence. I guess the purpose of all this, besides getting it straight in my own head, is to underscore the importance of this fact: the histogram you want to consider, and the usefulness of the curve that you fit to the data, depend very much upon the question you want to answer.
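The “top u% of users made b% of bookmarks” comparison can also be sketched numerically. Everything here is illustrative: the 1,000 users, the helper name `top_share`, and the three synthetic ranked lists are assumptions, and the Gaussian values are clipped at zero since a bookmark count cannot be negative.

```python
import math
from statistics import NormalDist

def top_share(ranked_counts, top_fraction):
    """Fraction of all bookmarks made by the top `top_fraction` of users."""
    counts = sorted(ranked_counts, reverse=True)
    k = max(1, round(top_fraction * len(counts)))
    return sum(counts[:k]) / sum(counts)

n_users = 1000  # hypothetical population size

# Synthetic ranked data for the three cases discussed in the post.
power_1 = [1.0 / r for r in range(1, n_users + 1)]       # b = 1/u
power_2 = [1.0 / r ** 2 for r in range(1, n_users + 1)]  # b = 1/u^2
bell = NormalDist(mu=0.5, sigma=1 / (4 * math.sqrt(2)))
gaussian = [
    max(0.0, bell.inv_cdf(1 - (r - 0.5) / n_users))  # ranked graph of a bell curve
    for r in range(1, n_users + 1)
]

for name, data in (("1/u", power_1), ("1/u^2", power_2), ("gaussian", gaussian)):
    print(f"{name:>8}: top 10% of users made {top_share(data, 0.10):.0%} of bookmarks")
```

With these synthetic lists the top decile dominates in both power-law cases but holds only a modest share in the Gaussian case, in line with the integrated graphs.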
October 30th, 2005 at 12:42 pm
Dear poster,
Please revise on the general shape of a CDF before posting graphs. It will make your point palatable beyond the statistically illiterate.
Cheers,
Student
October 31st, 2005 at 11:47 am
Hi Student,
I’m not sure what you mean about the “general shape” of a CDF. Maybe you mean that a CDF often gives the probability of any outcome less than or equal to x, while I’m showing the probability of any outcome *greater* than or equal to x? If so, this is not an unusual variant, as I mentioned at the very end of this post; for example, a Pareto distribution is this latter type of CDF.
Adam
August 21st, 2006 at 4:25 pm
[…] Power laws and log-normal distributions are two-parameter distributions. The distributional form that best characterizes traffic to all pages may have many more than two parameters. Compared to log-normal distributions and other distributions with more than two parameters, an approximating line has greater value for providing a simple, intuitive description of an important part of the popularity distribution. […]
July 17th, 2007 at 11:47 am
I found this discussion very helpful, especially in conjunction with reading Taleb’s new book.
The histogram that you derived from a Gaussian PDF actually does look like many of the real histograms, with a sharp ending marking the end of the shelf, lack of any new entries, etc.
Interesting that the difference between a Gaussian PDF, the related histogram, and histograms with power law PDFs may just depend upon how sharply the tail comes to a conclusion, or at least that is my conclusion after reading this article.
July 19th, 2007 at 4:02 pm
Hi Michael,
Thanks for the comment! You’re right, how sharply the tail ends and how peaked it is at the head can completely change the meaning of a ranked histogram. That’s why they can be misleading; it’s safer to consider the associated PDF instead, if the question you want to answer has to do with averages, etc.