I don’t mean to harp on this one, but I’m still seeing a lot of potentially misleading statements out there concerning “power laws” and “long tails.” One of the most prolific writers on this topic is Clay Shirky, so I’ll use a recent comment of his as an example (sorry to pick on Clay, but I guess pioneers take the arrows!).
In a post to the del.icio.us mailing list, Clay responds to someone seeking an idea of what the average number of bookmarks is per user. The question is asked in light of another, which is: are most bookmarks made by a small core of heavy users?
Clay responds that:
The deceptive thing about systems like this are that the average is meaningless, as the distribution is not a bell curve. There are, yes, a small core of highly active users, but the decision about who goes in that core is totally random, since the distribution of links per user is roughly a power law.
Both del and spurl get most of their activity from a small group of highly active users, but bigger systems benefit from both higher heads *and* longer tails.
I think that several parts of this exchange are representative of statements that really need some clarification to avoid potential confusion:
(1) The histogram of users ranked by number of bookmarks made (“the distribution of links per user”) is a ranked graph, which by definition is *always* decreasing (or at least never increasing), and can *never* be a bell curve.
(2) I don’t know about the actual del.icio.us data, but it’s true that many Internet-related ranked histograms seem to fit a power law. However, many others do not, despite exhibiting a “long tail”; they may better fit a negative logarithm, inverse exponential, or more complicated function.
(3) The question of whether the average number of bookmarks per user is meaningful is best decided by considering a completely different, non-ranked distribution, namely the number of users having a given number of bookmarks. This is the PDF corresponding to the ranked graph, and *can* be a bell curve with a meaningful average.
(4) While it is true that “bigger systems benefit from both higher heads *and* longer tails,” this usually just makes the histogram fit the curve better; it is the shape of the curve that determines whether or not “most activity is from a small group of highly active users.”
Here I want to expand on the second and third points. In a previous post, I showed that if the ranked data fits a power law, then the corresponding PDF is also a power law. Once this is understood, it is clearer why Clay is correct in saying that if the histogram of users ranked by bookmarks follows a power law, then the histogram of number of users per number of bookmarks also does, and therefore it’s true that an average is pretty much meaningless.
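As a rough reminder of why this holds, here is a sketch of the derivation. The symbols are assumed here for illustration: u is the rank, b the number of bookmarks, and C and a are positive constants of the fitted power law.

```latex
b(u) = C\,u^{-a}
\quad\Longrightarrow\quad
u(b) = \left(\frac{b}{C}\right)^{-1/a}
\quad\Longrightarrow\quad
p(b) \;\propto\; -\frac{du}{db} = \frac{1}{aC}\left(\frac{b}{C}\right)^{-\left(1 + \frac{1}{a}\right)}
```

Here u(b) counts the users with at least b bookmarks, so its negative slope is proportional to the PDF, which is again a power law. For a = 1 the exponent is 2, and the mean of such a PDF keeps growing as the upper cutoff is raised, which is one way to see why the average tells you so little.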
However, if the ranked data has a “long tail” but doesn’t really fit a power law, the corresponding PDF *can* have a meaningful average; indeed, it can be an exact bell curve! So while a ranked histogram that fits a power law implies a meaningless average, a ranked histogram that just exhibits a “long tail” does not, and that’s why it’s better to look at the PDF when trying to answer this question.
To show this, we can analyze the above example by following the logic of my previous post in reverse. So, let’s assume that the distribution of the number of bookmarks made by users fits a perfect bell curve. This means that the PDF is a Gaussian or normal distribution:
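For reference, the Gaussian PDF can be written as below. The notation is assumed here for illustration: b is the number of bookmarks, μ the mean number of bookmarks per user, and σ the standard deviation.

```latex
p(b) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(b-\mu)^2}{2\sigma^2}\right)
```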
This is the ideal case where an average is *most* meaningful and informative. Now, this means that we can integrate to form the corresponding CDF (strictly speaking, the complementary CDF), showing the percentage of users who have made b or more bookmarks:
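In symbols, and with notation assumed here for illustration (μ for the mean, σ for the standard deviation, p for the Gaussian PDF), the fraction of users with b or more bookmarks is the upper-tail integral of the PDF:

```latex
F(b) = \int_{b}^{\infty} p(x)\,dx
     = \frac{1}{2}\,\operatorname{erfc}\!\left(\frac{b-\mu}{\sigma\sqrt{2}}\right)
```

This falls from 1 (everyone has at least very few bookmarks) to 0 (nobody has arbitrarily many), passing through 1/2 at b = μ.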
Finally, we can then invert this to get the ranked graph of users in order of the number of bookmarks made:
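The inversion can be sketched numerically with Python's standard library. This is only an illustration under assumed values: the mean of 100 bookmarks, the standard deviation of 20, the 1000 users, and the function name are all made up for the example.

```python
from statistics import NormalDist


def gaussian_ranked_graph(mean, stdev, n_users):
    """Bookmark counts by rank (rank 1 = heaviest user) for an ideal Gaussian PDF."""
    dist = NormalDist(mean, stdev)
    ranked = []
    for rank in range(1, n_users + 1):
        # Fraction of users at or above this rank; the half-step offset
        # keeps the value strictly between 0 and 1.
        top_fraction = (rank - 0.5) / n_users
        # Invert the "b or more bookmarks" curve: find the count that is
        # reached or exceeded by exactly this fraction of users.
        ranked.append(dist.inv_cdf(1.0 - top_fraction))
    return ranked


# Assumed, illustrative parameters.
ranked = gaussian_ranked_graph(mean=100.0, stdev=20.0, n_users=1000)
print(ranked[0], ranked[499], ranked[-1])
```

Even though the underlying PDF is a bell curve, the resulting list is strictly decreasing, with a steep drop among the first few ranks, a long flat middle, and a steep drop at the very end.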
Now, I’d certainly agree that most actual histograms I’ve seen on tagging data and the like fit a power law much better than the above graph. But if you were faced with data that looked like the above, especially with a rescaled y axis, you might think something like “oh, it’s just the long tail of a power law but the top guys are lower,” and not realize that this changes the situation significantly. In particular, the underlying PDF might not even be a decreasing function.
The main point is this:
If you consider ranked histograms, it’s easy to see power laws everywhere.
But in many cases it may be that fitting another curve would be more informative, or that the ranked graph is not the right one to be looking at in the first place.
Returning to Clay’s post, the other original question that was asked was: Are most bookmarks made by a small core of heavy users? This can be answered by looking at the ranked histogram: if the median line dividing the area under the curve in half is far to the left, then most bookmarks are made by the top few users.
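A minimal sketch of that median test, assuming the counts are already sorted from heaviest user to lightest. The function name and both data sets are hypothetical, made up purely for illustration.

```python
def users_for_half(ranked_counts):
    """Fraction of top-ranked users who together made half of all bookmarks.

    Assumes ranked_counts is sorted in descending order.
    """
    total = sum(ranked_counts)
    running = 0
    for rank, count in enumerate(ranked_counts, start=1):
        running += count
        # Stop at the first rank where the running total reaches half.
        if 2 * running >= total:
            return rank / len(ranked_counts)
    return 1.0


# Hypothetical data: 1000 users whose counts fall off roughly as 1000/rank.
power_law = [1000 // rank for rank in range(1, 1001)]
# Hypothetical data: 1000 users who all made the same number of bookmarks.
uniform = [50] * 1000

print(users_for_half(power_law), users_for_half(uniform))
```

A small result means the median line sits far to the left, that is, a small core of heavy users accounts for half of all bookmarks.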
To get an even clearer answer to this question, it seems to me that the obvious thing to do would be to integrate the ranked graph, resulting in a curve from which you could easily read that the top u% of users were responsible for b% of all bookmarks made. For example, if the ranked histogram fit an exact power law b = 1/u, then the integrated graph would be a logarithm:
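Written out, and assuming for illustration that ranks start at u = 1 (the integral of 1/u diverges if started at 0) and run up to some largest rank N:

```latex
B(u) = \int_{1}^{u} \frac{dx}{x} = \ln u,
\qquad
\frac{B(u)}{B(N)} = \frac{\ln u}{\ln N}
```

The second expression is the share of all bookmarks made by the top u users, under these assumptions.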
If we consider a ranked histogram that fits a higher order power law b = 1/u^2, the integrated graph shows an even higher dominance by top users:
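Under the same kind of assumption (ranks starting at u = 1, chosen here only for illustration), the integral works out as:

```latex
B(u) = \int_{1}^{u} \frac{dx}{x^{2}} = 1 - \frac{1}{u}
```

This saturates very quickly: by u = 10 the curve has already reached 90% of its limiting value of 1.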
In contrast, integrating the ranked graph resulting from a Gaussian PDF gives:
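To compare the three cases side by side, here is a rough numerical sketch using discrete sums in place of integrals. The 1000 users and the Gaussian parameters (mean 100, standard deviation 20) are assumptions picked for illustration, as are all the names.

```python
from itertools import accumulate
from statistics import NormalDist


def cumulative_share(ranked_counts):
    """Running fraction of all bookmarks made by the top 1, 2, ..., n users."""
    total = sum(ranked_counts)
    return [running / total for running in accumulate(ranked_counts)]


n = 1000
ranks = range(1, n + 1)

# Ranked graphs for the three cases (all values are illustrative).
inverse = [1.0 / u for u in ranks]            # b = 1/u
inverse_square = [1.0 / u**2 for u in ranks]  # b = 1/u^2
bell = NormalDist(100.0, 20.0)                # assumed Gaussian PDF
gaussian = [bell.inv_cdf(1.0 - (u - 0.5) / n) for u in ranks]

# Share of all bookmarks made by the top 10% of users in each case.
top = n // 10 - 1
share_of_top_tenth = {
    "1/u": cumulative_share(inverse)[top],
    "1/u^2": cumulative_share(inverse_square)[top],
    "gaussian": cumulative_share(gaussian)[top],
}
for name, share in share_of_top_tenth.items():
    print(name, round(share, 3))
```

The exact numbers depend on the assumed parameters, but the ordering is the point: the steeper the power law, the larger the share taken by the top users, while in the Gaussian case the top 10% of users contribute only modestly more than 10% of the bookmarks.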
Here we can easily see that the heavy users do not have much of an outsized influence. I guess the purpose of all this, besides getting it straight in my own head, is to underscore the importance of this fact: the histogram you want to consider, and the usefulness of the curve that you fit to the data, depend very much upon the question you want to answer.