In a conversation related to my previous postings on power laws, a question came up: If a ranked distribution follows a power law, what percentage of the total is in the highest ranked bin? So for the example of a histogram of users ranked by the % of taggings, what percentage M of all taggings are made by the very top user?
It turns out that this depends on whether the power law is an exact inverse (Zipf: a = 1) power law or a higher order power law.
The top user u = 1 has a share M of all taggings, so the curve is t = Mu^-a. Each bar measures the share of taggings by that user, expressed as a fraction (so 100% = 1), which means the sum of all bars has to equal 1. So for N users we have
M + M/(2^a) + M/(3^a) + … + M/(N^a) = 1
M = 1/(1 + 1/(2^a) + 1/(3^a) + … + 1/(N^a)).
For a Zipf law with a = 1, the denominator is the harmonic series, which diverges; so that means the % of taggings by the top user drops as the number of users N gets larger. We can calculate M by remembering that the sum of the first N terms of the harmonic series approaches gamma + ln(N) as N gets large, where gamma is the Euler-Mascheroni constant (about 0.5772) and ln is the natural log. We can check that this is close enough after N = 100, so calculating N = 10 by hand and using this formula for the rest we have:
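For anyone who wants to reproduce these numbers without a spreadsheet, here is a minimal Python sketch of the calculation. The function names are mine and purely illustrative; the exact version just sums the series from the formula for M, and the approximate version uses gamma + ln(N) in the denominator.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def top_share_exact(n_users: int, a: float = 1.0) -> float:
    """Share M of the top-ranked user: 1 / (1 + 1/2^a + ... + 1/N^a)."""
    return 1.0 / sum(u ** -a for u in range(1, n_users + 1))


def top_share_zipf_approx(n_users: int) -> float:
    """Approximate M for a = 1, replacing the harmonic sum by gamma + ln(N)."""
    return 1.0 / (EULER_GAMMA + math.log(n_users))


# Compare the exact sum with the approximation as the system grows.
for n in (10, 100, 1_000, 10_000, 100_000):
    exact = top_share_exact(n)
    approx = top_share_zipf_approx(n)
    print(f"N={n:>7}  exact M={exact:.4f}  approx M={approx:.4f}")
```

With a = 1 the top user's share is roughly 0.34 for N = 10 and roughly 0.19 for N = 100, and it keeps shrinking as N grows, though only logarithmically.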
Gotta love NumSum. But if a > 1, the series in the denominator converges, so that as the number of users N increases, the % of taggings by the top user M quickly settles to a constant:
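The same check can be sketched in Python for a > 1. I am using a = 2 as the example because the infinite sum 1 + 1/4 + 1/9 + … is known to equal pi^2/6, so the limiting value of M should be 6/pi^2. That closed form is a standard result I am bringing in for comparison, not something derived above, and the function name is again just illustrative.

```python
import math


def top_share(n_users: int, a: float) -> float:
    """Share M of the top-ranked user: 1 / (1 + 1/2^a + ... + 1/N^a)."""
    return 1.0 / sum(u ** -a for u in range(1, n_users + 1))


# Limiting share for a = 2, assuming the known sum 1 + 1/4 + 1/9 + ... = pi^2 / 6.
limit_a2 = 6.0 / math.pi ** 2

for n in (10, 100, 1_000, 10_000):
    m = top_share(n, 2.0)
    print(f"N={n:>6}  M={m:.5f}  distance from limit={m - limit_a2:.5f}")
```

For a = 2 the share starts a little above the limit and closes in on it quickly, which is the "settles to a constant" behavior, in contrast to the slow but unbounded decline in the Zipf case.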
This is all in follow-up to the fourth point from this post:
(4) While it is true that “bigger systems benefit from both higher heads *and* longer tails,” in general this usually just makes the histogram fit the curve better; it is rather the shape of the curve that determines whether or not “most activity is from a small group of highly active users.”
A Zipf law is a case where a bigger system actually has a distinct effect: the bigger the system, the lower the percentage resident in the highest ranked bin, resulting in a lower percentage of activity from the most active users. In the case of higher power laws, this percentage quickly settles to a steady constant, so size doesn’t have much of an effect once the system is reasonably big.
As an aside, I was also asked to post the graph presented at TagCamp showing a histogram that fits a “long tail” but not a power law, so here it is:
Although this looks similar to a power law, if we disregard the top two users the histogram actually fits the curve that corresponds to a perfect bell curve PDF. This means that in contrast to a power law, where the average number of taggings per user is essentially meaningless, here the average is maximally meaningful.