<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress/2.2.1" -->
<rss version="2.0" 
	xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
	<title>Comments on: Turning rankings into distributions</title>
	<link>http://www.econometa.com/archives/15</link>
	<description>The economy of stuff about stuff</description>
	<pubDate>Wed, 07 Jan 2009 02:40:13 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.2.1</generator>

	<item>
		<title>By: admin</title>
		<link>http://www.econometa.com/archives/15#comment-15965</link>
		<author>admin</author>
		<pubDate>Mon, 30 Jun 2008 15:41:19 +0000</pubDate>
		<guid>http://www.econometa.com/archives/15#comment-15965</guid>
		<description>@Joe: True enough, but a prominent example of a CDF giving a probability &gt;= t is the Pareto distribution. You're probably right, it should be called out in the article, but at least I did mention it in the Addendum:

"A Cumulative Distribution Function (CDF) is the probability of any outcome less than or equal to x. Sometimes a CDF is defined with a different inequality; e.g. a Pareto distribution is a CDF F(x) which is the probability of any outcome *greater* than or equal to x."</description>
		<content:encoded><![CDATA[<p>@Joe: True enough, but a prominent example of a CDF giving a probability >= t is the Pareto distribution. You&#8217;re probably right, it should be called out in the article, but at least I did mention it in the Addendum:</p>
<p>&#8220;A Cumulative Distribution Function (CDF) is the probability of any outcome less than or equal to x. Sometimes a CDF is defined with a different inequality; e.g. a Pareto distribution is a CDF F(x) which is the probability of any outcome *greater* than or equal to x.&#8221;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Joe</title>
		<link>http://www.econometa.com/archives/15#comment-15964</link>
		<author>Joe</author>
		<pubDate>Wed, 25 Jun 2008 17:19:09 +0000</pubDate>
		<guid>http://www.econometa.com/archives/15#comment-15964</guid>
		<description>The cumulative distribution function shown here was confusing at first... typically, a CDF is the probability that a user performed t or fewer taggings, rather than more.  See http://en.wikipedia.org/wiki/Cumulative_distribution_function, for instance.</description>
		<content:encoded><![CDATA[<p>The cumulative distribution function shown here was confusing at first&#8230; typically, a CDF is the probability that a user performed t or fewer taggings, rather than more.  See <a href="http://en.wikipedia.org/wiki/Cumulative_distribution_function" rel="nofollow">http://en.wikipedia.org/wiki/Cumulative_distribution_function</a>, for instance.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Who took the money? &#171; The gaping silence</title>
		<link>http://www.econometa.com/archives/15#comment-15871</link>
		<author>Who took the money? &#171; The gaping silence</author>
		<pubDate>Fri, 16 Nov 2007 17:07:08 +0000</pubDate>
		<guid>http://www.econometa.com/archives/15#comment-15871</guid>
		<description>[...] area; the fact that Pietro also invokes the Long Tail (which, as you&#8217;ll recall, is not what it seems) makes it all the more compelling (to me at [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] area; the fact that Pietro also invokes the Long Tail (which, as you&#8217;ll recall, is not what it seems) makes it all the more compelling (to me at [&#8230;]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: None of you stand so tall &#171; The gaping silence</title>
		<link>http://www.econometa.com/archives/15#comment-15867</link>
		<author>None of you stand so tall &#171; The gaping silence</author>
		<pubDate>Wed, 14 Nov 2007 21:54:04 +0000</pubDate>
		<guid>http://www.econometa.com/archives/15#comment-15867</guid>
		<description>[...] This may seem like a minor nitpick, but it&#8217;s actually very important. Back to Adam: [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] This may seem like a minor nitpick, but it&#8217;s actually very important. Back to Adam: [&#8230;]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Adam</title>
		<link>http://www.econometa.com/archives/15#comment-18</link>
		<author>Adam</author>
		<pubDate>Wed, 15 Jun 2005 15:31:32 +0000</pubDate>
		<guid>http://www.econometa.com/archives/15#comment-18</guid>
		<description>Right! Well, wait a sec. 

When we first do the inversion to get the CDF graph, the spike on the left isn't really low users, it's *all* users, since the y axis represents how many users have performed t or more taggings, and of course all users have performed 0 or more taggings. 

But after we take the derivative to get the final PDF graph, then the spike on the left does indeed represent the large number of low-tagging users, and the "long tail" is then the few users with many taggings. However, the "area" is not meaningful, as it is with the original ranked graph.

Another interesting thing to note is that since we take the derivative, the tail becomes "less long" in the PDF. For example, if the original ranked histogram fits a t ~ 1/u curve (the way most long tails are illustrated), then the final PDF graph is n ~ 1/t^2, which is "taller and shorter tailed" than the original graph. I guess I should have shown this in the figure...</description>
		<content:encoded><![CDATA[<p>Right! Well, wait a sec. </p>
<p>When we first do the inversion to get the CDF graph, the spike on the left isn&#8217;t really low users, it&#8217;s *all* users, since the y axis represents how many users have performed t or more taggings, and of course all users have performed 0 or more taggings. </p>
<p>But after we take the derivative to get the final PDF graph, then the spike on the left does indeed represent the large number of low-tagging users, and the &#8220;long tail&#8221; is then the few users with many taggings. However, the &#8220;area&#8221; is not meaningful, as it is with the original ranked graph.</p>
<p>Another interesting thing to note is that since we take the derivative, the tail becomes &#8220;less long&#8221; in the PDF. For example, if the original ranked histogram fits a t ~ 1/u curve (the way most long tails are illustrated), then the final PDF graph is n ~ 1/t^2, which is &#8220;taller and shorter tailed&#8221; than the original graph. I guess I should have shown this in the figure&#8230;</p>
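<p>A short derivation of the 1/t^2 claim above, as a sketch only: it assumes the ranked curve is exactly t = C/u for some constant C (the constant is an assumption, not something stated in the comment) and treats u and t as continuous. The minus sign appears because the number of users with t or more taggings falls as t grows.</p>

```latex
t = \frac{C}{u}
\;\Longrightarrow\;
u(t) = \frac{C}{t}
\quad\text{(number of users with $t$ or more taggings)},
\qquad
n(t) = -\frac{\mathrm{d}u}{\mathrm{d}t} = \frac{C}{t^{2}}.
```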
]]></content:encoded>
	</item>
	<item>
		<title>By: Phil</title>
		<link>http://www.econometa.com/archives/15#comment-17</link>
		<author>Phil</author>
		<pubDate>Wed, 15 Jun 2005 10:06:46 +0000</pubDate>
		<guid>http://www.econometa.com/archives/15#comment-17</guid>
		<description>I think my point about 'plateaux' was that if you 'clean' a ranked graph by plotting "number of samples with value greater than n" on the X axis, you lose the 'area' effect you referred to initially - which is precisely the effect that's got 'long tail' advocates so excited. Graphing without the duplicates would give you a series of widely-spaced fenceposts (X=9124, Y=10; 9591,9; 10095,8; 10642,7...) rather than a 'long tail' of dwindling plateaux.

Inverting the graph, on the other hand - mea culpa for missing this yesterday - gives you a long tail &lt;b&gt;with zero on the left&lt;/b&gt;; the long tail there actually represents the 'spike' of heavy taggers, while the spike on the left is the 'long tail' of low users. This is in line with the suggestion I made, in rather more naive terms, back &lt;a href="http://phenomenologic.blogspot.com/2005/05/when-is-spike-not-spike.html" rel="nofollow"&gt;here&lt;/a&gt;. I've since got hold of some numbers &#38; will post further when I get round to it.</description>
		<content:encoded><![CDATA[<p>I think my point about &#8216;plateaux&#8217; was that if you &#8216;clean&#8217; a ranked graph by plotting &#8220;number of samples with value greater than n&#8221; on the X axis, you lose the &#8216;area&#8217; effect you referred to initially - which is precisely the effect that&#8217;s got &#8216;long tail&#8217; advocates so excited. Graphing without the duplicates would give you a series of widely-spaced fenceposts (X=9124, Y=10; 9591,9; 10095,8; 10642,7&#8230;) rather than a &#8216;long tail&#8217; of dwindling plateaux.</p>
<p>Inverting the graph, on the other hand - mea culpa for missing this yesterday - gives you a long tail <b>with zero on the left</b>; the long tail there actually represents the &#8216;spike&#8217; of heavy taggers, while the spike on the left is the &#8216;long tail&#8217; of low users. This is in line with the suggestion I made, in rather more naive terms, back <a href="http://phenomenologic.blogspot.com/2005/05/when-is-spike-not-spike.html" rel="nofollow">here</a>. I&#8217;ve since got hold of some numbers &amp; will post further when I get round to it.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Adam</title>
		<link>http://www.econometa.com/archives/15#comment-16</link>
		<author>Adam</author>
		<pubDate>Tue, 14 Jun 2005 16:21:10 +0000</pubDate>
		<guid>http://www.econometa.com/archives/15#comment-16</guid>
		<description>Well, I think it's a good point that the detailed ordering of users with equal values is indeed arbitrary, and that this ordering will be lost when the x axis is transformed into  "# users performing t or more taggings." But the only difference under this transformation will be that such users will be "missing" in the second histogram. 

For example, if there are 5 ranked users who performed 3, 2, 2, 2, and 1 taggings, this corresponds to 1 user who performed 3 or more taggings, 4 who performed 2 or more, and 5 who performed 1 or more. So the histogram {3, 2, 2, 2, 1} turns into the histogram {3, 0, 0, 2, 1}. Thus, the power law curve that is fit to the data would be the same. 

So, as far as I can tell, nothing depends on unique ‘t’s (or unique ‘u’s, which must be unique to put them on the x axis to begin with -- although I think you must have been referring to each u having a unique t value). Thanks much for the comment, and please pass on any further thoughts; the transformation issue above was a surprise, I'm sure there's other issues here...</description>
		<content:encoded><![CDATA[<p>Well, I think it&#8217;s a good point that the detailed ordering of users with equal values is indeed arbitrary, and that this ordering will be lost when the x axis is transformed into  &#8220;# users performing t or more taggings.&#8221; But the only difference under this transformation will be that such users will be &#8220;missing&#8221; in the second histogram. </p>
<p>For example, if there are 5 ranked users who performed 3, 2, 2, 2, and 1 taggings, this corresponds to 1 user who performed 3 or more taggings, 4 who performed 2 or more, and 5 who performed 1 or more. So the histogram {3, 2, 2, 2, 1} turns into the histogram {3, 0, 0, 2, 1}. Thus, the power law curve that is fit to the data would be the same. </p>
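<p>The five-user example above can be checked with a few lines of Python. This is a minimal sketch, not part of the original article: the function names rank_to_at_least_counts and sparse_histogram are hypothetical, and the input is assumed to be a list of per-user tagging counts.</p>

```python
from collections import Counter


def rank_to_at_least_counts(ranked):
    """Map each distinct value t to the number of users with t or more taggings."""
    counts = Counter(ranked)
    result = {}
    running = 0
    # Walk the distinct values from largest to smallest, accumulating users.
    for t in sorted(counts, reverse=True):
        running += counts[t]
        result[t] = running
    return result


def sparse_histogram(ranked):
    """Place each distinct t at x = (users with t or more taggings).

    Tied users collapse onto one x position; the other positions stay 0.
    """
    hist = [0] * len(ranked)
    for t, u in rank_to_at_least_counts(ranked).items():
        hist[u - 1] = t
    return hist


ranked = [3, 2, 2, 2, 1]
print(rank_to_at_least_counts(ranked))  # {3: 1, 2: 4, 1: 5}
print(sparse_histogram(ranked))         # [3, 0, 0, 2, 1]
```

<p>The non-zero points sit on the same curve as the original ranked histogram, which is why the fitted power law does not change.</p>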
<p>So, as far as I can tell, nothing depends on unique ‘t’s (or unique ‘u’s, which must be unique to put them on the x axis to begin with &#8212; although I think you must have been referring to each u having a unique t value). Thanks much for the comment, and please pass on any further thoughts; the transformation issue above was a surprise, I&#8217;m sure there&#8217;s other issues here&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Phil</title>
		<link>http://www.econometa.com/archives/15#comment-15</link>
		<author>Phil</author>
		<pubDate>Tue, 14 Jun 2005 11:03:08 +0000</pubDate>
		<guid>http://www.econometa.com/archives/15#comment-15</guid>
		<description>You make a good case for the conventional 'long tail' graph. But:

&lt;i&gt;saying that the u-th user performed t taggings is equivalent to saying that u users performed t or more taggings ... if we invert this graph and turn the number of users performing t or more taggings into the percentage of such users, we arrive at a probability distribution&lt;/i&gt;

This only really works if you can assume unique 'u's - and by implication unique 't's. I've been looking at some real figures on inbound links. If you order them by ranking, they do show something like the familiar downward curve (5389, 3849, 3309, 3211, 3183, 2542...) But here are 'rankings' 90-104:
816, 816, 812, 811, 792, 789, 786, 779, 779, 778, 778, 767, 758, 753, 753
And so it goes on. Further out, of course, there are many more 'duplicates'; over 2,000 of them in the case of the lowest non-zero value (one link, in other words). The simple 'long tail' histogram could represent these readily enough - as a descending mountain range with a lot of plateaux - but the position of any 'column' within one of those plateaux would be meaningless. In effect the X axis would carry information at some points but not others.

Or am I missing something?</description>
		<content:encoded><![CDATA[<p>You make a good case for the conventional &#8216;long tail&#8217; graph. But:</p>
<p><i>saying that the u-th user performed t taggings is equivalent to saying that u users performed t or more taggings &#8230; if we invert this graph and turn the number of users performing t or more taggings into the percentage of such users, we arrive at a probability distribution</i></p>
<p>This only really works if you can assume unique &#8216;u&#8217;s - and by implication unique &#8216;t&#8217;s. I&#8217;ve been looking at some real figures on inbound links. If you order them by ranking, they do show something like the familiar downward curve (5389, 3849, 3309, 3211, 3183, 2542&#8230;) But here are &#8216;rankings&#8217; 90-104:<br />
816, 816, 812, 811, 792, 789, 786, 779, 779, 778, 778, 767, 758, 753, 753<br />
And so it goes on. Further out, of course, there are many more &#8216;duplicates&#8217;; over 2,000 of them in the case of the lowest non-zero value (one link, in other words). The simple &#8216;long tail&#8217; histogram could represent these readily enough - as a descending mountain range with a lot of plateaux - but the position of any &#8216;column&#8217; within one of those plateaux would be meaningless. In effect the X axis would carry information at some points but not others.</p>
<p>Or am I missing something?</p>
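<p>To make the &#8216;plateaux&#8217; concrete, here is a minimal sketch that groups the fifteen values quoted above (rankings 90-104) into runs of equal value. The helper name plateaux and its first_rank parameter are hypothetical, introduced only for illustration.</p>

```python
from itertools import groupby

# Rankings 90-104 as quoted in the comment (inbound-link counts).
values = [816, 816, 812, 811, 792, 789, 786, 779, 779, 778, 778, 767, 758, 753, 753]


def plateaux(ranked, first_rank=1):
    """Group a descending ranked list into (value, first_rank, last_rank) runs."""
    runs = []
    rank = first_rank
    for value, group in groupby(ranked):
        width = len(list(group))
        runs.append((value, rank, rank + width - 1))
        rank += width
    return runs


runs = plateaux(values, first_rank=90)
# Keep only the runs wider than one rank, i.e. the tied values.
tied = [run for run in runs if run[2] > run[1]]
print(len(runs))  # 11 distinct values across 15 ranks
print(tied)       # [(816, 90, 91), (779, 97, 98), (778, 99, 100), (753, 103, 104)]
```

<p>Within each tied run the order of the individual columns is arbitrary, which is the sense in which the X axis carries information at some points but not others.</p>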
]]></content:encoded>
	</item>
</channel>
</rss>
