<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Econometa &#187; Tagging</title>
	<atom:link href="http://www.econometa.com/tags/tagging/feed" rel="self" type="application/rss+xml" />
	<link>http://www.econometa.com</link>
	<description>The economy of stuff about stuff</description>
	<lastBuildDate>Sat, 05 Apr 2008 15:21:56 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Distributions in everyday life</title>
		<link>http://www.econometa.com/archives/55</link>
		<comments>http://www.econometa.com/archives/55#comments</comments>
		<pubDate>Mon, 15 Oct 2007 02:12:10 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Economics]]></category>
		<category><![CDATA[Tagging]]></category>

		<guid isPermaLink="false">http://www.econometa.com/archives/55</guid>
		<description><![CDATA[Supplemented by Steve Hsu, a rant on how few people get the training to really understand distributions, which are increasingly important in everyday life. This reminds me of my foray into tagging, where I had to dust off my own understanding of distributions. On the one hand, the fact is that the topic can just [...]]]></description>
			<content:encoded><![CDATA[<p>Supplemented by <a href="http://infoproc.blogspot.com/2007/10/bounded-cognition.html">Steve Hsu</a>, a <a href="http://itre.cis.upenn.edu/%7Emyl/languagelog/archives/004992.html">rant</a> on how few people get the training to really understand distributions, which are increasingly important in everyday life.</p>
<p>This reminds me of my foray into <a href="http://www.econometa.com/tags/tagging">tagging</a>, where I had to dust off my own understanding of distributions. On the one hand, the fact is that the topic can just be plain hard, and at times pretty counterintuitive. On the other, I&#8217;m sure that numbers themselves seemed hard until they became part of almost everything we do.</p>
<p>One point that seems like a good one: it&#8217;s true that statistics are essential to understanding things like news reports, Google Analytics, sales dashboards, sports stats, etc., and it probably wouldn&#8217;t be a bad idea for this fact to be reflected in school curriculums.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.econometa.com/archives/55/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Top users and power laws</title>
		<link>http://www.econometa.com/archives/30</link>
		<comments>http://www.econometa.com/archives/30#comments</comments>
		<pubDate>Tue, 27 Dec 2005 06:05:43 +0000</pubDate>
		<dc:creator></dc:creator>
				<category><![CDATA[Tagging]]></category>

		<guid isPermaLink="false">http://www.econometa.com/archives/30</guid>
		<description><![CDATA[In a conversation related to my previous postings on power laws, a question came up: If a ranked distribution follows a power law, what percentage of the total is in the highest ranked bin? So for the example of a histogram of users ranked by the % of taggings, what percentage M of all taggings [...]]]></description>
			<content:encoded><![CDATA[<p>In a conversation related to <a href="http://www.econometa.com/archives/25">my</a> <a href="http://www.econometa.com/archives/15">previous</a> <a href="http://www.econometa.com/archives/12">postings</a> on power laws, a question came up: If a ranked distribution follows a power law, what percentage of the total is in the highest ranked bin? So for the example of a histogram of users ranked by the % of taggings, what percentage M of all taggings are made by the very top user? </p>
<p><img src="http://www.econometa.com/wp-images/post-images/rank-power-top-user.gif" alt="top user in a power law" /></p>
<p>It turns out that this depends on whether the power law is an exact inverse (Zipf: a = 1) power law or a higher order power law. </p>
<p>The top user u = 1 has M percent of all taggings, so the curve is t = Mu^-a. Each bar measures the percentage of taggings by that user, so the sum of all bars has to equal 1. So for N users we have </p>
<p>M + M/(2^a) + M/(3^a) + &#8230; + M/(N^a) = 1</p>
<p>or</p>
<p>M = 1/(1 + 1/(2^a) + 1/(3^a) + &#8230; + 1/(N^a)).</p>
<p>For a Zipf law with a = 1, the denominator is a partial sum of the harmonic series, which diverges; that means the % of taggings by the top user keeps dropping as the number of users N gets larger. We can calculate M by remembering that the sum of the first N terms of the harmonic series approaches gamma + ln(N) as N grows, where gamma is the Euler-Mascheroni constant and ln is the natural log. We can check that this is close enough by N = 100, so calculating N = 10 by hand and using this formula for the rest we have:</p>
<p><iframe src="http://numsum.com/spreadsheet/show_plain/7463" width="100%" height="300"></iframe></p>
<p>Gotta love NumSum. But if a > 1, the series in the denominator converges, so that as the number of users N increases, the % of taggings by the top user M quickly settles to a constant:</p>
<p><iframe src="http://numsum.com/spreadsheet/show_plain/7467" width="100%" height="300"></iframe></p>
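<p>For anyone who wants to reproduce these numbers without the spreadsheets, here is a minimal Python sketch of the two cases. It is not the NumSum calculation itself; the function names are made up for illustration, and it simply evaluates the formula for M above by brute force alongside the gamma + ln(N) shortcut.</p>

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def top_user_share(n_users, a):
    """Exact M = 1 / (1 + 1/2^a + ... + 1/N^a) for N users and exponent a."""
    return 1.0 / sum(u ** -a for u in range(1, n_users + 1))


def zipf_share_approx(n_users):
    """Approximate M for a = 1 using the gamma + ln(N) estimate of the sum."""
    return 1.0 / (EULER_GAMMA + math.log(n_users))


if __name__ == "__main__":
    # Columns: N, exact M for a = 1, approximate M for a = 1, exact M for a = 2
    for n in (10, 100, 1000, 10000):
        print(n, top_user_share(n, 1.0), zipf_share_approx(n), top_user_share(n, 2.0))
```

<p>For a = 1 the share keeps shrinking as N grows, while for a = 2 it settles near 6/pi^2, or about 0.61, which is the reciprocal of the limit of the series 1 + 1/4 + 1/9 + ...</p>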
<p>This is all in follow-up to the fourth point from <a href="http://www.econometa.com/archives/25">this post</a>:</p>
<blockquote><p>
(4) While it is true that &#8220;bigger systems benefit from both higher heads *and* longer tails,&#8221; in general this usually just makes the histogram fit the curve better; it is rather the shape of the curve that determines whether or not &#8220;most activity is from a small group of highly active users.&#8221;
</p></blockquote>
<p>A Zipf law is a case where a bigger system actually has a distinct effect: the bigger the system, the lower the percentage resident in the highest ranked bin, resulting in a lower percentage of activity from the most active users. In the case of higher power laws, this percentage quickly settles to a steady constant, so size doesn&#8217;t have much of an effect once the system is reasonably big.</p>
<p>As an aside, I was also asked to post the graph presented at <a href="http://tagcamp.org/">TagCamp</a> showing a histogram that fits a &#8220;long tail&#8221; but not a power law, so here it is:</p>
<p><img src="http://www.econometa.com/wp-images/post-images/rank-power-false-power-law.gif" alt="false power law" /></p>
<p>Although this looks similar to a power law, if we disregard the top two users the histogram actually fits the curve that corresponds to a perfect bell curve PDF. This means that in contrast to a power law, where the average number of taggings per user is essentially meaningless, for the histogram above this average is maximally meaningful.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.econometa.com/archives/30/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Trillion dollar matrix crunch</title>
		<link>http://www.econometa.com/archives/27</link>
		<comments>http://www.econometa.com/archives/27#comments</comments>
		<pubDate>Wed, 23 Nov 2005 19:08:05 +0000</pubDate>
		<dc:creator></dc:creator>
				<category><![CDATA[Economics]]></category>
		<category><![CDATA[Personal data]]></category>
		<category><![CDATA[Tagging]]></category>

		<guid isPermaLink="false">http://www.econometa.com/?p=27</guid>
		<description><![CDATA[When I saw Ethan put Nivi&#8217;s matrix into NumSum, I thought it was so cool that I had to take up Mike&#8217;s request to stick some of his thought-provoking wishlist into the matrix as well. Here&#8217;s my attempt: I got rid of the &#8220;expert&#8221; scope column, not because it&#8217;s not relevant, but because there weren&#8217;t [...]]]></description>
			<content:encoded><![CDATA[<p>When I saw <a href="http://onotech.blogspot.com/2005_11_01_onotech_archive.html#113158806157792183">Ethan</a> put <a href="http://www.nivi.com/blog/article/the-trillion-dollar-web-20-matrix">Nivi&#8217;s matrix</a> into <a href="http://numsum.com/">NumSum</a>, I thought it was so cool that I had to take up <a href="http://www.techcrunch.com/2005/11/21/companies-id-like-to-profile-but-dont-exist/">Mike&#8217;s request</a> to stick some of his thought-provoking wishlist into the matrix as well. Here&#8217;s my attempt:</p>
<p><iframe src="http://numsum.com/spreadsheet/show_plain/4484" width="100%" height="370"></iframe></p>
<p><del datetime="2005-12-09T09:22:5008:00">I got rid of the &#8220;expert&#8221; scope column, not because it&#8217;s not relevant, but because there weren&#8217;t any entries and aren&#8217;t likely to be anytime soon &#8212; seems like amateurs are more in fashion than experts these days.</del> (UPDATE: <a href="http://onotech.blogspot.com/2005_12_01_onotech_archive.html#113403123495223076">Ethan</a> rightly points out that <a href="http://www.squidoo.com/">Squidoo</a> is an example of expert filtering, as is really any successful media outlet, including Mike&#8217;s TechCrunch itself; I updated the matrix to reflect this and a couple other suggestions &#8212; if anyone else has more suggestions for entries, please pass them on!)</p>
<p>Nivi also points out that there are other dimensions that could be added to this matrix, in particular that of metadata location. He mentions that a lot of this valuable data is on your desktop, to which I&#8217;d add that a lot is also &#8220;locked up&#8221; in various applications, e.g. your search history, your tags, your OPML list, etc.</p>
<p>I also thought a great point was that the metadata is about both you and the data it points to:</p>
<blockquote><p>
This metadata is metadata about the data it points to and metadata about your interests and attention. In fact, the utility of a piece of metadata in describing data may be inversely related to its utility in describing your interests. For example, your clicks describe your interests, but they don’t really say anything useful about the data you are clicking on. (Propers to Ethan Stock for this insight).
</p></blockquote>
<p>However, I&#8217;m not so sure about the idea of an inverse relationship between the utility of metadata along these two dimensions. Your clicks may not *directly* say anything useful about the data clicked on, but indirectly they&#8217;re at minimum a vote (as used by AdWords), and even better a proxy for collaborative filtering (as used by Amazon). Delicious makes this explicit by allowing you to pivot on users, tags (metadata), and URLs (the data pointed to); each one of these dimensions has a different meaning, but I think every one has some utility. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.econometa.com/archives/27/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Rankings: never a bell curve, not always a power law</title>
		<link>http://www.econometa.com/archives/25</link>
		<comments>http://www.econometa.com/archives/25#comments</comments>
		<pubDate>Sat, 22 Oct 2005 17:09:01 +0000</pubDate>
		<dc:creator></dc:creator>
				<category><![CDATA[Tagging]]></category>

		<guid isPermaLink="false">http://www.econometa.com/?p=25</guid>
		<description><![CDATA[I don&#8217;t mean to harp on this one, but I&#8217;m still seeing a lot of potentially misleading statements out there concerning &#8220;power laws&#8221; and &#8220;long tails.&#8221; One of the most prolific writers on this topic is Clay Shirky, so I&#8217;ll use a recent comment of his as an example (sorry to pick on Clay, but [...]]]></description>
			<content:encoded><![CDATA[<p>I don&#8217;t mean to harp on this one, but I&#8217;m still seeing a lot of potentially misleading statements out there concerning &#8220;power laws&#8221; and &#8220;long tails.&#8221; One of the most prolific writers on this topic is <a href="http://shirky.com/">Clay Shirky</a>, so I&#8217;ll use a recent comment of his as an example (sorry to pick on Clay, but I guess pioneers take the arrows!). </p>
<p>In <a href="http://lists.del.icio.us/pipermail/discuss/2005-October/004025.html">a post</a> to the del.icio.us mailing list, Clay responds to someone seeking an idea of what the average number of bookmarks is per user. The question is asked in light of another, which is: are most bookmarks made by a small core of heavy users?</p>
<p>Clay responds that:</p>
<blockquote><p>
The deceptive thing about systems like this are that the average is meaningless, as the distribution is not a bell curve. There are, yes, a small core of highly active users, but the decision about who goes in that core is totally random, since the distribution of links per user is roughly a power law. </p>
<p>Both del and spurl get most of their activity from a small group of highly active users, but bigger systems benefit from both higher heads *and* longer tails.
</p></blockquote>
<p>I think that several parts of this exchange are representative of statements that really need some clarification to avoid potential confusion:</p>
<p>(1) The histogram of users ranked by number of bookmarks made (&#8220;the distribution of links per user&#8221;) is a ranked graph, which by definition is *always* decreasing, and can *never* be a bell curve. </p>
<p>(2) I don&#8217;t know about the actual del.icio.us data, but it&#8217;s true that many Internet-related ranked histograms seem to fit a power law. However, many others do not, despite exhibiting a &#8220;long tail&#8221;; they may better fit a negative logarithm, inverse exponential, or more complicated function. </p>
<p>(3) The question of whether the average number of bookmarks per user is meaningful is best decided by considering a completely different, non-ranked distribution, namely the number of users having a given number of bookmarks. This is the PDF corresponding to the ranked graph, and *can* be a bell curve with a meaningful average.</p>
<p>(4) While it is true that &#8220;bigger systems benefit from both higher heads *and* longer tails,&#8221; in general this usually just makes the histogram fit the curve better; it is rather the shape of the curve that determines whether or not &#8220;most activity is from a small group of highly active users.&#8221;</p>
<p>Here I want to expand on the second and third points. In a <a href="http://www.econometa.com/archives/15">previous post</a>, I showed that if the ranked data fits a power law, then the corresponding PDF is also a power law. Once this is understood, it is clearer why Clay is correct in saying that if the histogram of users ranked by bookmarks follows a power law, then the histogram of number of users per number of bookmarks also does, and therefore it&#8217;s true that an average is pretty much meaningless.</p>
<p>However, if the ranked data has a &#8220;long tail&#8221; but doesn&#8217;t really fit a power law, the corresponding PDF *can* have a meaningful average; in fact, it can be an exact bell curve! So while a ranked histogram that fits a power law implies a meaningless average, a ranked histogram that just exhibits a &#8220;long tail&#8221; does not, and that&#8217;s why it&#8217;s better to look at the PDF when trying to answer this question.  </p>
<p>To show this, we can analyze the above example by following the logic of my previous post in reverse. So, let&#8217;s assume that the distribution of the number of bookmarks made by users fits a perfect bell curve. This means that the PDF is a Gaussian or normal distribution:</p>
<p><img src="http://www.econometa.com/wp-images/post-images/rank-power-gaussian.gif" alt="Gaussian: y = (1/sqrt(pi))Exp[-(4(x - 0.5))^2]" /></p>
<p>This is the ideal case where an average is *most* meaningful and informative. Now, this means that we can integrate to form the corresponding CDF, showing the percentage of users who have made b or more bookmarks:</p>
<p><img src="http://www.econometa.com/wp-images/post-images/rank-power-erfc.gif" alt="Erfc (the complementary error function): y = (1/2)(Erfc[(4(x - 0.5))])" /></p>
<p>Finally, we can then invert this to get the ranked graph of users in order of the number of bookmarks made:</p>
<p><img src="http://www.econometa.com/wp-images/post-images/rank-power-InvErfc.gif" alt="Inverse Erfc: y = (1/4)InverseErfc[2x] + 0.5" /></p>
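<p>The three steps above (PDF, then &#8220;b or more&#8221; CDF, then ranked graph) can be sketched in Python with only the standard library. This is an illustration under assumptions: the mean of 0.5 and the width are read off the formulas in the image descriptions, the rank x is expressed as a fraction of all users, and the function names are hypothetical.</p>

```python
import math
from statistics import NormalDist

MU = 0.5
# Chosen so that the Gaussian exponent is -(4 (x - 0.5))^2
SIGMA = 1.0 / (4.0 * math.sqrt(2.0))
BOOKMARKS = NormalDist(mu=MU, sigma=SIGMA)


def fraction_with_at_least(b):
    """Fraction of users with b or more bookmarks: (1/2) erfc(4 (b - 0.5))."""
    return 0.5 * math.erfc(4.0 * (b - MU))


def ranked_value(x):
    """Bookmarks made by the user at rank fraction x (0 < x < 1).

    This inverts fraction_with_at_least: the top x of users all have
    at least this many bookmarks.
    """
    return BOOKMARKS.inv_cdf(1.0 - x)


if __name__ == "__main__":
    for x in (0.01, 0.1, 0.5, 0.9, 0.99):
        print(x, ranked_value(x))
```

<p>The ranked curve is decreasing even though the underlying PDF is a bell curve; it is only the steep ends near x = 0 and x = 1 that can pass for a head and a tail.</p>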
<p>Now, I&#8217;d certainly agree that most actual histograms I&#8217;ve seen on tagging data, etc. fit a power law much better than the above graph; but if you were faced with data that looked like the above, especially with a rescaled y axis, you might think something like &#8220;oh, it&#8217;s just the long tail of a power law but the top guys are lower,&#8221; and not necessarily realize that this changes the situation significantly; in particular, the underlying PDF might not even be a decreasing function.</p>
<p>The main point is this: </p>
<p align=center><strong>If you consider ranked histograms, it&#8217;s easy to see power laws everywhere. </strong></p>
<p>But in many cases it may be that fitting another curve would be more informative, or that the ranked graph is not the right one to be looking at in the first place. </p>
<p>Returning to Clay&#8217;s post, the other original question that was asked was: Are most bookmarks made by a small core of heavy users? This can be answered by looking at the ranked histogram: if the median line dividing the area under the curve in half is far to the left, then most bookmarks are made by the top few users. </p>
<p>To get an even clearer answer to this question, it seems to me that the obvious thing to do would be to integrate the ranked graph, resulting in a curve from which you could easily read that the top u% of users were responsible for b% of all bookmarks made. For example, if the ranked histogram fit an exact power law b = 1/u, then the integrated graph would be a logarithm:</p>
<p><img src="http://www.econometa.com/wp-images/post-images/rank-power-log.gif" alt="Logarithm" /></p>
<p>If we consider a ranked histogram that fits a higher order power law b = 1/u^2, the integrated graph shows an even higher dominance by top users:</p>
<p><img src="http://www.econometa.com/wp-images/post-images/rank-power-x^-1.gif" alt="1/x" /></p>
<p>In contrast, integrating the ranked graph resulting from a Gaussian PDF gives:</p>
<p><img src="http://www.econometa.com/wp-images/post-images/rank-power-IntInvErfc.gif" alt="Integral of Inverse Erfc: y = IntegralFrom0tox((1/4)InverseErfc[2t] + 0.5)dt" /></p>
<p>Here we can easily see that the heavy users do not have much of an outsized influence. I guess the purpose of all this, besides getting it straight in my own head, is to underscore the importance of this fact: the histogram you want to consider, and the usefulness of the curve that you fit to the data, depend very much upon the question you want to answer.</p>
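<p>With actual per-user counts in hand, the integrated ranked graph does not need any curve fitting at all. Here is a rough Python sketch, with made-up function names and synthetic counts standing in for real bookmark data, of how one might read off &#8220;the top u% of users made b% of all bookmarks.&#8221;</p>

```python
from itertools import accumulate


def cumulative_share(counts):
    """Return [(fraction of users, fraction of bookmarks)] for the top 1, 2, ... users."""
    ranked = sorted(counts, reverse=True)
    total = sum(ranked)
    n = len(ranked)
    return [((i + 1) / n, s / total) for i, s in enumerate(accumulate(ranked))]


def share_of_top(counts, user_fraction):
    """Fraction of all bookmarks made by the top `user_fraction` of users."""
    curve = cumulative_share(counts)
    k = max(1, round(user_fraction * len(curve)))
    return curve[k - 1][1]


if __name__ == "__main__":
    # Synthetic examples only: 100 users following 1/u, 1/u^2, and a flat line
    inverse = [1000 / u for u in range(1, 101)]
    inverse_square = [1000 / u ** 2 for u in range(1, 101)]
    flat = [10] * 100
    for name, counts in (("1/u", inverse), ("1/u^2", inverse_square), ("flat", flat)):
        print(name, share_of_top(counts, 0.2))
```

<p>The steeper the ranked curve, the larger the share that the top 20% of users account for; a flat histogram gives exactly 20%.</p>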
]]></content:encoded>
			<wfw:commentRss>http://www.econometa.com/archives/25/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Turning rankings into distributions</title>
		<link>http://www.econometa.com/archives/15</link>
		<comments>http://www.econometa.com/archives/15#comments</comments>
		<pubDate>Tue, 14 Jun 2005 01:30:46 +0000</pubDate>
		<dc:creator></dc:creator>
				<category><![CDATA[Tagging]]></category>

		<guid isPermaLink="false">http://www.econometa.com/archives/15</guid>
		<description><![CDATA[OK, I finally cleared up what was bothering me about those long tail graphs, prompted by Phil&#8217;s comment and helped a lot by this article. The issue is that long tail graphs have an x axis comprised of items ranked by their y axis value; e.g. for a social bookmarking site we can graph users [...]]]></description>
			<content:encoded><![CDATA[<p>OK, I finally cleared up what was bothering me about those long tail graphs, prompted by <a href="http://phenomenologic.blogspot.com/">Phil</a>&#8217;s <a href="http://www.econometa.com/archives/12#comment-12">comment</a> and helped a lot by <a href="http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html">this article</a>. </p>
<p>The issue is that long tail graphs have an x axis comprised of items ranked by their y axis value; e.g. for a social bookmarking site we can graph users ranked by how many taggings they&#8217;ve performed, and approximate it by an inverse power law:</p>
<p><img src="http://www.econometa.com/wp-images/post-images/taggraph-ranked.gif" alt="Taggraph (ranked)" /></p>
<p>One nice thing about a ranked graph is that the &#8220;area&#8221; under the curve is equal to the total value associated with the items spanned on the ranked axis; e.g. for the above graph, the area to the left of the median line represents 50% of the total number of taggings.</p>
<p>However, it seems as if both axes here are carrying essentially the same information (as Phil said, &#8220;that bar’s the tallest *and* it’s the first!&#8221;). But it turns out that the ranking and the value are in fact separate pieces of information, and a power law shape to the data carries over to more familiar distributions. As I go through how that works, I&#8217;ll add some notes on terminology, since it can get a bit confusing.  </p>
<p>Firstly, a ranked graph like the one above that follows an exact 1/u inverse power law (i.e. a = 1) is often said to adhere to &#8220;Zipf&#8217;s Law.&#8221; This law originally described how the frequency of use of the nth-most-frequently-used word in any natural language is approximately inversely proportional to n. Subsequently the term Zipf has been used to refer to ranked data that can be fit to several different kinds of curves, but one thing&#8217;s for sure: a ranked graph is by definition always decreasing, and is not a probability distribution! (although as we&#8217;ll see below, it&#8217;s closely related to one).</p>
<p>The key to transforming a ranked graph into a more familiar distribution is this fact: saying that the nth ranked item has a value of y is equivalent to saying that n items have a value of y or more. So for the above graph, saying that the u-th user performed t taggings is equivalent to saying that u users performed t or more taggings:</p>
<p><img src="http://www.econometa.com/wp-images/post-images/taggraph-t-or-more.gif" alt="Taggraph (t or more)" /></p>
<p>Now if we invert this graph and turn the number of users performing t or more taggings into the percentage of such users, we arrive at a probability distribution, namely the probability that a randomly selected user performed t  or more taggings:</p>
<p><img src="http://www.econometa.com/wp-images/post-images/taggraph-cdf.gif" alt="Taggraph (CDF)" /></p>
<p>This kind of distribution is called a Cumulative Distribution Function (CDF), and if it follows an inverse power law it is also sometimes referred to as a &#8220;Pareto distribution.&#8221; A Pareto distribution originally described the percentage of people owning more than x amount of wealth, and led to what is sometimes called the &#8220;Pareto principle&#8221; or the &#8220;80-20 rule&#8221;, e.g. that 20% of the population owns 80% of the wealth (note that 80 + 20 = 100 is a coincidence here, another common confusion!). </p>
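<p>To make the rank-to-distribution step concrete, here is a small Python sketch. The tagging counts are invented and the function names are hypothetical; the only point is to show that the ranked list and the &#8220;t or more&#8221; counts carry the same information.</p>

```python
from bisect import bisect_left


def ranked(counts):
    """Taggings per user, largest first: the ranked graph."""
    return sorted(counts, reverse=True)


def users_with_at_least(counts, t):
    """N(t): the number of users who performed t or more taggings."""
    ascending = sorted(counts)
    return len(ascending) - bisect_left(ascending, t)


def fraction_with_at_least(counts, t):
    """Probability that a randomly selected user performed t or more taggings."""
    return users_with_at_least(counts, t) / len(counts)


if __name__ == "__main__":
    taggings = [50, 20, 20, 7, 3, 1]  # invented data for six users
    for u, t in enumerate(ranked(taggings), start=1):
        # The u-th ranked user performed t taggings, so at least u users
        # performed t or more (more than u only when there are ties).
        print(u, t, users_with_at_least(taggings, t), fraction_with_at_least(taggings, t))
```

<p>Ties are the one wrinkle: when two users share a count, the number of users with &#8220;t or more&#8221; is the rank of the last of them.</p>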
<p>Finally, we can note that the number of users performing t or more taggings minus the number performing t+1 or more taggings is the number of users performing exactly t taggings, i.e. N(t) - N(t+1) = n(t) &#8776; -N'(t); here n(t) is the number of users performing exactly t taggings, and is approximately equal to the negative derivative of the number of users N(t) performing t or more taggings:</p>
<p><img src="http://www.econometa.com/wp-images/post-images/taggraph-pdf.gif" alt="Taggraph (PDF)" /></p>
<p>This is a familiar Probability Density Function (PDF), and is the more common setting for describing a distribution as fitting a &#8220;power law.&#8221; Note that if the original ranked graph follows an inverse power law of any kind, so do the corresponding CDF and PDF. The converse is not true, i.e. if a PDF follows an inverse power law t^-b, the other graphs only do so if the shape parameter b = 1 + 1/a is greater than 1. Otherwise, the integral of t^-b with b &lt;= 1 does not give a power law.</p>
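<p>Spelled out, the chain from ranked graph to PDF goes as follows, treating rank u and taggings t as continuous and writing C for the top user&#8217;s count (C is a symbol introduced here for illustration, not one used above):</p>

```latex
\begin{align*}
t &= C\,u^{-a} \quad\Longrightarrow\quad u = \left(\frac{t}{C}\right)^{-1/a} \\
N(t) &= \left(\frac{t}{C}\right)^{-1/a} \propto t^{-1/a} \\
n(t) &\approx -N'(t) = \frac{C^{1/a}}{a}\, t^{-(1 + 1/a)} \propto t^{-b},
\qquad b = 1 + \frac{1}{a}
\end{align*}
```

<p>So a Zipf ranking with a = 1 corresponds to a PDF exponent of b = 2. Going the other way, integrating t^-b from t to infinity only converges when b is greater than 1, in which case it gives a ranked exponent of a = 1/(b - 1).</p>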
<p>Of course, after all this, the fact is that at least for me, the original ranked graph is the one that most directly and intuitively answers the question at hand: are most taggings by the most active users? I'd be interested in comments and thoughts, or other graphs that aren't covered above.</p>
<p>** Addendum: Some technical notes, mostly to refresh my own memory</p>
<p> - For a discrete set of outcomes, the function corresponding to non-zero probability values for each outcome is called a Probability Mass Function (PMF).<br />
 - A PMF is also sometimes called a Probability Function (PF).<br />
 - A histogram (AKA bar chart, step function) can be obtained from a PMF by binning the outcomes into ranges (usually equal or exponential) in a continuous sample space (set of outcomes).<br />
 - For a continuous sample space, a Probability Density Function (PDF) gives the probability of an outcome in a specified range by integrating the function over this range. A PMF or its histogram can be approximated by a PDF.<br />
 - For a PDF, the probability of any specific outcome is 0; non-zero probabilities only arise from integrating the PDF over an interval of outcomes.<br />
 - The "area" under a PMF or PDF must be 1; the area is the sum of all values for a PMF and the integral over the entire sample space for a PDF.<br />
 - A power law can only approximate a PDF and must be cut off at its extremities, since its integral is divergent and so cannot integrate to 1.<br />
 - A Cumulative Distribution Function (CDF) F(x) is usually defined in terms of a PDF f(x) by F(b) - F(a) = integral from a to b of f(x), i.e. f(x) = F'(x). This means F(x) is the probability of any outcome less than or equal to x.<br />
 - A CDF is also sometimes called a Distribution Function (DF).<br />
 - Sometimes a CDF is defined with a different inequality; e.g. a Pareto distribution is a CDF F(x) which is the probability of any outcome *greater* than or equal to x, so that f(x) = -F'(x).<br />
 - A Pareto distribution is characterized by a shape parameter a and a minimum value parameter m, and takes the form F(x) = (m/x)^a for x >= m, 1 otherwise; the corresponding PDF is f(x) = a m^a / x^(a+1) for x >= m, 0 otherwise. </p>
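<p>As a quick sanity check on the last two notes, here is a Python sketch that compares the Pareto density with the negative slope of the &#8220;greater than or equal to x&#8221; function. It assumes the standard textbook forms, (m/x)^a for the probability of an outcome of at least x and a m^a / x^(a+1) for the density; the function names and parameter values are chosen only for illustration.</p>

```python
def pareto_survival(x, a, m):
    """Probability of an outcome greater than or equal to x (x > 0)."""
    return (m / x) ** a if x >= m else 1.0


def pareto_density(x, a, m):
    """Density a m^a / x^(a+1) for x >= m, 0 otherwise."""
    return a * m ** a / x ** (a + 1) if x >= m else 0.0


def negative_slope(x, a, m, h=1e-6):
    """Central-difference estimate of -F'(x) for the survival function F."""
    return -(pareto_survival(x + h, a, m) - pareto_survival(x - h, a, m)) / (2 * h)


if __name__ == "__main__":
    a, m = 1.5, 1.0  # arbitrary illustrative parameters
    for x in (1.5, 2.0, 10.0):
        print(x, pareto_density(x, a, m), negative_slope(x, a, m))
```

<p>The two columns agree to within the error of the finite difference, which is the f(x) = -F'(x) relationship in numerical form.</p>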
]]></content:encoded>
			<wfw:commentRss>http://www.econometa.com/archives/15/feed</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>The long tail tagging the dog</title>
		<link>http://www.econometa.com/archives/12</link>
		<comments>http://www.econometa.com/archives/12#comments</comments>
		<pubDate>Sat, 04 Jun 2005 01:45:56 +0000</pubDate>
		<dc:creator></dc:creator>
				<category><![CDATA[Tagging]]></category>

		<guid isPermaLink="false">http://www.econometa.com/?p=12</guid>
		<description><![CDATA[In a previous post, I mentioned some interesting graphs that could be made from public URL tagging data such as that at del.icio.us. I keep wanting to see these graphs, so I figured I&#8217;d post some details and issue a request / challenge / wheedle to the real hackers out there to slap something together. [...]]]></description>
			<content:encoded><![CDATA[<p>In a <a href="http://www.econometa.com/archives/9">previous post</a>, I mentioned some interesting graphs that could be made from public URL tagging data such as that at del.icio.us. I keep wanting to see these graphs, so I figured I&#8217;d post some details and issue a request / challenge / wheedle to the real hackers out there to slap something together. Of course, I&#8217;m not sure the data is accessible enough to pull this off, so it may not be possible for a third party to do it&#8230;</p>
<p>What I&#8217;m picturing is a lot like <a href="http://tools.waglo.com/durl">Durl</a>, except instead of history trends, URLs / tags / users would be associated with distribution graphs. There are many possible such graphs, so you have to pick the ones that answer the most interesting questions. The three that seem most interesting to me are:</p>
<p><img src="http://www.econometa.com/wp-images/post-images/taggraphs-per-item.gif" alt="Taggraphs (per item)" /></p>
<p>So for example, upon entering a URL to get the first graph, each bar on the X axis represents a tag, with the Y axis giving the number of users who tagged the URL with that tag. The vertical line corresponds to the median, i.e. half of the taggings were done using tags on either side of that line; thus the long tail could be considered to be the portion to the right of this line, this &#8220;area&#8221; being equal to that to the left of this line. The horizontal line represents the average Y axis value (of arguable interest in most cases). </p>
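<p>The two reference lines are easy to compute once the per-tag counts exist. Here is a minimal Python sketch; the tag names and counts are invented, the function names are hypothetical, and the median line is taken to sit after the first bar at which the running total reaches half of all taggings.</p>

```python
def median_rank(values):
    """Number of top-ranked bars needed to cover at least half of the total."""
    ranked = sorted(values, reverse=True)
    half = sum(ranked) / 2
    running = 0
    for rank, value in enumerate(ranked, start=1):
        running += value
        if running >= half:
            return rank
    return 0  # only reached for empty input


def average_height(values):
    """The horizontal line: the mean Y axis value across all bars."""
    return sum(values) / len(values)


if __name__ == "__main__":
    # Invented example: number of users who applied each tag to one URL
    users_per_tag = {
        "python": 40,
        "programming": 25,
        "tutorial": 15,
        "reference": 10,
        "howto": 5,
        "misc": 5,
    }
    counts = list(users_per_tag.values())
    print(median_rank(counts), average_height(counts))
```

<p>In this made-up example the first two tags already account for more than half of the taggings, so the other four would make up the long tail.</p>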
<p>One could also generate graphs using the entire set of URLs, tags, or users. Again, many such graphs are possible; here are some that answer what seem to me to be interesting questions: </p>
<p><img src="http://www.econometa.com/wp-images/post-images/taggraphs-all.gif" alt="Taggraphs (all)" /></p>
<p>After looking at all this, I realize that my previous post was a bit muddled on why Clay&#8217;s graph bothered me. The problem isn&#8217;t that the users were ordered by decreasing tag usage (also done above), it&#8217;s that the Y axis represented the number of <em>tags</em> ever used instead of the number of <em>taggings</em> performed. This makes the &#8220;area&#8221; under Clay&#8217;s curve in the long tail difficult to define, and so it&#8217;s hard (I think) to find a question that it answers. Or who knows, maybe if I look at it tomorrow, it&#8217;ll make perfect sense.</p>
<p>So, anyone out there up for making these graphs real&#8230;?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.econometa.com/archives/12/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Tagging and long tails</title>
		<link>http://www.econometa.com/archives/9</link>
		<comments>http://www.econometa.com/archives/9#comments</comments>
		<pubDate>Tue, 17 May 2005 16:58:37 +0000</pubDate>
		<dc:creator></dc:creator>
				<category><![CDATA[Tagging]]></category>

		<guid isPermaLink="false">http://www.econometa.com/?p=9</guid>
		<description><![CDATA[Clay Shirky posted a great essay on social tagging vs. expert categorization. Tagging is a particularly interesting example of &#8220;stuff about stuff&#8221; being valuable, because it includes two extra ingredients: social network effects and the ability to address the &#8220;long tail&#8221; of both content and meaning. In a system like del.icio.us where each person can [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.shirky.com/">Clay Shirky</a> posted a <a href="http://shirky.com/writings/ontology_overrated.html">great essay</a> on social tagging vs. expert categorization. Tagging is a particularly interesting example of &#8220;stuff about stuff&#8221; being valuable, because it includes two extra ingredients: social network effects and the ability to address the &#8220;long tail&#8221; of both content and meaning. </p>
<p>In a system like <a href="http://del.icio.us">del.icio.us</a> where each person can tag a URL, value is created along two dimensions: the user who tagged it has presumably fit it into his/her own personal category scheme; and the aggregated tags of many users assign new semantic properties to the URL. Feedback can then be created between these dimensions via an auto-complete functionality (e.g. <a href="http://ejohn.org/projects/autodelicious/">John Resig&#8217;s extension</a>) that, at the moment when a user assigns a tag, displays tags assigned to that URL by other users and/or algorithmically &#8220;related&#8221; tags based on what the user types.</p>
<p>Secondly, the fact that any user can assign a tag to any URL means that many more URLs can be tagged (at no cost) as compared to an expert categorization scheme, and that many more (weighted) meanings can be assigned to each URL. The tagging of obscure URLs addresses the &#8220;long tail&#8221; of URLs, while the aggregate &#8220;tag profile&#8221; of a given URL addresses the &#8220;long tail&#8221; of perceived meanings.</p>
<p>Clay&#8217;s essay covers these points in clear prose accompanied by some really helpful charts and graphs. One very minor issue I have, though, is with the <a href="http://shirky.com/writings/ontology_overrated.html#tag_distributions_on_del.icio.us">&#8220;Tag Distributions&#8221; chart</a>. Clay refers to &#8220;the characteristic long tail of people who use many fewer tags than the power taggers.&#8221; While this chart does exhibit a &#8220;long tail,&#8221; this is simply a result of the fact that the users were ordered by decreasing tag usage (also true of the following three charts) &#8212; the X axis here doesn&#8217;t represent a value, it is just a sequence of users.</p>
<p>The phrase &#8220;long tail&#8221; usually refers to the observation that for many distributions, the number of elements with outlying values (the &#8220;tail&#8221;) may be cumulatively significant compared to the number of elements clustered near the average. Clay might not even have been using the phrase in this way, but once a buzzword gets going, it&#8217;s best to use it as conservatively as possible (otherwise <a href="http://sapventures.typepad.com/main/2005/05/ramsay_on_the_l.html">people start getting pissed off</a>!).</p>
<p>Some &#8220;long tail&#8221; charts that would be interesting to see would be URLs by number of times a tag was assigned (showing whether the long tail of obscure URLs cumulatively comprises more tag assignments than the common URLs), or tags for a specific URL by number of times a tag was assigned (showing whether the long tail of obscure tags cumulatively comprises more tag assignments than the common tags). This last chart could also be averaged across many URLs to see if this long tail applies in general to arbitrary links.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.econometa.com/archives/9/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
