About this post

ABOUT: This entry was posted January 16, 2007 at 10:44 p.m. It is 1053 words long, which, in case you're curious, translates to about 30 inches.

SUMMARY: In which I demonstrate an implementation of logarithm-based tag clouds in Python.

TAGS: Python


Log-based tag clouds in Python

Posted Tuesday, January 16th, 2007 at 10:44 p.m.

For both personal and professional reasons, I've been spending bits of my off time lately learning about tag clouds -- the Web 2.0 weighted lists that present a quick-and-dirty visualization of word frequencies, among other things.

Being a person who likes to tinker, I wanted to build one for myself. So I staked out a few blog posts and white papers and settled on coding something I thought seemed useful: a logarithmic font distribution algorithm, written in Python because I couldn’t find a Python implementation of it already.

It sounds complicated, but you’ll soon see it isn’t. Before I get into the how, check out this sample: a log-based tag cloud of 50 recent NICAR-L posters, weighted by the total number of posts credited to their e-mail addresses. Props to Brian Hamman for giving me the idea. It’s hardly comprehensive. I just pulled 50 names from the January 2007 postings, retrieved total post counts by e-mail address, and plugged them into a spreadsheet. Folks like Brant Houston and Jeff Porter who have used the same e-mail address for years have a bazillion posts to their name. Those like Mary-Jo Webster, who recently changed her name and e-mail address, are badly undercounted. If enough people ask, I’ll get the data from IRE and do it right. But for now, just look at it as an illustration.

The theory

According to my brief review of the science, tag clouds typically come in two flavors: those based on equal assignment and those based on logarithms. Depending on whom you ask, the difference can be huge.

Equal assignment algorithms take a large list of tags and their associated counts and chop them into equal slices, assigning a font size to each one. That’s all well and good, except the smaller, less common tags receive the same weight as the large, heavily used ones. But in reality, most tag taxonomies shake out like this: a handful of heavily used tags at the head of the curve and a long tail of rarely used ones.

You might remember this from your entry-level calculus nightmares or books like Chris Anderson’s The Long Tail. It’s called a power-law distribution, a Pareto distribution, a Zipf distribution, or any number of other things. Lots of patterns in nature come this way, including – lo and behold – the NICAR-L postings (see below). By taking your list of tags and counts and slicing it to fit this distribution, you in theory end up with a tag cloud that better represents your data.
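To make the equal-slice problem concrete, here is a rough sketch of equal-interval sizing run against some long-tailed counts. The function name `equal_interval_sizes` and the sample data are made up for illustration, and the slicing rule (equal-width ranges between the minimum and maximum count) is an assumption about what "equal slices" means.

```python
def equal_interval_sizes(tags, steps):
    """Assign each (tag, count) pair a size from 1 to steps using equal-width slices."""
    counts = [count for _, count in tags]
    low, high = min(counts), max(counts)
    width = (high - low) / steps  # every slice covers the same range of counts
    sizes = {}
    for tag, count in tags:
        if width == 0:
            sizes[tag] = 1  # all counts identical, so there is only one slice
        else:
            # Clamp so the maximum count lands in the top slice, not one past it.
            sizes[tag] = min(steps, int((count - low) / width) + 1)
    return sizes


# Made-up, long-tailed counts: a couple of heavy hitters and many small tags.
sample = [("python", 100), ("django", 40), ("mysql", 8), ("s3", 5),
          ("boto", 3), ("csv", 2), ("gis", 2), ("rss", 1)]
sizes = equal_interval_sizes(sample, 4)
```

With this sample, six of the eight tags end up in the smallest size and one of the four sizes goes unused, which is the kind of lopsided result a skewed distribution tends to produce.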

The code

Here’s the part where I run with scissors. I didn't create this algorithm. Instead I implemented this one as a Python function, mostly because it was quick and easy. I might integrate it into a broader class if I ever have the time, but right now, the code looks like this:
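What follows is a rough Python 3 sketch of a function along those lines, not the original listing. The name `make_cloud`, the `+ 1` offset inside the logarithm (there so a count of zero still has a defined log) and the use of `ValueError` are all assumptions, and counts are assumed to be non-negative.

```python
import math


def make_cloud(steps, tags):
    """Map each (tag, count) tuple to a size between 1 and steps.

    Sketch only: the +1 offset inside the log is an assumption.
    """
    if steps <= 0 or not isinstance(tags, list) or not tags:
        raise ValueError("steps must be > 0 and tags must be a non-empty list")
    for item in tags:
        if not isinstance(item, tuple) or len(item) != 2:
            raise ValueError("each item must be a ('tag', count) tuple")

    counts = [count for _, count in tags]
    min_weight = float(min(counts))
    max_weight = float(max(counts))
    new_delta = (max_weight - min_weight) / steps

    # One log-scaled threshold per step; new_delta multiplies the step number.
    thresholds = [
        (math.log(min_weight + i * new_delta + 1), i)
        for i in range(1, steps + 1)
    ]

    results = []
    for tag, count in tags:
        weight = math.log(count + 1)
        # Default to the top step in case rounding leaves the maximum
        # just above the last threshold.
        size = steps
        for limit, step in thresholds:
            if weight <= limit:
                size = step
                break
        results.append({tag: size})
    return results


cloud = make_cloud(6, [("alice", 1), ("bob", 10), ("carol", 100)])
```

The sketch keeps the two-argument shape and returns one small dictionary per tag, so whatever renders the cloud only has to read a name and a step number.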

I’ll break it down. The function takes two arguments: the number of thresholds (different tag sizes) you want to create and a list of tuples in the form ('tag', count). The stuff on top is pretty self-explanatory. It validates the input, creates some necessary lists, yadda yadda. The interesting stuff comes about here:
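As a hedged illustration of that step, this sketch pulls the counts out of the tuples and works out the interval size. Here `new_delta` stands in for the post's newDelta; the helper name `interval_size` and the sample tags are hypothetical.

```python
def interval_size(tags, steps):
    """Return (min_weight, max_weight, new_delta) for a list of (tag, count) tuples."""
    counts = [count for _, count in tags]
    min_weight = float(min(counts))
    max_weight = float(max(counts))
    # The range of counts divided evenly among the requested steps.
    new_delta = (max_weight - min_weight) / steps
    return min_weight, max_weight, new_delta


print(interval_size([("a", 4), ("b", 10), ("c", 64)], 6))  # (4.0, 64.0, 10.0)
```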

The newDelta variable represents the average interval size between your maximum and minimum values, just like you would use in an equal interval model. The function calculates the maximum and minimum automatically from your input. In the log-based algorithm, newDelta works as a multiplier within the log calculations, like so:
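One way to read that, sketched under assumptions: each threshold is the log of the minimum count plus the step number times the interval size. The helper `log_thresholds`, the `+ 1` offset and the sample numbers are all hypothetical, not taken from the original listing.

```python
import math


def log_thresholds(min_weight, new_delta, steps):
    """Return one log-scaled cutoff for each step from 1 to steps."""
    # new_delta multiplies the step number inside the log; the +1 is an assumed offset.
    return [math.log(min_weight + i * new_delta + 1) for i in range(1, steps + 1)]


# Counts running from 4 to 64 in six steps give an interval size of 10.
cutoffs = log_thresholds(4.0, 10.0, 6)
gaps = [later - earlier for earlier, later in zip(cutoffs, cutoffs[1:])]
```

On the log scale, the gap between neighboring cutoffs shrinks as the step number climbs, so `gaps` comes out as a decreasing list.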

According to the white paper, this chunk takes the number of thresholds you specify and creates them to fit the log distribution. It makes for narrow intervals with a few tags at the head of the curve and wide ones with many tags in the tail – just like the distribution shows. The next chunk tests every item in the input list and assigns it to one of the thresholds that were just created:
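A minimal sketch of that assignment step, assuming the cutoffs were built with the same `+ 1` offset used on the counts. The helper name `assign_step` and the example cutoffs are hypothetical.

```python
import math


def assign_step(count, cutoffs):
    """Return the 1-based position of the first cutoff that log(count + 1) fits under."""
    weight = math.log(count + 1)  # same assumed +1 offset used to build the cutoffs
    for step, cutoff in enumerate(cutoffs, start=1):
        if weight <= cutoff:
            return step
    return len(cutoffs)  # guard against rounding at the very top


# Hypothetical cutoffs for counts running from 4 to 64 in six steps of 10.
cutoffs = [math.log(value + 1) for value in (14, 24, 34, 44, 54, 64)]
steps_for_counts = [assign_step(count, cutoffs) for count in (4, 20, 64)]
print(steps_for_counts)  # [1, 2, 6]
```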

The object returned is a list of dictionaries, each mapping a tag name to a number between 1 and the value of the steps argument. You can parse this however you want. For the NICAR-L cloud, a Django view creates some HTML and passes it to a template. Font sizes are simply the threshold number (in this case, 1 through 6) in CSS "em" units.
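This is not the Django view itself, just a plain-Python sketch of turning that list into markup. The function name `render_cloud`, the inline `style` attribute and the sample tags are assumptions for illustration.

```python
from html import escape


def render_cloud(results):
    """Turn a list of {tag: step} dictionaries into a string of sized <span> tags."""
    spans = []
    for entry in results:
        for tag, step in entry.items():
            # The step number doubles as the font size in em units.
            spans.append(
                '<span style="font-size: {}em">{}</span>'.format(int(step), escape(tag))
            )
    return " ".join(spans)


markup = render_cloud([{"alice": 6}, {"bob & co": 1}])
```

Escaping the tag name keeps stray characters such as `&` or `<` from breaking the page.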

The reason

Tag clouds aren't revolutionary, but they display complex information in a simple, interesting way that journalists can benefit from. The Seattle P-I has already jumped on board, using a PHP algorithm open sourced by Chirag Mehta to illustrate themes in Microsoft documents over time. Mehta's method adds word stemming and exclude lists (which I've also implemented in Python and will explain in a later post) to make relevant tag clouds out of speeches and other unstructured blobs of text.

Tag clouds are a cheap and easy way for CAR people to collaborate with Web people, not to mention an excuse (for better or worse) to boost newspapers onto the Web 2.0 bandwagon. Feel free to use my code if it helps you. Tear it to shreds, too -- it probably deserves it.
