About this post
ABOUT: This entry was posted February 11, 2007 at 8:17 p.m. It is 910 words long, which, in case you're curious, translates to about 26 inches.
SUMMARY: In which I outline one technique to apply tag cloud generation to speeches.
TAGS: Python
Tag clouds 2: Word stemming for speeches
Over the last few weeks, I’ve had reason to extend and employ my log-based tag cloud function for a couple of projects here at the Chron, specifically two speeches: the State of the City by Houston Mayor Bill White and the State of the State by Texas Gov. Rick Perry. The results were neat and they didn’t take much time. Here, I’ll show you how it was done.
First, I should summarize my last post: Tag clouds are cool. The best-looking ones come from an algorithm that scales word sizes logarithmically. I implemented this algorithm in Python, and it works like a charm. I’ll be submitting it as a Django feature branch as soon as I have time to breathe.
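For readers who missed that post, the general idea of log-based sizing can be sketched as follows. This is not the makeCloud function from the earlier post; the name log_sizes, the steps parameter and the rounding rule are all assumptions made for illustration.

```python
import math

def log_sizes(pairs, steps=6):
    """Give each (word, count) pair a size bucket from 1 to `steps` on a log scale.

    Hypothetical helper: counts are assumed to be positive integers.
    """
    if not pairs:
        return []
    counts = [n for _, n in pairs]
    lo, hi = math.log(min(counts)), math.log(max(counts))
    span = (hi - lo) or 1.0  # avoid dividing by zero when every count is equal
    sized = []
    for word, n in pairs:
        frac = (math.log(n) - lo) / span  # 0.0 for the rarest word, 1.0 for the most common
        sized.append((word, 1 + int(round(frac * (steps - 1)))))
    return sized

sizes = log_sizes([("budget", 1), ("city", 10), ("running", 100)], steps=5)
print(sizes)
```

Taking the logarithm first keeps one very frequent word from flattening every other word into the smallest size.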
Although my example implementation was designed to work with word/count tuple pairs, adapting it to work with speeches was actually quite easy. I’ll warn you now that my code is pretty messy. I wrote this on deadline, so elegance was not the first priority.
First things first
If you’ve ever looked carefully at the composition of a speech, you’ll understand how creating tag clouds can get a little sticky. Words like “and”, “the” and “a” are scattered everywhere and skew the cloud into a boring mess. In addition, words like “run”, “runner” and “running” are counted separately, despite essentially conveying the same meaning. These are the first two problems a good cloud generator must tackle.
The first issue is the easiest to resolve. If you want your program to exclude common words, simply make a list of words you want it to overlook. I keep mine in a separate file, called (surprise) exclude.py. Within that file is a Python list containing several hundred words: “and”, “the”, “a”, etc.
Developing a list can take a while, so I suggest you steal one from another tag cloud generator and modify it to suit your needs. You’ll find yourself adding plenty of words to help shape the cloud to its input speech, with the ultimate goal being an accurate reflection of its central themes (some interpretation required).
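A minimal sketch of the exclusion step might look like this. The six-word EXCLUDE list is a hypothetical stand-in for the several hundred entries kept in exclude.py, and drop_excluded is an illustrative name, not a function from the original code.

```python
# Hypothetical stand-in for the list kept in exclude.py;
# a real list would hold several hundred entries.
EXCLUDE = ["and", "the", "a", "of", "to", "in"]

def drop_excluded(words, exclude=EXCLUDE):
    """Return the words that are not on the exclusion list (case-insensitive)."""
    banned = set(w.lower() for w in exclude)  # a set makes each lookup cheap
    return [w for w in words if w.lower() not in banned]

words = "the state of the city and the budget".split()
kept = drop_excluded(words)
print(kept)
```

Converting the list to a set once, rather than scanning the list for every word, matters little for a single speech but keeps the check fast as the list grows.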
Next, and not quite so easy, we have to instruct the main loop to account for different forms of the same word. The Porter Stemming Algorithm, published by Martin Porter in 1980, does a great job of reducing words to their roots, so “run” and “running” both become just “run”. (The algorithm is conservative about some suffixes, so a form like “runner” may be left as it is; the three-word example that follows is a simplification.) Thankfully, someone much smarter than I already implemented it in Python. All we need to do is plug it in:
This is going to take some explaining. After the file creation and variable initializations, the first loop creates a dictionary of all the words in the speech keyed to their frequencies, cutting out punctuation – another minor hiccup – using regular expressions. Our three forms of the word “run” would look something like this:
[{'run': 3}, {'running': 12}, {'runner': 5}]
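The counting step could be sketched roughly as below. The regular expression, the lowercasing and the use of collections.Counter (a dict subclass in modern Python) are assumptions; the original pattern is not shown, and this version keeps only unaccented ASCII letters and apostrophes.

```python
import re
from collections import Counter

def count_words(text):
    """Lowercase the text, pull out runs of letters and apostrophes, and count them."""
    # Assumed pattern: anything that is not a-z or an apostrophe acts as a separator.
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [t.strip("'") for t in tokens]  # drop quote marks left at word edges
    return Counter(t for t in tokens if t)

counts = count_words("Run, runner! Running... running; run.")
print(counts)
```

Here the punctuation never reaches the counter at all, because the pattern matches words rather than deleting unwanted characters.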
At the end of that loop, a command stems each word using the Porter Stemmer, adds the stem as a key in a dictionary called stemdict, and appends every word that shares that root to a list of dicts stored under that key, with this being the result:
{'run': [{'run': 3}, {'running': 12}, {'runner': 5}]}
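As a rough illustration of the grouping, the sketch below builds a stemdict of the same shape. Note that toy_stem is a deliberately crude placeholder that strips a handful of suffixes; it is not the Porter algorithm, and group_by_stem is a hypothetical name. A real Porter implementation can be passed in through the stem argument.

```python
def toy_stem(word):
    """Crude stand-in for a real stemmer: strips a few suffixes and nothing more."""
    for suffix in ("ning", "ner", "ing", "er", "s"):
        # Only strip when at least three characters of the word would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def group_by_stem(counts, stem=toy_stem):
    """Map each stem to a list of one-entry {word: count} dicts."""
    stemdict = {}
    for word, n in counts.items():
        stemdict.setdefault(stem(word), []).append({word: n})
    return stemdict

stemdict = group_by_stem({"run": 3, "running": 12, "runner": 5})
print(stemdict)
```

Keeping the stemmer as a parameter means the grouping logic does not change when the placeholder is swapped for a proper implementation.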
The second loop cycles through each item in stemdict, adding up counts of each word that shares a given root. In this case, the total count would be 20 (3 + 12 + 5). The loop then uses a simple sorting function to select the most popular word belonging to each root – running – and assigns the total count to that word, placing the pair in a tuple, which is then added to a list:
('running', 20)
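The second loop can be approximated like this. The function name collapse is hypothetical, and ties between equally common words are broken by whichever comes first in the list, which is an assumption.

```python
def collapse(stemdict):
    """Return a (most_frequent_word, total_count) tuple for every stem."""
    pairs = []
    for root, entries in stemdict.items():
        # Flatten the one-entry dicts into (word, count) pairs.
        flat = [(w, n) for entry in entries for w, n in entry.items()]
        total = sum(n for _, n in flat)              # 3 + 12 + 5 in the example
        top_word = max(flat, key=lambda p: p[1])[0]  # first word wins a tie
        pairs.append((top_word, total))
    return pairs

pairs = collapse({"run": [{"run": 3}, {"running": 12}, {"runner": 5}]})
print(pairs)  # [('running', 20)]
```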
Banned words are then removed from the tuple list, which is subsequently sorted. The resulting list is passed to the makeCloud function I wrote about last time. The function returns sufficient output to dynamically generate the HTML and CSS necessary for the cloud.
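A sketch of that final clean-up, under stated assumptions: the sort order is not specified above, so this version sorts by count (largest first) and then alphabetically, and finalize is a hypothetical name. It stops short of calling makeCloud, whose signature is not shown here.

```python
def finalize(pairs, banned):
    """Drop banned words, then sort by descending count and alphabetically within ties."""
    banned = set(banned)
    kept = [(word, n) for word, n in pairs if word not in banned]
    # Assumed ordering; an alphabetical-only sort would work just as well for a cloud.
    return sorted(kept, key=lambda p: (-p[1], p[0]))

ready = finalize(
    [("running", 20), ("the", 95), ("budget", 7), ("city", 20)],
    banned=["the", "and", "a"],
)
print(ready)
```

Filtering after stemming, as described above, means the exclusion list has to contain the surviving "most popular" form of a word rather than its stem.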
Let’s review
My ugly code makes this process seem a lot more complicated than it is. Essentially, all we’re doing is stemming each word, finding the most popular word belonging to each root, and assigning the aggregate total of all words under that root to that single, most frequently used term.
The result is an approximation, but it’s quite effective. And the Porter Stemmer is the key to making it work. In fact, Porter’s algorithm is versatile enough to be useful in many kinds of English-language text analysis. It’s a handy tool for the CAR (computer-assisted reporting) geek’s toolbox, especially as content analysis becomes more popular.
Of course, if you don’t want to trouble yourself with the ins and outs of tag cloud creation, you can always use Chirag Mehta’s open source Tagline Generator. But what fun would that be without knowing how it works? =)
