About this post
ABOUT: This entry was posted July 1, 2007 at 10:43 p.m. It is 937 words long, which, in case you're curious, translates to about 26 inches.
SUMMARY: Some tidbits I've collected over the last month and a half.
TAGS: Site | Python | Data mining | Databases
Yes, I'm still alive
Many apologies for not posting lately. I'm buried in a few projects that will hopefully launch/run in a month or two. During my month of silence, I've seen some interesting topics pop up in the Journerdist (apologies to Will) blogosphere and I've played with some cool technologies, so I thought I'd cobble together some tidbits.
First a note about the site: Remember how I wrote an S3 backup system specifically so I wouldn't erase all my comments again? Well, someone should have reminded me to use it. My advice: Never write a delete query without testing it somewhere first.
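For what it's worth, here's a minimal sketch of what "testing it somewhere first" could look like in Python, using a throwaway in-memory SQLite database. The comments table, its columns and the guarded_delete helper are all made up for illustration; the idea is simply to run the DELETE inside a transaction and roll it back if it touches more rows than you expected.

```python
import sqlite3


def guarded_delete(conn, sql, params, max_rows):
    """Run a DELETE, but roll it back if it touches more rows than expected."""
    cur = conn.execute(sql, params)
    if cur.rowcount > max_rows:
        conn.rollback()  # undo: the query matched more than we meant it to
        return False, cur.rowcount
    conn.commit()
    return True, cur.rowcount


# Hypothetical comments table in a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT, is_spam INTEGER)"
)
conn.executemany(
    "INSERT INTO comments (body, is_spam) VALUES (?, ?)",
    [("Nice post", 0), ("Buy pills", 1), ("Thanks!", 0)],
)
conn.commit()

# A pattern that is far too broad matches every comment; the guard rolls it back.
ok_all, hit_all = guarded_delete(
    conn, "DELETE FROM comments WHERE body LIKE ?", ("%",), max_rows=1
)

# The intended query only touches the one spam row, so it is committed.
ok_spam, hit_spam = guarded_delete(
    conn, "DELETE FROM comments WHERE is_spam = ?", (1,), max_rows=1
)

remaining = conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]
```

The same pattern should carry over to other databases that support transactions, though the row-count behavior is worth checking against your own driver before trusting it.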
Now, the main course:
Journalists as programmers
There has been a lot of discussion lately about whether and how much journalists should learn to program. I think Matt and Will said it best in different ways.
Do most journalists need to learn how to code? No. Should some? Yes. If you're one of them, get used to teaching yourself. Embrace DIY. But if you just plain hate it, that's okay, too.
I started programming in junior high because it was fun. I stuck with it for a few years, set it down, and picked it up again in college. It still was fun; it still IS fun; so I keep at it.
Bottom line: If you like programming, do it. If you don't, find another technology you enjoy: Flash, CSS design, audio/video -- whatever. No matter what, you can't avoid technology. Learn something fun so it becomes a hobby, not a slog.
P.S. You won't be hurting for a job if your tech-fu is strong. I know of a few openings right now. Let me know if you're interested.
Amazon EC2
After a couple months of waiting, I finally got into the EC2 open beta this week. I'm pretty excited -- particularly because my CAR setup could use a dirt-cheap, infinitely scalable computing cluster to shred through resource-intensive tasks.
A couple of downsides: EC2 instances don't offer persistent storage. Every time you shut one down, all your data, files, and everything else you changed disappear. Likewise, whenever you start a new instance, you have to reload everything. No biggie if you're using it for one-off tasks -- just store your stuff in S3, where transfer to EC2 is free. But if you're running something where uptime is a concern, you'd better have a plan for redundancy. It's not a deal-breaker, but it is annoying.
Not quite as bad is the fact that every new instance has a new IP address. This makes sense, since your instances will always appear in different parts of Amazon space, but you'll need to account for it in off-site scripts, DNS, etc.
Everyblock
Way to go, Adrian.
Like a lot of people, I'm excited to see how this turns out. But the pessimist in me worries about the negative externalities projects like this could create.
Think about this: By enabling and encouraging people to request and disseminate massive public databases -- and lowering the barrier to competition within local markets -- how do you think the records custodians will react?
Say 10 new local startups each start asking for fresh slices of real estate data once a week. The public agency holding that data is left with two choices: streamline its data distribution (a potential positive externality) or slam the door. The latter is probably cheaper and easier.
We can argue about open records entitlement until we're blue in the face, but most journalists know how formidable administrative hurdles can be. To say nothing of laws that allow agencies to charge requesters more if they're likely to use the data commercially. Or agencies looking to monetize their data themselves.
Open source software that helps shine light on government is a great idea. We should just keep an eye out for unintended consequences so we can preempt them accordingly.
Record linkage
Common database knowledge says it's impossible to join two tables on name and address fields alone. But a technique called record linkage has made headway in changing that.
Over the last couple months, I've used two pieces of software -- LinkPlus and febrl -- with varying degrees of success. Febrl (which is a Python/C app) in particular has been extremely effective taking names from one dataset and matching them with names in another -- variations, misspellings and all.
The science behind this stuff is pretty complex, what with its data mining, probabilities and various matching algorithms, but it's very, very useful. Expect more on this later.
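To give a rough feel for the basic idea, here's a toy sketch in Python. To be clear, this is not how Febrl or LinkPlus work internally; it swaps in the standard library's difflib string similarity for the probabilistic machinery those tools use. The names, the sample data and the 0.85 threshold are all made up for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity between two names, from 0.0 (nothing alike) to 1.0 (identical)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(left, right, threshold=0.85):
    """Pair each name in `left` with its best match in `right`, if it clears the threshold."""
    links = []
    for name in left:
        best = max(right, key=lambda candidate: similarity(name, candidate))
        score = similarity(name, best)
        if score >= threshold:
            links.append((name, best, round(score, 2)))
    return links


# Made-up sample data for illustration only.
voters = ["Jonathan Smith", "Maria Gonzales", "Robert O'Neil"]
donors = ["Jonathon Smith", "Maria Gonzalez", "Alice Walker", "Robert ONeil"]

links = link_records(voters, donors)
unmatched = link_records(["Zebediah Quux"], donors)
```

Comparing every name against every other name like this gets slow fast on big tables, and a single threshold on one field will produce false matches. Treat it as a way to see the concept, not a replacement for a real linkage tool.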
Ethical hacking?
Recently, I discovered a large security vulnerability in a government Web application I was asked to beta test. It turned out its input forms were susceptible to an exploit known as SQL injection -- a trick CAR folk should get familiar with, especially if they're developing for the Web.
I told the agency about the problem so they could fix it, but there are two morals to the story: 1.) You should at least understand the concepts of SQL injection, cross-site scripting and other basic Web exploits if you're doing app development; and 2.) you should also know them if you ever need to dissect sites for information.
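If the concept is new to you, here's a minimal sketch of SQL injection in Python against a throwaway in-memory SQLite database. It has nothing to do with the agency's actual app; the permits table, the function names and the payload are all hypothetical. The first function pastes user input straight into the query text, and the second passes it as a parameter.

```python
import sqlite3

# Hypothetical table standing in for whatever a search form queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE permits (id INTEGER PRIMARY KEY, owner TEXT)")
conn.executemany(
    "INSERT INTO permits (owner) VALUES (?)",
    [("alice",), ("bob",), ("carol",)],
)


def search_unsafe(owner):
    # Vulnerable: user input is pasted straight into the SQL text.
    query = "SELECT owner FROM permits WHERE owner = '" + owner + "'"
    return [row[0] for row in conn.execute(query)]


def search_safe(owner):
    # Safer: the ? placeholder keeps the input as data, never as SQL.
    query = "SELECT owner FROM permits WHERE owner = ?"
    return [row[0] for row in conn.execute(query, (owner,))]


# What an attacker might type into the form's "owner" field.
payload = "nobody' OR '1'='1"

leaked = search_unsafe(payload)   # the injected OR clause matches every row
blocked = search_safe(payload)    # no owner has that literal name
```

With the unsafe version, the quote in the payload closes the string early and the rest is read as SQL, so the query dumps the whole table. Parameterized queries are the usual first line of defense, though they're only one part of securing a Web app.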
If you know how Web sites are built, you'll know better how to manipulate them: forcing wildcard dumps, manipulating GET and POST, mining robots.txt -- that kind of thing.
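As one small example of the robots.txt idea, here's a sketch in Python that lists the paths a site asks crawlers to stay out of, which are often the most interesting places to know about. The sample file contents and the disallowed_paths helper are made up; a real file would come from the site's /robots.txt.

```python
def disallowed_paths(robots_txt):
    """Return the paths listed on Disallow lines, in order, without duplicates."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        value = value.strip()
        # An empty Disallow means "nothing is off limits", so skip it.
        if field.strip().lower() == "disallow" and value and value not in paths:
            paths.append(value)
    return paths


# Made-up robots.txt content for illustration only.
sample = """\
User-agent: *
Disallow: /admin/
Disallow: /data/exports/   # nightly dumps
Disallow:
Allow: /public/
"""

paths = disallowed_paths(sample)
```

This ignores which user agent each rule applies to, so it's a quick way to get a list of leads and nothing more.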
And we're out
That's it for now. Expect posts on MapServer, marrying Google Maps with a WMS, record linkage and some spiffy Web tips ... eventually.
