About this post

ABOUT: This entry was posted September 28, 2008 at 11:22 p.m. It is 910 words long, which, in case you're curious, translates to about 26 inches. There are currently 2 comments on this post.

SUMMARY: In which I explain the advantages of using a popular fraud-detection tool in reporting.

TAGS: Databases


Applying Benford's Law to CAR

Posted Sunday, September 28th, 2008 at 11:22 p.m.

In case you missed it at IRE Miami this year, Phil Meyer and Steve Doig put on a great panel about techniques reporters could and should be applying, but, for whatever reason, are not.

One of the techniques Meyer mentioned is known as Benford's Law -- a decades-old mathematical rule that forensic accountants have recently used to spot fraud by examining the distribution of individual digits in large datasets. I've been meaning to test it out for a long time, ever since I came across this old New York Times article earlier this year, but I never took the time until a couple weeks ago.

Say you've got a budget. Using Benford's Law, you'd look at the first digit (or second, third, last, etc.) of each figure listed in the budget. Most people assume that individual digits in something like a budget are randomly distributed -- a 1 is just as likely to be the first digit of any given line item as a 2, a 5 or a 9 -- when in fact that isn't the case. When you're talking about the first digit across line items, a 1 is far more likely to occur than larger digits: it leads about 30 percent of the time, while a 9 leads less than 5 percent of the time. That gap evens out as you look at the second, third and all the way up to the final digit in a set of line items.

If you're lost, don't worry. You'll see what I mean soon enough. The important thing is that checking to see whether individual digits occur at the expected rates can reveal indications of fraud -- particularly when you're looking at data that people can fudge. Think about your expense reports: Do you always enter your expenses to the exact cent? Probably not. Maybe you round up, round down, or play with the numbers a bit if you don't have a receipt. When people alter numbers, they don't tend to do so in accordance with Benford's Law, which is why it can serve as an indicator.

Forensic accountants apply this law from time to time. So do academics. If it's good enough for them, why not use it in journalism?

So, inspired by that idea (and this paper), I decided to apply Benford's Law to several datasets that were collecting dust on my desktop, namely: campaign expenses for six local politicians; a register of credit card transactions from a large government agency; and, as a control, my own debit card records.

How it works in practice

Testing a dataset against Benford's Law is easy. At its most basic level, all it requires is a short SQL statement that invokes a simple string function:

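-- Count how often each leading digit appears in the "amount" field.
-- Amounts under 1 are skipped because their first character is "0" or "-".
SELECT left(amount, 1) AS first_digit, count(*) AS occurrences
FROM expenses
WHERE amount >= 1
GROUP BY first_digit
ORDER BY first_digit ASC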


That query would show you a frequency distribution of first digits in a field called "amount." You could swap the "left" function for a "right" function to test the last digit, or use "mid" to pick out some digits in the middle.
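If your data lives in a spreadsheet export rather than a database, the same tally can be roughed out in Python. This is only a sketch, not the code behind this post: the function names and the sample amounts are made up for illustration, and it assumes the amounts are dollar figures recorded to the cent.

```python
from collections import Counter


def first_digit(amount):
    """Return the first nonzero digit of an amount, e.g. 0.42 -> 4."""
    for ch in str(amount):
        if ch in "123456789":
            return int(ch)
    return None  # an amount of zero has no leading digit


def last_digit(amount):
    """Return the final digit of an amount written to the cent, e.g. 23.95 -> 5."""
    return int(f"{amount:.2f}"[-1])


def digit_counts(amounts, pick_digit):
    """Tally the digit chosen by pick_digit across all amounts."""
    digits = (pick_digit(a) for a in amounts)
    return Counter(d for d in digits if d is not None)


# Made-up sample amounts, purely for illustration.
sample = [12.50, 104.99, 1.25, 0.42, 23.95, 187.00, 9.99, 15.00]
print(sorted(digit_counts(sample, first_digit).items()))
print(sorted(digit_counts(sample, last_digit).items()))
```

Passing the digit-picking function as an argument mirrors the SQL trick of swapping "left" for "right": the counting stays the same and only the digit being examined changes.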

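A bit more complicated is understanding the significance of the results. In short, the law says the distribution of first digits in a large set of numbers should follow a logarithmic curve: a digit d should turn up as the first digit with probability log10(1 + 1/d). That works out to about 30.1 percent for a 1, 17.6 percent for a 2, and so on down to about 4.6 percent for a 9.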
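For reference, the expected first-digit shares come straight from the standard Benford formula, log10(1 + 1/d). Here's a minimal Python sketch that prints all nine; the function name is my own invention for the example.

```python
import math


def benford_first_digit(d):
    """Expected share of numbers whose first digit is d (1 through 9)."""
    return math.log10(1 + 1 / d)


# Print the expected share for each possible first digit.
for d in range(1, 10):
    print(d, f"{benford_first_digit(d):.1%}")
```

Multiply each share by the number of rows in your table and you have the counts to hold up against the query results.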

However, as you test the second and third digits, and so on, you can expect the distribution to progressively flatten out until it looks somewhat uniform. Deviations from this expected distribution should raise questions. For example, maybe employees are simply rounding off their expenses -- or maybe they're fudging them altogether.
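The flattening can be checked with the generalized form of the law, which sums the first-digit formula over every possible run of digits that could come before the position being tested. The sketch below is an illustration under that assumption; the function names are hypothetical, and position 1 means the first digit.

```python
import math


def benford_digit_share(position, digit):
    """Expected share of `digit` at a given position (1 = first digit)."""
    if position == 1:
        # A first digit can never be zero.
        return math.log10(1 + 1 / digit) if digit else 0.0
    # Sum over every possible run of leading digits before this position.
    start, stop = 10 ** (position - 2), 10 ** (position - 1)
    return sum(math.log10(1 + 1 / (10 * lead + digit))
               for lead in range(start, stop))


def spread(position):
    """Gap between the most and least common digit at a position."""
    digits = range(1, 10) if position == 1 else range(10)
    shares = [benford_digit_share(position, d) for d in digits]
    return max(shares) - min(shares)


for position in (1, 2, 3):
    print(position, f"{spread(position):.4f}")
```

The gap between the most and least common digit shrinks at each position, which is why a final-digit test is usually judged against a roughly even split.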

The results

The three datasets I tested each had quirks but mostly held up to the distributions predicted by Benford's Law. An analysis of about 5,000 campaign expenses by local politicians resulted in a first-digit distribution that almost exactly fit the expected values. Campaign contributions to the same candidates did not, but only because many donors favored making $5,000 contributions -- sort of an informal limit in state and county politics here.

The government credit card transactions were more interesting. The first-digit distribution came out as expected.

However, the distribution of final digits seemed askew.

The distribution favors the digit 5, which accounted for about 19 percent of all final digits in the nearly half-million transactions, versus the roughly 10 percent you'd expect if final digits were evenly spread. The digits 9 and 8 also noticeably exceeded their expected proportions.

There may be perfectly good explanations for this. For example, maybe many of this agency's purchases were exempt from sales taxes, which would make their prices end with 99 or 98 cents. Or maybe employees make a habit of reporting, say, $23 transactions as $25 transactions, for ease of record-keeping.

There's no telling. But the deviations here would definitely raise some questions (especially because my own debit card transactions, which I know haven't been fudged, match the law nicely).
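One way to put a number on a lopsided final-digit distribution is a chi-square goodness-of-fit statistic against an even split. I didn't run this on the data above, so treat the following as a sketch: the counts are invented for illustration (they are not the agency's figures), and the names are hypothetical. The cutoff of 16.919 is the standard 5 percent critical value for nine degrees of freedom.

```python
def chi_square_uniform(counts):
    """Chi-square statistic for digit counts against an even split."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((observed - expected) ** 2 / expected for observed in counts)


# Invented counts for final digits 0 through 9; NOT real transaction data.
final_digit_counts = [90, 88, 91, 87, 89, 190, 88, 86, 96, 95]

# Standard 5 percent critical value for 9 degrees of freedom (10 digits - 1).
CRITICAL_VALUE = 16.919

statistic = chi_square_uniform(final_digit_counts)
print(f"chi-square = {statistic:.1f}")
if statistic > CRITICAL_VALUE:
    print("worth a closer look")
else:
    print("consistent with an even split")
```

A caution: with hundreds of thousands of rows, even a tiny wobble will clear the cutoff, so the size of the gap matters as much as the test result.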

What it means for reporters

You can't base a story on it alone, but Benford's Law can tip you off when something is amiss. Campaign finance, voter turnout, crime statistics, tax records and government expenses are all ripe for this type of analysis, as is anything else in which people have an incentive to manipulate numbers.

For what little effort it requires, there's little reason for reporters not to check their data against the distributions expected under Benford's Law. I plan to add it to my standard library of early data-exploration queries, and I'd suggest other reporters do the same.

Comments

  1. Ryan McNeill 12:05 p.m. on September 29

    This is way, way interesting. Thanks for this.

  2. bvo 9:22 a.m. on October 6

    Thank you indeed!

    It would be very interesting if you would like to tell us something more about your “standard library of early data-exploration queries”...
