Michael McDonald's Blog

Standard Deviation for Programmers

If you are a programmer, read this article: Programmers Need To Learn Statistics Or I Will Kill Them All. Maybe I have a soft spot for curmudgeonry from working for Prof. Soloway years ago, but Zed has a very good point and I was guilty of many of the things he's complaining about, happily calculating naked means and saying 1% sounds like a good ratio. Bad Mike.

So I was convinced: if I was going to calculate a mean (which is lossy, like reducing a photograph to a single pixel), I should at least attach a standard deviation as well. Calculating a mean is easy in code: just add everything into one big 'total' value, keep a count of items in a 'count' value, then divide. The standard deviation looks harder. Most introductions to the standard deviation show you something like this formula:

$$ s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2} $$

The problem is that this formula implies that all the values must be known up front in order to calculate the standard deviation. To calculate the standard deviation of 10 million values, you suddenly need to store all 10 million values, calculate the mean (x with a bar over it), then iterate through the 10 million values a second time to calculate the standard deviation. This is bad: memory grows with the data, and you cannot report anything until the last value has arrived. Statistics people assume that you are analyzing a full data set in some statistics application, but programmers often need statistics for analyzing streams of data, not small static populations.
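To make the cost concrete, here is a minimal sketch of that two-pass approach. The class and method names (`TwoPassStdDev`, `sampleStdDev`) are made up for illustration, and it assumes the sample form of the formula (dividing by N - 1). Note that the whole array has to be in memory before the first pass can even start.

```java
// Hypothetical helper, not from the original post: the textbook two-pass calculation.
final class TwoPassStdDev {

    // Sample standard deviation (divides by N - 1); needs every value at once.
    static double sampleStdDev(double[] values) {
        if (values.length < 2) {
            throw new IllegalArgumentException("need at least two values");
        }
        // Pass 1: compute the mean.
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / values.length;

        // Pass 2: sum the squared deviations from the mean.
        double sumOfSquaredDeviations = 0;
        for (double v : values) {
            double deviation = v - mean;
            sumOfSquaredDeviations += deviation * deviation;
        }
        return Math.sqrt(sumOfSquaredDeviations / (values.length - 1));
    }

    public static void main(String[] args) {
        double[] values = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println("standard deviation: " + sampleStdDev(values));
    }
}
```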

I dug deeper, and finally found this variation of the formula:

$$ s = \sqrt{\frac{\sum_{i=1}^{N} x_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)^2}{N-1}} $$
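For anyone who wants to check that this is the same quantity as the textbook formula, here is a short derivation sketch. It assumes the sample form (dividing by N - 1), with N the number of values, x_i the individual values, and \bar{x} their mean, so that \sum x_i = N\bar{x}. Expanding the square in the sum of squared deviations gives:

$$
\begin{aligned}
\sum_{i=1}^{N} (x_i - \bar{x})^2
  &= \sum_{i=1}^{N} x_i^2 - 2\bar{x}\sum_{i=1}^{N} x_i + N\bar{x}^2 \\
  &= \sum_{i=1}^{N} x_i^2 - 2N\bar{x}^2 + N\bar{x}^2 \\
  &= \sum_{i=1}^{N} x_i^2 - N\bar{x}^2 \\
  &= \sum_{i=1}^{N} x_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)^2
\end{aligned}
$$

Dividing by N - 1 and taking the square root gives the variation above. The right-hand side only needs the count, the sum, and the sum of squares, none of which require keeping the individual values around.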

This may have been obvious to the mathematically inclined, but I'm a programmer so really I do a lot more counting than math. (And I don't even need to be right the first time, because I use test-driven development.) You can calculate a running standard deviation by keeping track of the count of values, sum of values, and the sum of the squares of values. Here's some sample code:


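public class RunningStats {
	private long valueCount = 0;           // a count is a whole number
	private double sumOfValues = 0;
	private double sumOfSquaredValues = 0;

	// Fold one more value into the running totals; the value itself is not stored.
	void addValue (double value) {
		valueCount++;
		sumOfValues += value;
		sumOfSquaredValues += value*value;
	}

	void printStats () {
		System.out.println("# values: " + valueCount);
		if (valueCount == 0) {
			return; // no mean to report yet
		}
		double mean = sumOfValues / valueCount;
		System.out.println("mean: " + mean);
		if (valueCount < 2) {
			return; // dividing by (valueCount - 1) needs at least two values
		}
		// Rounding error can push this slightly below zero, so clamp it before the square root.
		double sumOfSquaredDeviations = Math.max(0, sumOfSquaredValues - ((sumOfValues*sumOfValues) / valueCount));
		double standardDeviation = Math.sqrt(sumOfSquaredDeviations / (valueCount - 1));
		System.out.println("standard deviation: " + standardDeviation);
	}
}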
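One caveat with the sum-of-squares approach: when the mean is large compared to the spread, subtracting two huge, nearly equal numbers can lose most of the precision. A commonly used alternative is Welford's online algorithm, which is a different technique from the formula above but keeps the same "one value at a time, nothing stored" property. The sketch below is only an illustration; the class name `WelfordStats` and its method names are made up here, and it assumes the sample form (dividing by N - 1).

```java
// Hypothetical sketch of Welford's online algorithm; not the code from the post above.
final class WelfordStats {
    private long count = 0;
    private double mean = 0;
    private double m2 = 0; // running sum of squared deviations from the current mean

    // Update the running mean and squared-deviation total with one new value.
    void addValue(double value) {
        count++;
        double delta = value - mean;   // distance from the old mean
        mean += delta / count;         // move the mean toward the new value
        m2 += delta * (value - mean);  // uses both the old and the new mean
    }

    double mean() {
        return mean;
    }

    // Sample standard deviation; NaN until at least two values have been added.
    double sampleStdDev() {
        if (count < 2) {
            return Double.NaN;
        }
        return Math.sqrt(m2 / (count - 1));
    }

    public static void main(String[] args) {
        WelfordStats stats = new WelfordStats();
        for (double v : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            stats.addValue(v);
        }
        System.out.println("mean: " + stats.mean());
        System.out.println("standard deviation: " + stats.sampleStdDev());
    }
}
```

The trade-off is a division per value instead of a multiplication, which is usually a fair price when the values are large or tightly clustered.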

Recommended reading: How to Lie with Statistics

Posted 4/4/06 12:14pm
 
© 2013 Michael McDonald. All rights reserved.