Michael McDonald

## Standard Deviation for Programmers

If you are a programmer, read this article: Programmers Need To Learn Statistics Or I Will Kill Them All. Maybe I have a soft spot for curmudgeonry from working for Prof. Soloway years ago, but Zed has a very good point and I was guilty of many of the things he's complaining about, happily calculating naked means and saying 1% sounds like a good ratio. Bad Mike.

So I was convinced: if I was going to calculate a mean (which is lossy, in the sense of reducing a photograph to a single pixel), I should at least attach a standard deviation as well. But calculating a mean is easy in code: just add everything into one big 'total' value, keep a count of items in a 'count' value, then divide. Most introductions to the standard deviation show you something like this formula:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

where $x_i$ are the values, $n$ is the number of values, and $\bar{x}$ is their mean.

The problem is that this formula implies that all the values in the population must be known in order to calculate the standard deviation. To calculate the standard deviation of 10 million values, suddenly you need to store the 10 million values, calculate the mean (x with a bar over it), then iterate through the 10 million values to calculate the standard deviation. This is bad. Statistics people assume that you are analyzing a full data set in some statistics application, but programmers often need statistics for analyzing streams of data, not small static populations.
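
To make the cost concrete, here is a minimal sketch of the two-pass approach that formula suggests. The class name `TwoPassStats` and its method are hypothetical, invented for this illustration, and it assumes the sample form of the standard deviation (dividing by n - 1) with at least two values. Note that the whole array has to be in memory before the first pass can even start.

```java
// Hypothetical helper for illustration; not taken from any library.
final class TwoPassStats {
    // Sample standard deviation (n - 1 denominator), computed the textbook way.
    // Needs every value in memory and assumes at least two values.
    static double standardDeviation(double[] values) {
        int n = values.length;

        // Pass 1: compute the mean.
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / n;

        // Pass 2: add up the squared distances from the mean.
        double sumOfSquaredDiffs = 0;
        for (double v : values) {
            double diff = v - mean;
            sumOfSquaredDiffs += diff * diff;
        }
        return Math.sqrt(sumOfSquaredDiffs / (n - 1));
    }

    public static void main(String[] args) {
        double[] values = {2, 4, 6};
        System.out.println(standardDeviation(values)); // prints 2.0
    }
}
```

For three values this is harmless. For 10 million values arriving over a socket, the `double[]` parameter is exactly the problem.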

I dug deeper, and finally found this variation of the formula:

$$s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}{n-1}}$$

This may have been obvious to the mathematically inclined, but I'm a programmer so really I do a lot more counting than math. (And I don't even need to be right the first time, because I use test-driven development.) You can calculate a running standard deviation by keeping track of the count of values, sum of values, and the sum of the squares of values. Here's some sample code:

``````
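// The class name and the addValue signature are assumptions;
// the snippet as posted was missing them.
public class RunningStats {
    private double valueCount = 0;
    private double sumOfValues = 0;
    private double sumOfSquaredValues = 0;

    // Fold one value into the running totals; the value itself is not stored.
    void addValue(double value) {
        valueCount++;
        sumOfValues += value;
        sumOfSquaredValues += value * value;
    }

    void printStats() {
        System.out.println("# values: " + valueCount);
        double mean = sumOfValues / valueCount;
        System.out.println("mean: " + mean);
        // Sample standard deviation (n - 1 denominator); needs at least two values.
        double standardDeviation = Math.sqrt((sumOfSquaredValues - ((sumOfValues * sumOfValues) / valueCount)) / (valueCount - 1));
        System.out.println("standard deviation: " + standardDeviation);
    }
}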
``````
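
If you would rather get numbers back than print them, the same three running totals can sit behind a few accessor methods. This is only a sketch: `StreamingStats` and its method names are hypothetical, the NaN return for fewer than two values is my own choice, and the clamp before the square root is a defensive assumption rather than part of the formula.

```java
// Hypothetical class and method names, chosen for this sketch.
final class StreamingStats {
    private long count = 0;
    private double sum = 0;
    private double sumOfSquares = 0;

    // Fold one value into the running totals.
    void add(double value) {
        count++;
        sum += value;
        sumOfSquares += value * value;
    }

    long count() {
        return count;
    }

    double mean() {
        return sum / count;
    }

    // Sample standard deviation; NaN until at least two values have been added.
    double standardDeviation() {
        if (count < 2) {
            return Double.NaN;
        }
        double variance = (sumOfSquares - (sum * sum) / count) / (count - 1);
        // Rounding can push a tiny variance slightly below zero; clamp before sqrt.
        return Math.sqrt(Math.max(variance, 0.0));
    }

    public static void main(String[] args) {
        StreamingStats stats = new StreamingStats();
        for (double v : new double[] {2, 4, 6}) {
            stats.add(v);
        }
        System.out.println("mean: " + stats.mean()); // prints "mean: 4.0"
        System.out.println("standard deviation: " + stats.standardDeviation()); // prints "standard deviation: 2.0"
    }
}
```

One caveat worth knowing: this formula subtracts two large, nearly equal numbers, so when the mean is huge compared to the spread of the values, floating-point rounding can eat most of the precision. For typical data it behaves fine, but it is worth a test case with your own numbers.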

Recommended reading: How to Lie with Statistics