Martin Gladdish

Software 'n' stuff

Statistical Analysis in Software Development

I was struck, whilst reading Ben Goldacre’s excellent book, Bad Science, by the sheer volume of data surrounding medicine. Disease rates in populations, the probability of side effects per dosage, the number of citizens with a particular condition: there is a huge amount of data to analyse and cross-reference.

What struck me in particular is that these are very good datasets built around incredibly subjective measures. Treatments, illnesses, symptoms and the like are all very woolly and open to wide interpretation, yet the medical community is still able to bring the full weight of statistics to bear. The standout example is the Cochrane meta-analysis of pre-existing studies, which can spot statistically significant trends across studies that none of the individual studies could detect on its own.
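To make the meta-analysis point concrete, here is a minimal sketch of fixed-effect, inverse-variance pooling, one common way of combining studies. The three (effect, standard error) pairs are numbers I have invented purely for illustration, and `pool_fixed_effect` is a hypothetical helper name; this is not a description of how Cochrane reviews are actually carried out.

```python
import math

def pool_fixed_effect(studies):
    """Combine (effect, standard_error) pairs by inverse-variance weighting."""
    # More precise studies (smaller standard error) get more weight.
    weights = [1.0 / se ** 2 for _, se in studies]
    total_weight = sum(weights)
    pooled = sum(w * effect for w, (effect, _) in zip(weights, studies)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se

# Invented example data: three small studies reporting the same modest effect.
studies = [(0.3, 0.2), (0.3, 0.2), (0.3, 0.2)]

Z_CRIT = 1.96  # two-sided 5% threshold for a normal test statistic

individual_z = [effect / se for effect, se in studies]
pooled, pooled_se = pool_fixed_effect(studies)
pooled_z = pooled / pooled_se

print("individual z-scores:", individual_z)
print("pooled z-score:", pooled_z)
```

With these made-up inputs, no single study clears the 1.96 threshold, but the pooled estimate does, because combining the studies shrinks the standard error.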

Software development has nothing like the same rigour. It still relies completely on subjective intuition applied to tiny datasets; even huge multinational consultancies, working on hundreds of projects at any one time, see only a vanishingly small proportion of the total number of software projects.

I reckon there’s real scope for someone who really understands statistics to change the way we understand development techniques and methodologies. There isn’t even a recognised set of data that we should be collecting about software projects on which to base comparisons. Estimated time, actual effort, number of defects and cyclomatic complexity are all good places to start, just off the top of my head, but I’m sure everyone will agree that many more measures will be necessary to build up a reasonable picture. Yes, there’s a growing number of people who think that agile development works: that it is more productive, produces code more quickly with fewer defects, delivers better business value and the like. I agree; from my wholly subjective perspective, the projects that have been more agile have worked better. But my point here is that subjective opinion is not enough. Show me the numbers.
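As a rough sketch of what a per-project record built from those four starting measures might look like, something like the following would do. The class name, field names, units (days) and the example values are all my own assumptions, not an established standard.

```python
from dataclasses import dataclass

@dataclass
class ProjectRecord:
    """One row of a hypothetical cross-project dataset."""
    name: str
    estimated_days: float              # estimated time
    actual_days: float                 # actual effort
    defects: int                       # number of defects recorded
    mean_cyclomatic_complexity: float  # averaged over the codebase

    def estimate_overrun(self) -> float:
        """Actual effort relative to the estimate; 1.0 means on target."""
        return self.actual_days / self.estimated_days

# Invented values, purely to show the shape of the data.
example = ProjectRecord(
    name="hypothetical-project",
    estimated_days=40.0,
    actual_days=50.0,
    defects=12,
    mean_cyclomatic_complexity=4.5,
)
print(example.estimate_overrun())
```

Records like this only become useful for comparison once many projects report them in the same way, which is exactly the agreed set of measures that is currently missing.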

The downside is that I recognise I’m not remotely qualified to perform this analysis myself. My understanding of statistics is rubbish. Does anyone know of suitable statistics resources?