Measuring Software

A while back I read Making Software – it made me disappointed at the state of academic research into the practice of developing software. I just read Leprechauns of Software Engineering which made me angry about the state of academic research.

Laurent provides a great critique of some of the leprechauns of our industry and why we believe in them. But it just highlighted to me how little we really know about what works in software development. Our industry is driven by fashion because nobody has any objective measure on what works and what doesn’t. Some people love comparing software to the medical profession, to torture the analogy a bit – I think today we’re in the blood letting and leeches phase of medical science. We don’t really know what works, but we have all sorts of strange rituals to do it better.

Rituals

Rituals? What rituals? We’re professional software developers!

But of course, we all get together every morning for a quick status update. Naturally we answer the three questions strictly, any deviations are off-topic. We stand up to keep the meeting short, obviously. And we all stand round the hallowed scrum kanban board.

But rituals? What rituals?

Do we know if stand ups are effective? Do we know if scrum is effective? Do we even know if TDD works?

Measuring is Hard

Not everything that can be measured matters; not everything that matters can be measured

I think we have a fundamental problem when it comes to analysing what works and what doesn’t. As a developer there are two things I ultimately need to know about any practice/tool/methodology:

  1. Does it get the job done faster?
  2. Does it result in less bugs / lower maintenance cost?

This boils down to measuring productivity and defects.

Productivity

Does TDD make developers more productive? Are developers more productive in Ruby or Java? Is pairing productive?

These are some fascinating questions that, if we had objective, repeatable answers to would have a massive impact on our industry. Imagine being able to dismiss all the non-TDD doing, non-pairing Java developers as being an unproductive waste of space! Imagine if there was scientific proof! We could finally end the language wars once and for all.

But how can you measure productivity? By lines of code? Don’t make me laugh. By story points? Not likely. Function points? Now I know you’re smoking crack. As Ben argues, there’s no such thing as productivity.

The trouble is, if we can’t measure productivity – it’s impossible to compare whether doing something has an impact on whether you get the job done faster or not. This isn’t just an idle problem – I think it fundamentally makes research into software engineering practices impossible.

It makes it impossible to answer these basic questions. It leaves us open to fashion, to whimsy and to consultants.

Quality

Does TDD help increase quality? What about code reviews? Just how much should you invest in quality?

Again, there are some fundamental questions that we cannot answer without measuring quality. But how can you measure quality? Is it a bug or a feature? Is it a user error or a requirements error? How many bugs? Is an error in a third party library that breaks several pages of your website one bug or dozens? If we can’t agree what a defect is or even how to count them how can we ever measure quality?

Subjective Measures

Maybe there are some subjective measures we could adopt. For example, perhaps I could monitor the number of emails to support. That’s a measure of software quality. It’s rather broad, but if software quality increases, the number of emails should decrease. However, social factors could so easily dwarf any actual improvement. For example, if users keep reporting software crashes and are told by the developers “yeah, we know”. Do you keep reporting it? Or just accept it and get on with your life? The trouble is, the lack of customer complaints doesn’t indicate the presence of quality.

What To Do?

What do we do? Do we just give up and adopt the latest fashion hoping that this time it will solve all our problems?

I think we need to gather data. We need to gather lots of data. I’d like to see thousands of dev teams across the world gathering statistics on their development process. Maybe out of a mass of data we can start to see some general patterns and begin to have some scientific basis for what we do.

What to measure? Everything! Anything and everything. The only constraint is we have to agree on how to measure it. Since everything in life is fundamentally a problem of lack of code, maybe we need a tool to help measure our development process? E.g. a tool to measure how long I spend in my IDE, how long I spend testing. How many tests I write; how often I run them; how often I commit to version control etc etc. These all provide detailed telemetry on our development process – perhaps out of this mass of data we can find some interesting patterns to help guide us all towards building software better.

Measuring Code

How good is your code? If you’re like the other 80% of above average developers, then I bet your code is pretty awesome. But are you sure? How can you tell? Or perhaps you’re working on a legacy code base – just how bad is the code? And is it getting better? Code metrics provide a way of measuring your code – some people love ’em, but some hate ’em.

The Good

Personally I’ve always found metrics very useful. For example – code coverage tools like Emma can give you a great insight into where you do and don’t have test coverage. Before embarking on an epic refactor of a particular package, just how much coverage is there? Maybe I should increase the test coverage before I start tearing the code apart.

Another interesting metric can be lines of code. While working in a legacy code base (and who isn’t?), if you can keep velocity consistent (so you’re still delivering features) but keep the volume of inventory the same or less, then you’re making the code less crappy while still delivering value. Any idiot can implement a feature by writing bucket loads of new code, but it takes real craftsmanship to deliver new features and reduce the size of the code base.

The Bad

The problem with any metric is who consumes it. The last thing you want is for an over eager manager to start monitoring it.

You can’t control what you can’t measure
— Tom DeMarco

Before you know it, there’s a bonus attached to the number of defects raised. Or there’s a code coverage target everyone is “encouraged” to meet.

As soon as there’s management pressure on a metric, smart people will game the system. I’ve lost count of the number of times I’ve seen people gaming code coverage metrics. In an effort to please a well meaning but fundamentally misguided manager, developers end up writing tests with no assertions. Sure, the code ran and didn’t blow up. But did it do the right thing? Who knows! And if you introduce bugs, will your tests catch it? Hell, no! So your coverage is useless.

The target was met but the underlying goal – improving software quality – has not only been missed, it’s now harder to meet in future.

The Ugly

The goal of any metric is to measure something useful about the code base. Take code coverage, for example – really what we’re interested in is defect coverage. That is, out of the universe of all possible defects in our code, how many would cause a failure in at least one test? That’s what we want to know – how protected are we against regressions in the code base.

The trouble is, how can I measure “the universe of all possible defects” in a system? Its basically unknowable. Instead, we use code coverage as an approximation. Given that tests assert the code did the right thing, the percentage of code that has been executed is a good estimation of the likelihood of bugs being caught by them. If my tests execute 50% of the code, at best I can catch bugs in 50% of the code. If there are bugs in the other 50%, there’s zero chance my tests will find them. Code coverage is an upper bound on test coverage. But, if your tests are shoddy, test coverage can be much lower. To the point where tests with no assertions are basically useless.

And this is the difficulty with metrics: measuring what really matters – the quality of our software – is hard, if not impossible. So instead we have to measure what we can, but it isn’t always clear how that relates to our underlying goal.

But what does it mean?

There are some excellent tools out there like Sonar that give you a great overview of your code using a variety of common metrics. The trouble often is that developers don’t know (or care) what they mean. Is a complexity of 17.0 / class good or bad? I’m 5.6% tangled – but maybe there’s a good reason for that. What’s a reasonable target for this code base? And is LCOM4 a good thing or a bad thing? It sounds like a cancer treatment, to be honest.

Sure, if I’m motivated enough I can dig in and figure out what each metric means and we can try and agree reasonable targets and blah blah blah. C’mon, I’m busy delivering business value. I don’t have time for that crap. It’s all just too subtle so it gets ignored. Except by management.

A Better Way

Surely there’s got to be a better way to measure “code quality”?

1. Agree

Whatever you measure, its important the team agree and understand what it means. If there’s a measure half the team don’t agree with, then its unlikely it will get better. Some people will work towards improving it, others won’t so will let it get worse. The net effect is likely to be heartache and grief all round.

2. Measure What’s Important

Downward chartYou don’t have to measure the “standard” things – like code coverage or cyclomatic complexity. As long as the team agree its a useful thing to measure, everyone agrees it needs improving and can commit to improving it – then its a useful measure.

A colleague of mine at youDevise spent his 10% time building a tool to track and graph various measures of our code base. But, rather unusually, these weren’t the usual metrics that the big static analysis tools gather – these were much more tightly focused, much more specific to the issues we face. So what kind of things can you measure easily yourself?

  • If you have a god class, why not count the number of lines in the file? Less is better.
  • If you have a 3rd party library you’re trying to get rid of, why not count the number of references to it.
  • If you have a class you’re trying to eliminate, why not count the number of times its imported?

These simple measures represent real technical debt we want to remove – by removing technical debt we will be improving the quality of our code base. They can also be incredibly easy to gather, the most naive approach only needs grep & wc.

It doesn’t matter what you measure, as long as the team believe whatever you do measure should be improved; then it gives you an insight into the quality of your code base, using a measure you care about.

3. Make It Visible

Finally, put this on a screen somewhere – next to your build status is good. That way everyone can see how you’re doing and gets a constant reminder that quality is important. This feedback is vital – you can see when things are getting better and, just as importantly, when things start to slip and the graph veers ominously upwards.

Keep It Simple, Stupid

Code quality is such an abstract concept its impossible to measure. Instead, focus on specific things you can measure easily. The simpler the metric is to understand the easier it is to improve. If you have to explain what a metric means you’re doing it wrong. Try and focus on just a few things at any one time – if you’re tracking 100 different metrics its going to be sheer luck that on average they’re all getting better. If we instead focus on half a dozen, I can remember them – the very least I’ll do is not let them get worse; and if I can, they’ll be clear in my mind so I can improve them.

Do you use metrics? If so, what do you measure? If not, do you think there’s something you could measure?