Software development metrics

Software development is a skill. And for the moment, I mean that in an institutional sense. It's a skill the organization has. It's something that an organization can be good at. And it's something that an organization can become better at, with practice. All too often, though, development organizations don't practice. Of course they continue to develop software, but it's usually just repetition. The difference between practice and repetition is that with practice you get feedback and incorporate it into future sessions. Usually, when you talk about practice, it's in the context of a sport, or maybe a class, and the feedback comes from a coach or professor. It would be an unusual software development organization that had a coach, though. And so, they often just repeat the same process over and over, without ever learning from it and without ever improving.

That's where metrics come in. Metrics provide feedback, and you can use that feedback to improve. Even better than that, you can—to some extent—measure your improvement. But there is a well-known fact about metrics: you get what you measure. And there's also a well-known fact about people: we're lazy and self-interested. If you start measuring something, people will make the easiest change they can in order to optimize for that new metric. This reaction is entirely reasonable if you think about it. You're telling people that X is important, and that maximizing (or minimizing) X is valuable. They believe you and try to do that. And they try to do it in the easiest or most obvious way, because why would they make things harder for themselves? Thus it's very important to consider not just what you're measuring, but what behavior it will promote. Because ultimately, what you should care about is the behavior; the metric is only a tool to encourage better behavior.

Now that we all agree that we should be tracking some metrics, and why, what are good metrics to track? That's a good question, but the one I'm going to answer instead is: what's a bad metric? Code coverage is a bad metric (or test coverage, if you want to think about it that way). If you don't believe me, just ask Microsoft Research. Of course, having more coverage from your tests is a good thing. But it's only a good thing if your tests are good. Tracking code coverage encourages writing tests whose purpose is to cover more code, and tests written to increase coverage are rarely as good as tests written to verify behavior. I'm not suggesting that you shouldn't try to understand which parts of your code base do and don't have test coverage. That's obviously valuable and useful information. Just don't make it part of your feedback loop.
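To make that concrete, here's a small hypothetical illustration (the function and test names are invented, not taken from any real code base). Both tests execute every line of the function, so a coverage tool scores them identically, but only one of them can ever fail.

```python
def apply_discount(price, percent):
    """Toy function under test -- purely illustrative."""
    return price - price * percent / 100


def test_written_for_coverage():
    # Executes every line of apply_discount, so coverage goes up,
    # but it asserts nothing and can never catch a regression.
    apply_discount(100, 10)


def test_written_for_correctness():
    # Identical coverage, but this one actually verifies behavior.
    assert apply_discount(100, 10) == 90
    assert apply_discount(100, 0) == 100
```

Measured by coverage alone, the two tests look equally valuable; if coverage is the number you reward, the first kind is the kind you'll get more of.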

The most useful metrics for software development will have some common attributes. You want them to be reliable, predictive, and automatic. You want them to measure some collection of things, so that focusing on one thing (like adding more coverage to tests) at the expense of the others (like writing good tests in the first place) won't actually improve the measured result.

This brings me back to the unanswered question: what is a good metric? I have some (reasonably) novel suggestions which I'll save for articles of their own. For now, I'll suggest another metric that is already very common: defect density. I've never worked at a serious software organization that didn't track this somehow. But there are multiple ways to define the term, and some are more useful than others. The worst is to define it as the number of defects per test case. The easy and obvious way to move that needle is to add more tests. Assuming the actual quality of the code stays the same, adding more tests will reduce defect density, and that is clearly not the desired outcome. Doubly so if too much attention is paid to the number of tests; in that case, the almost certain result is that things which are easy to test will be covered by multiple redundant tests, and things that are hard to test will get much less attention. The better alternative is to define the metric as the number of defects per story. If you don't have user stories, then per feature, per acceptance criterion, etc. This metric is still gameable, but the behavioral change is a more desirable one: even if code quality remains unchanged, writing more focused user stories (or more detailed acceptance criteria) will cause the defect density to go down. And as a side effect, you will likely get better code quality, because you'll have better requirements to code against.
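Here's a rough back-of-the-envelope sketch of why the denominator matters. The numbers are made up and only there to show the shape of the problem:

```python
# Hypothetical numbers for a single release -- purely illustrative.
defects = 12
test_cases = 300
stories = 40

print(defects / test_cases)   # 0.04 defects per test case

# Add 300 redundant test cases without touching the code at all...
test_cases += 300
print(defects / test_cases)   # 0.02 -- the metric "improved" anyway

# Defects per story only moves if the defect count changes, or if
# the stories themselves are scoped more carefully.
print(defects / stories)      # 0.3 defects per story
```

The per-test-case number can be halved just by padding the test suite; the per-story number only improves when the defects go away or the stories get sharper, and both of those are changes you actually want.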