Metrics

The new CEO at my work has directed the heads of all departments to institute metrics for measuring their departments. In a lot of departments, that makes sense. Not so much in IT. We have a single department covering programmers, system/network admins, and computer support, so finding metrics that fit all of those roles was proving very difficult. As a programmer, I was outright against some of the ideas, such as tracking bug counts as a metric. Our director asked for input, and here’s the email I sent in preparation for the meeting we were going to have on the subject, slightly modified for blog posting.

> I’m going to comment on the larger issue of metrics in general.
>
> My general thought is that measuring software development this way is fraught with danger and unintended consequences. I know this is coming down from above, so there’s not much that can be done, but things like metrics and performance-based bonuses have to be approached carefully or they risk incentivizing the wrong things and/or disincentivizing the very thing you were trying to improve in the first place.
>
> First, here’s a great article from Joel Spolsky on a related topic, incentive pay: http://www.joelonsoftware.com/articles/fog0000000070.html While it isn’t strictly about metrics, it’s a great read and one of my favorite explanations of the unforeseen problems that come from processes that seem like good ideas at first. It also touches on some important things you’ll never be able to build metrics for.
>
> Next, here’s a very thoughtful comment someone made on Joel Spolsky’s discussion forum about research into metrics. It puts things the way I would put them, so I’m just going to paste it in.
>
> > “My manager recently asked me to do some preliminary investigations re: the use of metrics. What I came up with (below) drew heavily on various threads discussing the topic here on JoS (which were the source of most of the supporting quotes):
> >
> > 1) Metrics are very difficult to do well.
> > - In the context of software engineering, “quality” and “productivity” are very hard to objectively quantify. In the words of Carnegie-Mellon’s SEI: “Unfortunately, most of the metrics defined have lacked one or both of two important characteristics: a sound conceptual, theoretical basis; and statistically significant experimental validation”
> >
> > As one programmer put it, “… varying projects have wildly differing levels of difficulty. If my colleague spends two weeks writing a reusable, documented thread pooling class, and I spend two weeks dropping controls on forms, he may end up with 200 lines of code and 5 bugs; I may end up with 2000 lines of code and no bugs. Really, he’s the hero and I’m average. But how will metrics explain this?”
> >
> > 2) Unless done well, metrics do more harm than good.
> > - Beware the ‘law of unintended consequences’ - accidentally creating incentives/disincentives for the wrong thing(s). Example: if checkins are used as a measure of productivity, incentive is created to check in more often (e.g. at the end of every workday) rather than when it makes sense to do so (e.g. when coding + unit test are complete).
> >
> > “Whatever you decide to measure is what you are going to get… And you’ll get NOTHING else. These sorts of extrinsic measurements (and rewards based thereon) cause you to be less focused on your work and more on the extrinsic measurements. You’ll be thinking “how many hours did I bill today” instead of “what button name is going to be clearest to the customer, resulting in fewer tech support calls, happier customers, and higher net revenue.”
> >
> > 3) Lines of code is not a reliable indicator of quality or productivity.
> > - There’s an implicit assumption that more code = better, when in software precisely the opposite is often true. Example: if lines of code are used to measure productivity, there’s an incentive to cut and paste duplicate blocks of code (the larger the better) instead of creating reusable functions, so as to artificially inflate ‘output’.
> >
> > “… the best programmer is generally the one who takes the most code away, not the one who adds the most code …”
> > “The best developers … spend the bulk of their time analyzing the problem, and a small portion cranking out compact, clean code.”
> > “… a one-line PERL program can be much harder to understand (and therefore more complex) than a 10 line one that does the same thing…”
> >
> > 4) Metrics should NEVER be used to rate individuals, e.g. for performance evaluations and/or to determine compensation
> > - Effort will be expended (often successfully) to “game” the system. Example: if developers are rewarded for fixing bugs, there is incentive to intentionally introduce bugs (even if presumably easy-to-repair ones) to increase opportunities to garner rewards. Alternatively, if number of reported bugs in a developer’s code is a factor in their appraisal, there is a strong DISincentive for QA to report bugs - either the bug tracking system will be bypassed, or bugs will simply go unreported.
> >
> > Other examples are as cited in 2) and 3) above.
> >
> > 5) If correctly designed, aggregated metrics CAN be useful for measuring the productivity of a team. Collecting metrics for an entire project over time can mitigate some of the local variability that leads to the weaknesses described above. But they must be collected consistently over time until a meaningful body of history is accumulated, and even then the limitations of the metrics so gathered must be understood and acknowledged.”
>
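> To make point 3 above concrete, here’s a contrived sketch (hypothetical code, not from our codebase). Both versions below do exactly the same validation, but a lines-of-code metric rewards the first one:
>
> ```python
> # Both versions do the same work; a lines-of-code metric rewards the
> # copy-paste version anyway.
> errors = []
> name, city, state = "Springfield", "", "A state name that is far too long to keep"
>
> # Copy-paste version: triple the lines, and any bug fix has to be
> # repeated in every duplicated block.
> if not name:
>     errors.append("name is required")
> elif len(name) > 20:
>     errors.append("name is too long")
> if not city:
>     errors.append("city is required")
> elif len(city) > 20:
>     errors.append("city is too long")
> if not state:
>     errors.append("state is required")
> elif len(state) > 20:
>     errors.append("state is too long")
>
> # Reusable version: a third of the lines and one place to fix, yet it
> # "scores" worse if lines of code are the measure.
> def validate(field, value, out, max_len=20):
>     if not value:
>         out.append(field + " is required")
>     elif len(value) > max_len:
>         out.append(field + " is too long")
>
> errors2 = []
> for field, value in [("name", name), ("city", city), ("state", state)]:
>     validate(field, value, errors2)
>
> assert errors == errors2  # identical behavior, very different "productivity"
> ```
>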
> I’m 100% behind trying to produce better quality work; I work hard on my own to improve my code and my processes. But if the goal is to improve quality, it’s much more complicated than gathering some numbers and seeing whether they go up or down in six months. Improving quality is a process, and a complicated one at that.
>
> One of the things my coworker and I spoke about when he was gathering ideas for QA was the philosophy of owning quality across the whole development process, not just in testing. Part of that could be running code complexity and static bug-analysis tools on our code and using that information, over time, to raise the quality of our work. But if we implemented those tools just so someone could record bug counts and use them against us in a review, the incentive to use them to improve overall quality would be lost. I’m not sure what the answer is if we’re required to find some numbers to record and post. It’s a hard problem, and one we certainly need to think carefully about.
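
To give a sense of what I mean by aggregate, project-level numbers, here’s a rough sketch (hypothetical code, nothing we actually run) that uses Python’s standard `ast` module to average function size and a crude branch count across a source tree. Numbers like these are only meaningful as a trend for a whole project over time, never as a score for an individual.

```python
# metrics_sketch.py -- a rough sketch (Python 3.8+), not a finished process.
# Walks a source tree and reports project-wide averages of function size
# and a crude branch count (a rough stand-in for cyclomatic complexity).
import ast
import os
import sys

# Node types that introduce a branch; counting them approximates how many
# paths there are through a function.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)

def function_stats(source):
    """Yield (name, line_count, branch_count) for each function."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            lines = node.end_lineno - node.lineno + 1
            branches = sum(isinstance(n, BRANCH_NODES) for n in ast.walk(node))
            yield node.name, lines, branches

def main(root):
    count = total_lines = total_branches = 0
    for dirpath, _dirs, files in os.walk(root):
        for filename in files:
            if not filename.endswith(".py"):
                continue
            with open(os.path.join(dirpath, filename)) as f:
                try:
                    for _name, lines, branches in function_stats(f.read()):
                        count += 1
                        total_lines += lines
                        total_branches += branches
                except SyntaxError:
                    continue  # skip files that don't parse
    # Report aggregates only: the useful signal is the trend over time,
    # not a ranking of individuals.
    if count:
        print("functions:", count)
        print("avg lines/function: %.1f" % (total_lines / count))
        print("avg branches/function: %.1f" % (total_branches / count))

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else ".")
```

You’d run it as `python metrics_sketch.py path/to/project` and log the output somewhere; after a few months, the direction of the trend says a lot more than any single snapshot.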

At our meeting we ended up choosing three metrics that everybody could live with, and we dodged the bullet of using our bug tracking system to produce metrics. That was my goal for the meeting, so I’m happy with how it turned out. The programmers also decided to work on our own informal processes for improving quality and to use our own peer metrics to help with this, which will be great. I’ll write more about this as we work on it.