People often ask what metrics can be used to assess the health of a process. They usually bring up team-level or individual metrics like throughput (how fast a person/team pushes work out the door) or its inverse, cycle time (time from when a team starts work on something until it finishes). None of these are particularly helpful.
However, the four DORA* metrics (expanded a bit from a pure Ops focus) are universal and work well; a rough sketch of how they might be computed follows the list:
(1) Lead time (time from idea/problem discovery until a solution is in the customer's hands). This is NOT a team-level metric unless a single team handles the entire process from discovery to delivery. It is NOT cycle time. It is NOT throughput. I should add that I’m using the standard definition of lead time here. The DORA metric, which is DevOps-focused, starts at commit, but I don’t think that’s particularly useful. A cycle time in ops of 0 or close to it doesn’t matter at all if there’s an upstream delay. There won’t be anything for ops to deploy in that case.
(2) Deployment frequency. An indirect measure of batch/story size. IME, the interval between consecutive deployments should be a couple of days, or maybe a week on the outside. Every day, or every hour, would be better. Also, I think it would be better to call this “delivery-to-the-customer’s-hands frequency.” Deployment to Ops alone doesn’t work for me because the work is still in the “inventory” state. Work that is done but not delivered is a liability (Lean “inventory”). It’s money spent that is not generating any revenue.
(3) Percentage of failed deploys. This is officially the percentage of deployments that require a hotfix or rollback, but I also think of it as the percentage of delivered stories that do not solve your customer's/user's problems, so they require removal or tweaking. (This does not include deployment to a customer subset to get feedback.)
(4) Mean time to recovery when a failure occurs. This could be a failure in the software at runtime, but to me, delivering software that doesn't delight the customers is also a failure. Note that this metric is NOT the mean time between failures. It measures how fast you recover from a failure, not how often failures occur.
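Here is a rough sketch, in Python, of what computing these four metrics could look like. The Delivery record and its field names are assumptions made purely for illustration; in practice the dates would come from your tracker, CI pipeline, and incident system, and the definitions above (starting at idea discovery, ending in the customer's hands) are what matter, not this particular code.

from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class Delivery:
    idea_discovered: date             # when the problem/idea was identified
    in_customer_hands: date           # when the solution actually reached users
    failed: bool = False              # needed a hotfix/rollback, or missed the mark
    recovered: Optional[date] = None  # when the failure was resolved, if it failed

def lead_times(deliveries):
    # (1) Lead time: idea discovery to the customer's hands, per delivery.
    return [d.in_customer_hands - d.idea_discovered for d in deliveries]

def delivery_frequency(deliveries):
    # (2) Frequency: average gap between consecutive deliveries to customers.
    dates = sorted(d.in_customer_hands for d in deliveries)
    gaps = [later - earlier for earlier, later in zip(dates, dates[1:])]
    return sum(gaps, timedelta()) / len(gaps) if gaps else None

def failed_delivery_rate(deliveries):
    # (3) Fraction of deliveries that failed (hotfix, rollback, or rework).
    return sum(d.failed for d in deliveries) / len(deliveries)

def mean_time_to_recovery(deliveries):
    # (4) Mean time from a failed delivery to its recovery.
    times = [d.recovered - d.in_customer_hands
             for d in deliveries if d.failed and d.recovered]
    return sum(times, timedelta()) / len(times) if times else None

history = [
    Delivery(date(2024, 3, 1), date(2024, 3, 8)),
    Delivery(date(2024, 3, 4), date(2024, 3, 11), failed=True, recovered=date(2024, 3, 12)),
    Delivery(date(2024, 3, 10), date(2024, 3, 14)),
]
print(lead_times(history))             # lead times of 7, 7, and 4 days
print(delivery_frequency(history))     # 3 days between deliveries, on average
print(failed_delivery_rate(history))   # one of three deliveries failed (0.33)
print(mean_time_to_recovery(history))  # 1 day to recover

Note that everything is measured from idea discovery or from the customer's hands, not from commit or from handoff to Ops.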
There are, of course, other metrics that are useful. For example, a high defect density tells us that we’re working much more slowly than necessary (working in “healthy” code increases productivity by as much as nine times).† Work in Progress (WIP) is also useful. A team working on only one thing at a time is 20% faster than one working on two things, and 40% faster than a team working on three things.‡
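To make the WIP arithmetic concrete, here is a tiny sketch. The capacity-lost-to-context-switching percentages follow the figures cited above (often attributed to Weinberg); treat them as illustrative, not as measurements of any particular team.

# Illustrative capacity lost to context switching at different WIP levels.
switching_loss = {1: 0.00, 2: 0.20, 3: 0.40}

for wip, loss in switching_loss.items():
    productive = 1.0 - loss
    print(f"WIP of {wip}: {productive:.0%} of capacity goes to actual work")

# WIP of 1: 100% of capacity goes to actual work
# WIP of 2: 80% of capacity goes to actual work
# WIP of 3: 60% of capacity goes to actual work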
The final and perhaps most important thing to point out is that these metrics measure the system of work as a whole. Typically, measuring the behavior of a single person, or single team, or any subset of the system for that matter, tells you nothing useful.
* If all of this is new to you, read Nicole Forsgren's "Accelerate" and also read pretty much everything on the DORA website.
† “Code Red: The Business Impact of Code Quality—A Quantitative Study of 39 Proprietary Production Codebases” [https://arxiv.org/pdf/2203.04374].
‡ Gerald Weinberg, “Quality Software Management: Systems Thinking (Volume 1)”
Remember that these metrics measure the health of a process but not the health of a product. Not a single one addresses customer outcomes, impact, retention, etc.
Are there precursor metrics for products that are not released? Pre-production metrics?