Mike Bland

Vital Signs Reveal the Matrix

Vital Signs are a collection of signals designed by a team to monitor project and process health and to resolve problems quickly. This is as opposed to so-called performative “data-driven decision making.”


Tags: Making Software Quality Visible, Test Pyramid, technical, testing

This seventeenth post in the Making Software Quality Visible series describes Vital Signs, team-selected signals monitoring the health of a project and its processes. They make quality work and its impact visible throughout the process, as opposed to common “data-driven decision making,” often used to justify avoiding meaningful action.

I’ll update the full Making Software Quality Visible presentation as this series progresses. Feel free to send me feedback, thoughts, or questions via email or by posting them on the LinkedIn announcement corresponding to this post.

Dedication

First things first, though: Today’s post is dedicated to Jimi Hendrix, who left us too soon on this day in 1970.

Sadly, I could find no decent, freely usable images to include in this post. But hopefully Jimi’s image is permanently etched in everyone’s mind’s eyes and ears by now.

Quality Work is like The Matrix

At the end of The Inverted Test Pyramid, I mentioned how we end up with too many larger tests and too few smaller tests:

  • Features prioritized over internal quality/tech debt
  • “Testing like a user would” is more important
  • Reliance on more tools, QA, or infrastructure (Arms Race)
  • Landing more, larger changes at once because “time”
  • Lack of exposure to good examples or effective advocates
  • We tend to focus on what we directly control—and what management cares about! (Groupthink)

And to this last point:

In such high stress situations, it’s human nature to focus on doing what seems directly within our control in order to cope. Alternatively, we tend to prioritize what our management cares about, since they have leverage over our livelihood and career development. It’s hard to break out of a bad situation when feeling cornered—and too easy to succumb to Groupthink without realizing it.

So how do we break out of this corner—or help others to do so?

To answer this, we have to overcome the fundamental challenge of helping people see what internal quality looks like, namely:

Quality work can be hard to see. It’s hard to value what can’t be seen—or to do much of anything about it.

We have to help developers, QA, managers, and executives care about it and resist the Normalization of Deviance and Groupthink. We need to better show our quality work to help one another improve internal quality and break free from the Arms Race mindset.

In other words, internal quality work and its impact is a lot like The Matrix…

“Unfortunately, no one can be told what the Matrix is. You have to see it for yourself.”

—Morpheus, The Matrix

One way to start showing people The Matrix is to get buy-in on a set of…

Vital Signs

…“Vital Signs.” Vital Signs are a collection of signals designed by a team to reflect quality and productivity and to rapidly diagnose and resolve problems.

[Image: a stethoscope representing Vital Signs. Caption: A collection of signals designed by a team to reflect quality and productivity and to rapidly diagnose and resolve problems]

Intent

  • Comprehensive and make sense to the team and all stakeholders.
    They should be comprehensive and make sense at a high level to everyone involved in the project, regardless of role.

  • Not merely metrics, goals, or data
    We’re not collecting them for the sake of saying we collect them, or to hit a goal one time and declare victory.

  • Information for repeated evaluation
    We’re collecting them because we need to evaluate and understand the state of our quality and productivity over time.

  • Inform decisions on whether or not to act in response
    These evaluations will inform decisions regarding how to maintain the health of the system at any moment.

Common elements

Some common signals include:

  • Pass/fail rate of continuous integration system
    The tests should almost always pass; when they do fail, the failures should be meaningful and fixed immediately.

  • Size, build and running time, and stability of small/medium/large test suites
    The faster and more stable the tests, the fewer resources they consume, and the more valuable they are.

  • Size of changes submitted for code review and review completion times
    Individual changes should be relatively small, and thus easier and faster to review.

  • Code coverage from small to medium-small test suites
    Each small-ish test should cover only a few functions or classes, but the overall coverage of the suite should be as high as possible.

  • Passing use cases covered by medium-large to large and manual test suites
    For larger tests, we’re concerned about whether higher level contracts, use cases, or experience factors are clearly defined and satisfied before shipping.

  • Number of outstanding software defects and Mean Time to Resolve
    Tracking outstanding bugs is a very common and important Vital Sign. If you want to take it to the next level, you can also begin to track the Mean Time to Resolve[1] these bugs. The lower the time, the healthier the system.

Most of these specific signals aim to reveal how slow or tight the feedback loops are throughout the development process. (Shoutout again to Simon Stewart and his Dopamine Driven Development presentation.) Even high code coverage from small tests implies that developers can make changes faster and more safely. Well-scoped use cases can lead to more reliable, performant, and useful larger tests.
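To make a couple of these signals concrete, here is a minimal sketch of how a team might compute a CI pass rate and a Mean Time to Resolve. The record shapes and field names are hypothetical, not a prescribed format; the point is only that these signals reduce to simple arithmetic over data most teams already have.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class BuildResult:
    finished_at: datetime
    passed: bool

@dataclass
class Defect:
    opened_at: datetime
    resolved_at: Optional[datetime]  # None while the defect is still open

def ci_pass_rate(builds: list[BuildResult]) -> float:
    """Fraction of CI runs that passed; the closer to 1.0, the healthier."""
    if not builds:
        return 1.0  # no runs recorded yet, so nothing is failing
    return sum(b.passed for b in builds) / len(builds)

def mean_time_to_resolve(defects: list[Defect]) -> Optional[timedelta]:
    """Average open-to-resolved time across closed defects, or None if none closed."""
    durations = [d.resolved_at - d.opened_at for d in defects if d.resolved_at]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)
```

However the raw records are gathered, deriving the signal itself is trivial; the hard part is the team agreeing on what the numbers mean and reviewing them regularly.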

Other potentially meaningful signals

Some other potentially meaningful signals include…

  • Static analysis findings (e.g., complexity, nesting depth, function/class sizes)
    Popular source control platforms, such as GitHub, can incorporate static analysis findings directly into code reviews as well. This encourages developers to address findings before they land in a static analysis platform report.[2] (A rough sketch of one such signal appears after this list.)

  • Dependency fan-out
    Dependencies contribute to system and test complexity, which contribute to build and test times. Cutting unnecessary dependencies and better managing necessary ones can yield immediate, substantial savings.

  • Power, performance, latency
    These user experience signals aren’t caught by traditional automated tests that evaluate logical correctness, but are important to monitor.

  • Anything else the team finds useful for its purposes
    As long as it’s a clear signal that’s meaningful to the team, include it in the Vital Signs portfolio.
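As a small illustration of the static analysis signal mentioned above, here is a rough sketch that uses Python's standard ast module to flag long or deeply nested functions. The thresholds and output format are assumptions a team would tune for itself, and a real static analysis platform does far more; this just shows that even a homegrown signal is within reach.

```python
import ast
import sys

# Assumed thresholds; tune them to whatever your team agrees on.
MAX_LINES = 50
MAX_DEPTH = 4

def nesting_depth(node: ast.AST, depth: int = 0) -> int:
    """Deepest level of nested control-flow blocks beneath this node."""
    nested = (ast.If, ast.For, ast.While, ast.With, ast.Try)
    child_depths = [
        nesting_depth(child, depth + isinstance(child, nested))
        for child in ast.iter_child_nodes(node)
    ]
    return max(child_depths, default=depth)

def report(path: str) -> None:
    """Print every function in the file that exceeds the agreed limits."""
    with open(path, encoding="utf-8") as source:
        tree = ast.parse(source.read(), filename=path)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = (node.end_lineno or node.lineno) - node.lineno + 1
            depth = nesting_depth(node)
            if length > MAX_LINES or depth > MAX_DEPTH:
                print(f"{path}:{node.lineno} {node.name}: "
                      f"{length} lines, nesting depth {depth}")

if __name__ == "__main__":
    for filename in sys.argv[1:]:
        report(filename)
```

Run it against a handful of files each week, and the count of findings becomes one more row in the Vital Signs table.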

Use them much like production telemetry

Treat Vital Signs like you would any production telemetry that you might already have.

  • Keep them current and make sure the team pays attention to them.

  • Clearly define acceptable levels—then achieve and maintain them (a minimal threshold-check sketch follows this list).

  • Identify and respond to anomalies before urgent issues arise.

  • Encourage continuous improvement—to increase productivity and resilience.

  • Use them to tell the story of the system’s health and team culture.
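For example, once acceptable levels are agreed upon, even a tiny check can flag anomalies before they become urgent. The following sketch is hypothetical; the signal names and thresholds are placeholders for whatever your team decides to track.

```python
# Agreed-upon acceptable levels (placeholders; your team defines its own).
ACCEPTABLE = {
    "ci_pass_rate": ("min", 0.95),         # at least 95% of CI runs passing
    "small_test_runtime_s": ("max", 300),  # small suite finishes within 5 minutes
    "open_defects": ("max", 20),
}

def check_vital_signs(current: dict[str, float]) -> list[str]:
    """Return a warning for every signal missing or outside its acceptable level."""
    warnings = []
    for name, (kind, limit) in ACCEPTABLE.items():
        value = current.get(name)
        if value is None:
            warnings.append(f"{name}: no data collected this week")
        elif kind == "min" and value < limit:
            warnings.append(f"{name}: {value} is below the agreed minimum of {limit}")
        elif kind == "max" and value > limit:
            warnings.append(f"{name}: {value} exceeds the agreed maximum of {limit}")
    return warnings

# This week's (made up) numbers, gathered however the team currently can.
for warning in check_vital_signs(
    {"ci_pass_rate": 0.91, "small_test_runtime_s": 180, "open_defects": 27}
):
    print(warning)
```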

Example usage: issues, potential causes (not exhaustive!)

Here are a few hypothetical examples of how Vital Signs can help your team identify and respond to issues.

  • Builds 100% passing, high unit test coverage, but high software defects
    If your builds and code coverage are in good shape, but you’re still finding bugs…

    • Maybe gaps in medium-to-large test coverage, poorly written unit tests
      …it could be that you need more larger tests. Or, it could be your unit tests aren’t as good as you think, executing code for coverage but not rigorously validating the results.
  • Low software defects, but schedule slipping anyway
    If you don’t have many bugs, but productivity still seems to be dragging…

    • Large changes, slow reviews, slow builds+tests, high dependency fan-out
      …maybe people are still sending huge changes to one another for review. Or maybe your build and test times are too slow, possibly due to excess dependencies.
  • Good, stable, fast tests, few software defects, but poor app performance
    Maybe builds and tests are fine, and there are few if any bugs, but the app isn’t passing performance benchmarks.

    • Discover and optimize bottlenecks—easier with great testing already in place!
      In that case, your investment in quality practices has paid off! You can rigorously pursue optimizations, without the fear that you’ll unknowingly break behavior.

Getting started, one small step at a time

Here are a few guidelines for getting started collecting Vital Signs. First and foremost…

  • Don’t get hung up on having the perfect tool or automation first.
    Do not get hung up on thinking you need special tools or automation at the beginning. You may need to put some kind of tool in place if you have no way to get a particular signal. But if you can, collect the information manually for now, instead of wasting time flying blind until someone else writes your dream tool.

  • Start small, collecting what you can with tools at hand, building up over time.
    You also don’t need to collect everything right now. Start collecting what you can, and plan to collect more over time.

  • Focus on one goal at a time: lowest hanging fruit; biggest pain point; etc.
    As for which Vital Signs to start with, that’s totally up to you and your team. You can start with the easiest signals, or the ones focused on your biggest pain points—it doesn’t matter. Decide on a priority and focus on that first.

  • Update a spreadsheet or table every week or so—manually, if necessary.
    If you don’t have an automated collection and reporting system handy, then use a humble spreadsheet or wiki table. Spend a few minutes every week updating it (see the sketch after this list).

  • Observe the early dynamics between the information and team practice.
    Discuss these updates with your team, and see how it begins to shift the conversation—and the team’s behavior.

  • Then make a case for tool/infrastructure investment based on that evidence.
    Once you’ve got evidence of the value of these signals, then you can justify and secure an investment in automation.
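If a plain spreadsheet is where you start, the weekly update really can be a few minutes of manual work. Here is one possible shape for it, sketched in Python; the file name and columns are just examples of what a team might choose, not a required format.

```python
import csv
from datetime import date
from pathlib import Path

# Hypothetical location and columns for the team's weekly Vital Signs log.
VITAL_SIGNS_CSV = Path("vital_signs.csv")
COLUMNS = ["week_of", "ci_pass_rate", "small_test_runtime_s",
           "median_review_hours", "open_defects"]

def record_week(**signals) -> None:
    """Append this week's manually gathered numbers as one row."""
    is_new_file = not VITAL_SIGNS_CSV.exists()
    with VITAL_SIGNS_CSV.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        if is_new_file:
            writer.writeheader()
        writer.writerow({"week_of": date.today().isoformat(), **signals})

# Example weekly entry, typed in by hand after a quick look around.
record_week(ci_pass_rate=0.97, small_test_runtime_s=240,
            median_review_hours=6, open_defects=14)
```

A shared spreadsheet works just as well; the habit of updating and discussing it each week matters far more than the storage format.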

On measuring productivity

(This section is a footnote I added after first publishing this post.)

I’ve seen comments on LinkedIn recently alluding to severely misguided developer productivity measurement guidelines published by McKinsey. I don’t feel the need to go to the source or get embroiled in this controversy. However, it’s been well known to software practitioners for years that attempts to objectively measure software development productivity are fundamentally flawed. Whatever objective quantum of output anyone tries to measure, people will immediately work to game that metric to their benefit.

Consequently, when I speak about using Vital Signs to reflect productivity, I’m not talking about measuring developer productivity directly. The suite of Vital Signs can help a team ensure all the forces acting on the project and their observable consequences are in balance. Often these consequences aren’t in terms of direct output, but in terms of drag: bug counts; build and test running times and failure rates; etc. Once a team has landed on a good set of signals, any imbalance implies a potential negative impact on productivity. No one metric should dominate, making the entire suite practically impossible to game.

Also, by encouraging the team and any project managers and executive sponsors to design the suite together, Vital Signs aim to balance everyone’s practical concerns. Everyone has a legitimate perspective and set of incentives, but everyone needs to communicate clearly with one another to ensure proper team alignment and performance. When members of any particular role try unilaterally to impose measurements on the others, that adversarial relationship will create exploitable gaps, producing failure and suffering.

A concrete example

(This is a continuation of the footnote from the previous section.)

As a concrete example, I can share some of my experience working on a web search team at Google. Our manager maintained a Google Sheets document with several sheets that we reviewed with our Site Reliability Engineering partners during our weekly status meetings. Every week followed roughly this pattern:

  • Review all production alerts and incidents. Decide on what to do about each one, e.g., tune the alert, provision more CPU/RAM/disk resources or service instances, update code.
  • Review last deployment of every system microservice, and determine if any need to be recompiled and rereleased to keep them current.
  • Review list of features under development. Identify which are complete, which are still under development, and what to do if any are blocked.
  • Propose new features or maintenance changes to plan for the upcoming week(s).

Each one of these items would either have or receive at least one owner. Many times an SRE member and a development team member would share ownership of items, particularly those to resolve or prevent production issues.

Though there wasn’t an executive sponsor directly involved per se, we were all working within the Google-wide Objectives and Key Results framework. All of our provisioning, reliability, and feature delivery tasks were chosen with our OKRs in mind. We didn’t explicitly discuss build and test issues in these meetings, either, because we were all constantly very responsive to our continuous build. Plus, everyone took turns on “operator duty,” during which one was production “operator” and “build cop” for the week. The operator responded to any production incidents in partnership with the designated SRE member for that week. They also ensured any continuous build failures or other issues were resolved ASAP, via rollbacks or other fixes.

The key point is that everyone was involved in discussing and deciding upon every issue, feature, and task. Communication was open and constant, and responsibility and decision making were widely shared, while ensuring each task had an owner. There was no hiding or gaming any specific metrics, because everything was perfectly visible and everyone was bought into ensuring constant balance.

On code coverage as a signal, not a goal

(This section appears as a footnote in the original.)

I’m often asked if teams should always achieve 100% code coverage. My response is that one should strive for the highest code coverage possible. This could possibly be 100%, but I wouldn’t worry about going to extreme lengths to get it. It’s better to achieve and maintain 80% or 90% coverage than to spend disproportionate effort to cover the last 10% or 20%.

That said, it’s important to stop looking at code coverage as merely a goal—use it as a signal that conveys important information. Code coverage doesn’t show how well tested the code is, but how much of the code isn’t exercised by small(-ish) tests at all.

So it’s important to understand clearly what makes that last 10% or 20% difficult or impractical to cover—and to decide what to do about it. Is it dead code? Or is it a symptom of poor design—and is refactoring called for? Is there a significant risk to leaving that code uncovered? If not, why keep it?

Another benefit to maintaining high coverage is that it enables continuous refactoring. The Individual skill acquisition section expands on this.

On not waiting for automation before getting started

(This section appears as a footnote in the original.)

I can’t remember where I got the idea, but it’s arguably better to develop a process manually before automating it. In this way, you carefully identify the value in the process, and which parts of it would most benefit from automation. If you start with automation, you’re not starting from experience, and people may resent having to use tools that don’t fit their actual needs. This applies whether you’re building or buying automation tools and infrastructure.

Of course, if you have past experience and existing, available tools, you can hit the ground running more quickly. The point is that it’s wasteful to wait for automation to appear when you could benefit from a process improvement now, even if it’s manual.

On so-called performative “Data-Driven Decision Making”

(This section appears as a footnote in the original.)

I’ve called this concept of collecting signals to inform decision making “Vital Signs” because I believe “data-driven decision making” has lost its meaning. As often happens with initially useful innovations, the term “data-driven decision making” has become a buzzword. It’s a sad consequence of a misquote of W. Edwards Deming, an early pioneer of data-driven decision making, who actually said:

“It is wrong to suppose that if you can’t measure it, you can’t manage it—a costly myth.”

The New Economics, Chapter 2, “The Heavy Losses”

Over time, this got perverted to “If you can’t measure it, you can’t manage it.” (The perversion is likely because people know him as a data advocate, and are ignorant of the subtlety of his views.)

Many who vocally embrace data-driven decision making today tend to put on a performance rather than apply the principle in good faith. They tend to want to let the data do the deciding for them, absolving them of professional responsibility to thoughtfully evaluate opportunities and risks. It’s a ubiquitously accepted Cover Your Ass rationale, a shield offering protection from the expectation of ever having to take any meaningful action at all. It’s also a hammer used to beat down those who would take such action—especially new, experimental action lacking up-front evidence of its value. Too often, “the data shows” that we should do nothing, or do something stupid or unethical. This holds even when other salient, if less quantifiable, signals urge action, or a different course of action.

My favorite metaphor for the CYA function of “data-driven decision making” is “Watermelon Status,” a report that’s green on the outside, red on the inside. (A former colleague, who I believe would wish to remain anonymous, introduced me to this concept.) This is a phenomenon whereby people close to the actual project work report a “red” status, signifying trouble. However, layers of management edit and massage the “data” such that the status appears “green” to higher level management, signifying all is well. That’s what the “decision makers” want to hear, after all.

As such, allegiance to “data-driven decision making” tends to encourage Groupthink and to produce obstacles to meaningful change. On the contrary, “Vital Signs” evokes a sense of care for a living system, and a sense of commitment to ensuring its continued health. It implies we can’t check a box to say we’ve collected the data and can take system quality and health for granted. We have to keep an eye on our system’s Vital Signs, and maintain responsibility for responding to them as required.

My visceral reaction arises from all the experiences I’ve had (using Crossing the Chasm terminology) with Late Majority members lacking courage and Laggards resisting change. I’ll grant that the Late Majority may err on the side of caution, and once they’re won over, they can become a force for good. But Laggards feel threatened by new ideas and try to use data, or the lack thereof, as a weapon. Then when you do produce data and other evidence, they want to move the goalposts.

The Early Majority is a different story altogether. I’ve had great experiences with Early Majority members who were willing to try a new approach to testing and quality, expecting to see results later. Once we made those results visible, it justified further investment. This is why it’s important to find and connect with the Early Majority first, and worry about the Late Majority later—and the Laggards never, really.

Coming up next

The next post in the Making Software Quality Visible series will introduce the next major section, What Software Quality Is and Why It Matters. We’ll define internal software quality and explain why it’s just as essential as external.

I’m going to take a couple weeks off before making that next post, though. If I post anything in the meanwhile, it won’t be software quality related, but related to some electric guitar hacking I’ve gotten into. But as always, if you want to get into the upcoming material today, feel free to check out the main Making Software Quality Visible presentation.

Footnotes

  1. As the linked page explains, the “R” in “MTTR” can also stand for “Repair,” “Recovery,” or “Respond.” However, I like to suggest “Resolve,” because it includes response, repair, recovery, and a full follow through to understand the issue and prevent its recurrence. 

  2. SonarQube is a popular static analysis platform, but I’m partial to Teamscale, as I happen to know several of the CQSE developers who own it. They’re really great at what they do, and are all around great people. They provide hands-on coaching and support to ensure customers are successful with the system, which they’re constantly improving based on feedback. I’ve seen them in action, and they deeply understand that it’s the tool’s job to provide insight that facilitates ongoing conversations.

    (No, they’re not paying me to advertise. I just really like the product and the people behind it.)

    I also like to half-jokingly say Teamscale is like an automated version of me doing your code review—except it scales way better. The more the tool automatically points out code smells and suggests where to refactor, the more efficient and more effective code reviews become.