Making Software Quality Visible

10 Jan 2023

This presentation, Making Software Quality Visible, is my calling card. It describes my approach to driving software quality improvement efforts, especially automated testing, throughout an organization—applying systems thinking and leadership to influence its culture.

If you’d be interested in my help, please review the slides and script below. If my approach aligns with your sense of what your organization needs, please reach out to me at mbland@acm.org. (Also see my Hire me! page for more information.)

Note as of 2023-09-13: I’m currently updating the official version to include new changes as I continue writing the Making Software Quality Visible blog series. The Google Drive copies will remain out of sync with the official version until I’ve finished, at which point I’ll remove this notice.

Slides

Original slides:

Official version: Making Software Quality Visible Keynote presentation
Keynote and PDF copies: Google Drive folder

Custom slides:

Making Software Quality Visible: 2023 DevOps Enterprise Forum

Abstract
Introduction
Agenda
Formative Experiences at Northrop Grumman, Google, and Apple
Skills Development, Alignment, and Visibility
The Test Pyramid and Vital Signs
What Software Quality Is and Why It Matters
Why Software Quality Is Often Unappreciated and Sacrificed
Building a Software Quality Culture
Calls to Action
Acknowledgments
History
TODOs
Footnotes

Abstract

We’ll discuss why internal software quality matters, why it’s often unappreciated and sacrificed, and what we can do to improve it. More to the point, we’ll discuss the importance of instilling a quality culture to promote the proper mindset first. Only on this foundation will seeking better processes, better tools, better metrics, or AI-generated test cases yield outcomes we can live with.

Introduction

I’m Mike Bland, I’m a programmer, and I’m going to talk about how making software quality visible…

Software Quality must be visible to minimize suffering.

…will minimize suffering.¹ By “suffering,” I mean: The common experience of software that’s painful—or even dangerous—to work on or to use.

By “software quality,” I mean: Confidence in the software’s behavior and user experience based on information and understanding. This is opposed to feeling anxious or overconfident about how the software will behave in the absence of information.

The way we produce accessible and useful information that enables understanding is by managing complexity effectively. If the information we need is obscured by complexity, we make bad assumptions and bad choices and suffer as a result. As we’ll see, software complexity is rooted not just in the code itself, but in our culture, which shapes our expectations, communications, and behaviors.²

Finally, by “making software quality visible,” I mean: Providing meaningful insight into quality outcomes and the work necessary to achieve them.

Quality work can be hard to see. It’s hard to value what can’t be seen—or to do much of anything about it.

This is important because it’s often difficult to show quality work, or its impact on processes or products.³ How do we show the value of avoiding problems that didn’t happen? This makes it difficult to prioritize or justify investments in quality, since people rarely value work or results they can’t see. Plus, people can’t effectively solve problems they can’t accurately sense or understand.

Agenda

There’s a big story behind how I arrived at these conclusions, and what I’ve learned about leading others to embrace them as well.

Formative Experiences at Northrop Grumman, Google, and Apple
I’ll share examples of making quality work visible to minimize suffering from my experiences at Northrop Grumman, Google, and Apple.
Skills Development, Alignment, and Visibility
I’ll share some ideas for cultivating individual skill development, team and organizational alignment, and visibility of quality work and its results.
The Test Pyramid and Vital Signs
We’ll use the Test Pyramid model to specify the principles underlying a sound testing strategy. We’ll also discuss the negative effects that its opposite, the Inverted Test Pyramid, imposes upon its unwitting victims. I’ll then describe how to use “Vital Signs” to get a holistic view on software quality for a particular project.
What Software Quality Is and Why It Matters
We’ll define internal software quality and explain why it’s just as essential as external.
Why Software Quality Is Often Unappreciated and Sacrificed
We’ll examine several psychological and cultural factors that are detrimental to software quality.
Building a Software Quality Culture
We’ll learn how to integrate the Quality Mindset into organizational culture, through individual skill development, team alignment, and making quality work visible.
Calls to Action
We’ll finish with specific actions everyone can take to lead themselves and others towards making software quality work and its impact visible.

Formative Experiences at Northrop Grumman, Google, and Apple

The story begins with the fact that programming wasn’t my original plan.

How it started…

From the beginning, I’ve always loved music—so much so that my first attempt at college led me to Berklee College of Music in Boston. It wasn’t just the beauty of the music that was attractive, or what I then believed to be the glamorous lifestyle. It was the idea of being part of a band, part of a team, pushing boundaries and achieving great things together.

Rockin’ my Mom’s shag-tastic living room with a little help from Elvis, 1974; and Berklee College of Music, 1991.

Unfortunately—or fortunately?—the rock star plan didn’t work out, and I fell back on my second love: programming.⁴

Northrop Grumman Mission Systems

Navigation for US Coast Guard vessels and US Navy submarines

My first job as a software developer was working on US Coast Guard and US Navy nuclear submarine navigation systems at Northrop Grumman Mission Systems.

Digital Nautical Chart showing Fort Monroe and the north entrance to the Hampton Roads Bridge-Tunnel in Hampton, Virginia. Adapted from the original linked image from the Digital Nautical Chart® page of the National Geospatial-Intelligence Agency. Note that there are new National Geospatial-Intelligence and Digital Nautical Chart websites, but I couldn’t find the original artifacts on either site.

Coast Guardsmen aboard U.S. Coast Guard Cutter Monomoy (WPB 1326) and U.S. Navy Los Angeles-class submarine, USS San Juan (SSN-751). I don’t know if software I wrote ever ran on these specific ships, but it certainly ran on ships like them.

My “Library”

My colleagues at the time poked fun at me for carrying my “library” of programming and algorithms books with me every day. I was learning what good code looked like by being able to see and understand what it looked like.

Each image shows the edition of the book I used at the time. However, each image links to the most recent edition available at the time of writing. Note that the “Exceptional C++” link is not HTTPS. Original sources for each image: Effective C++; The C++ Programming Language; Modern C++ Design; Introduction to Algorithms; Exceptional C++; Design Patterns.

Then, my team went through a “death march.”

Death March

Experiencing “suffering” in software

I actually never read this book, but I thought showing it here would underscore how common the software death march experience really is. Original image source: Death March cover from Amazon.

A “death march” is a crushing experience of working insane hours to deliver on impossible requirements by an impossible deadline. This usually involves changing a fragile, untested code base that barely works to begin with, which was the case for our project. But somehow, we did it. We met the spec by the deadline, and then were given the freedom to do whatever we could to make the damn thing faster.

During this period of relative calm, I began to learn a few things.

Expectations: Requirements + Assumptions

The reason why poor quality can lead to suffering

I began to learn the relationship between code quality and expectations, which are the sum of requirements and unwritten assumptions.⁵

Requirement	Enumerate chart features
Assumption	In memory size == on disk size
Reality	21 bytes on disk, 24 in memory
Outcome	File size/24 == 12.5% data loss
Impact	Caught before shipping!

Requirement: One day our product owner sent us some code to enumerate nautical chart features from a file.

Assumption: The code assumed each record was the same size in memory as it was on disk.

Reality: However, the records were 21 bytes on disk, but the in memory structs were 24 bytes, thanks to byte padding.

Outcome: As a result, this code ignored one eighth of the chart features in the file. The product owner’s lack of complete understanding led to overconfidence that masked a potentially severe quality issue.

Impact: Fortunately I caught this before it shipped to any nuclear submarines.
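To make the arithmetic concrete, here is a minimal Python sketch using the standard ctypes module. The field layout and the record count are hypothetical, since the actual chart record format isn't shown in this talk. The fields were chosen only so that they sum to 21 bytes while the natively aligned struct occupies 24 bytes on typical platforms.

```python
import ctypes

# Hypothetical record layout (the real chart feature format isn't given here).
# Five 4-byte fields plus one 1-byte field sum to 21 bytes of actual data.
FIELDS = [
    ("feature_id", ctypes.c_uint32),
    ("latitude", ctypes.c_int32),
    ("longitude", ctypes.c_int32),
    ("depth", ctypes.c_int32),
    ("attributes", ctypes.c_uint32),
    ("feature_type", ctypes.c_uint8),
]


class InMemoryRecord(ctypes.Structure):
    """Native layout: padded up to a multiple of the 4-byte alignment."""
    _fields_ = FIELDS


def records_seen(file_size, assumed_record_size):
    """How many records a reader counts if it divides by the given size."""
    return file_size // assumed_record_size


# Size of the data as written to disk, with no padding between records.
disk_size = sum(ctypes.sizeof(field_type) for _, field_type in FIELDS)
# Size of the struct in memory, including trailing padding.
memory_size = ctypes.sizeof(InMemoryRecord)

actual = 8000  # hypothetical number of records in the file
file_size = actual * disk_size
seen = records_seen(file_size, memory_size)
loss = 1 - seen / actual

print(f"{disk_size} bytes on disk, {memory_size} bytes in memory")
print(f"counted {seen} of {actual} records ({loss:.1%} missing)")
```

Dividing the file size by the in-memory size counts only 21/24 of the records, which is where the one eighth comes from. A focused test that compares the counted records against a file with a known number of records would expose the mismatch immediately.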

Not long after that…

Discovering unit testing—by accident

Curiosity and serendipity, not training or culture

I discovered unit testing, completely by accident, from Koss and Langr’s C/C++ Users Journal article “Test Driven Development in C/C++”.

With my library and my newfound testing practice in hand, I rewrote an entire subsystem from the ground up. I’d apply a design principle, or an algorithm, and exercise it with focused, fast, thorough tests every step of the way, rather than waiting for system integration. I could immediately see the results to validate that each part worked as planned, and spend more time developing than debugging.

CppUnit test runner (from Cpp Unit, Windows Edition). I actually used the Qt GUI on Solaris at the time, but couldn’t find an image of it.

In the end, my new subsystem improved performance by a factor of 18x and saved the project. And when a bug or two cropped up, I could investigate, reproduce, fix, validate, and ship a new version the same day—not the next week, or the next month.

I wish I could say I was well trained or mentored, or that the culture encouraged my development. The truth is I was entirely self-motivated. Although my teammates recognized my abilities, they never adopted my practices, despite their obvious impact. I couldn’t figure out why.

So I put my house on the market, quit my job, and met a woman on an online dating site whose referral led to me joining Google in 2005. In that order.

Google: Testing Grouplet

2005-2010, mike-bland.com/the-rainbow-of-death

At Google I joined the Testing Grouplet, a team of volunteers dedicated to driving automated testing adoption. My talk “The Rainbow of Death” tells the five year story of the Grouplet. I’ll give you the fast forwarded version of that talk right meow.⁶

Rapid growth, hiring the “best of the best,” build/test tools not scaling
When I joined in 2005, the company was growing fast,⁷ and we knew we were “the best of the best.” However, our build and testing tools and infrastructure weren’t keeping up.
Lack of widespread, effective automated testing and continuous integration; frequent broken builds and “emergency pushes” (deployments)
Developers weren’t writing nearly enough automated tests, the ones they wrote weren’t that good, and few projects used continuous integration. As a result, code frequently failed to compile, and errors that made it to production would frequently lead to “emergency pushes,” or deployments.
Resistance: “I don’t have time to test,” “My code is too hard to test.”
We kept hearing that people didn’t have time to test, or that their code was too hard to test.
Imposter syndrome, deadline pressure
Underneath that, we realized many suffered from imposter syndrome, while under intense deadline pressure.
(Mostly) smart people who hadn’t seen a different way
Basically, these were (mostly) smart people who didn’t know what they didn’t know, and couldn’t afford to stop and learn everything at once.

We had to identify how to get the right knowledge and practices to spread over time.

Geoffrey A. Moore, Crossing the Chasm, 3rd Edition

Different people embrace change at different times

The “Crossing the Chasm” model from Geoffrey Moore’s book of the same name helps to make sense of our dilemma.⁸ At a high level, it illustrates how different segments of a population respond to a particular innovation.

Innovators and Early Adopters are like-minded seekers, enthusiasts and visionaries who together bring an innovation to the market and lead people to adopt it. I like to lump them together and call them Instigators.
The Early Majority are pragmatists who are open to the new innovation, but require that it be accessible and ready to use before adopting it.
The Late Majority are followers waiting to see whether or not the innovation works for the Early Majority before adopting it.
Laggards are the resisters who feel threatened by the innovation in some way and complain about it the most. They may potentially raise valid concerns, but often they only bluster to rationalize sticking with the status quo.⁹

The Instigators face the challenge of bringing an innovation across The Chasm separating them from the Early Majority, developing what Moore calls The Total Product. Developing the Total Product requires that the Instigators identify and fulfill several needs the Early Majority has in order to facilitate adoption.

As Instigators, the Testing Grouplet focused its energy on connecting with other Instigators and the early Early Majority to deliver the Total Product. We largely ignored the highly vocal Laggards.¹⁰

The Rainbow of Death

mike-bland.com/the-rainbow-of-death

This connection across the chasm isn’t part of the original Chasm model, but one I borrowed from a friend¹¹ and called “The Rainbow of Death.” It helps illustrate those Early Majority needs the Instigators must satisfy. Doing so transforms the Early Majority from being dependent on the Instigators’ expertise into independent experts themselves.

Five years of chaos…

…and one Rainbow to rule them all

I’ll now use the Rainbow of Death to show how the Testing Grouplet eventually brought that Total Product across the Chasm.

[Note that the following steps are animated in the actual presentation, filling in the Rainbow of Death graphic one block at a time.]

Intervene + Empower: Of course there were already teams working to empower developers by improving development tools and infrastructure. But as you can see, there’s still a large gap between delivering tools and helping people use them well.
Mentor: That’s where the Testing Grouplet stepped in.
Inform: We started by training new hires, writing “Codelab” online training modules, hosting Tech Talks, and giving out tons of free books. But we noticed people weren’t necessarily reading those books, or otherwise applying the knowledge we were sharing. So we transformed from a “book club” to an “activist group.”¹²
Inspire: We shared the Google Web Server success story…
Validate: …then distilled that experience into the Test Certified roadmap program. This program consisted of three levels containing several tasks each. It removed friction and pressure for teams by providing a starting point and path for focusing on one improvement at a time.
Mentor: We also offered volunteer “Mentors” to guide teams through the process and celebrate success…
Inspire: …and physical, glowing “build orbs” to monitor their build status.
Intervene: We eventually built the Test Mercenaries internal consulting team to work with more challenging projects on climbing the “TC Ladder.”
Inform: And our biggest hit was our Testing on the Toilet newsletter, appearing weekly in every company bathroom.
Inspire: We eventually focused on getting every team to operate at Test Certified Level Three, whether they were officially enrolled or not.
Inspire: All of this was punctuated by four “Fixits,” companywide events to address “important but not urgent” issues. Our Fixits inspired people to write and fix tests…
Empower: …to adopt new build tools,¹³ and finally…
Empower: …to adopt the Test Automation Platform continuous integration system.

These efforts made quality work and its impact more visible than it had been. This helped people write better tests, adopt better testing practices and strategies, drastically improve build and test times, reduce bugs, and increase productivity. But perhaps the most visible result was scalability of the organization.

Google: Testing Grouplet results

2015, R. Potvin, Why Google Stores Bills. of LoC in a Single Repo

Rachel Potvin presented the following results in her presentation from @Scale 2015, “Why Google Stores Billions of Lines of Code in a Single Repository.” They may seem quaint to Googlers today, but they speak to the Testing Grouplet’s enduring impact five years after the TAP Fixit.

15 million LoC in 250K files changed by humans per week
15K commits by humans, 30K commits by automated systems per day
800K/second peak file requests

Of course, the Testing Grouplet isn’t responsible for all of this; Rachel’s talk describes an entire ecosystem of tools and practices. Even so, she states very clearly that:

“TAP is our automated test infrastructure, without which this model would completely fall apart.” (13m:36s)

Also, it may amuse you to know that Testing on the Toilet, started in 2006, continues to this day!¹⁴

original image source
This isn’t a recent episode (it’s from around 2007, as I recall), but it’s from my Testing on the Toilet blog post. I also just happened to write this one, which touches on Test Certified and the Test Mercenaries as well.

One more time…

After the Testing Grouplet succeeded, I worked on websearch indexing for a couple of years. Then I burned out, dropped out, and tried the music thing again. It obviously didn’t really work out, or I wouldn’t be speaking to you now.

Berklee College of Music, 2013.

Apple’s goto fail

Finding More Than One Worm in the Apple, CACM, July 2014

My descent back into the tech industry began in February 2014, thanks to Apple’s famous “goto fail” bug.

Requirement	Apply algorithm multiple times
Assumption	Short algorithms safe to copy
Reality	Copies may not stay identical
Outcome	One of six copies had a bug
Impact	Billions of devices

Requirement: Apple had to update part of its open source Secure Transport component which applied the same algorithm in six places.

Assumption: The developers apparently assumed that this short, ten line algorithm was safe to copy in its entirety, instead of making it a function.

Reality: One problem with duplication is that the copies may not remain identical.

Outcome: As it so happened, one of the six copies of this algorithm picked up an extra “goto” statement that short circuited a security handshake.

Impact: Once it was discovered and patched, Apple had to push an emergency update to billions of devices. It’s unknown whether it was ever exploited.

The complexity produced by copying and pasting nearly-but-not-quite-identical code yielded poor quality that masked a horrific defect. My article “Finding More Than One Worm in the Apple” explains how this bug could’ve been caught, or prevented, by a unit test.
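As a rough illustration of that point (this is not Apple's code, and it's in Python rather than C), the sketch below puts a repeated multi-step check in a single function so that one small test covers every caller. The names handshake_digest and verify_key_exchange are hypothetical, and a plain SHA-256 digest stands in for the real signature verification.

```python
import hashlib
import hmac


def handshake_digest(client_random: bytes, server_random: bytes,
                     params: bytes) -> bytes:
    """Hash every handshake input, in order, in exactly one place."""
    digest = hashlib.sha256()
    for part in (client_random, server_random, params):
        digest.update(part)
    return digest.digest()


def verify_key_exchange(client_random: bytes, server_random: bytes,
                        params: bytes, expected_digest: bytes) -> bool:
    """Return True only if expected_digest covers all three inputs."""
    actual = handshake_digest(client_random, server_random, params)
    return hmac.compare_digest(actual, expected_digest)


def test_accepts_untampered_params():
    expected = handshake_digest(b"client", b"server", b"params")
    assert verify_key_exchange(b"client", b"server", b"params", expected)


def test_rejects_tampered_params():
    # A check that skips its final step would wrongly accept this input.
    expected = handshake_digest(b"client", b"server", b"params")
    assert not verify_key_exchange(b"client", b"server", b"tampered", expected)


test_accepts_untampered_params()
test_rejects_tampered_params()
print("all handshake tests passed")
```

With one function instead of six copies, there is only one place for a stray statement to land, and the rejection test fails loudly if the final comparison is ever skipped.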

OpenSSL’s Heartbleed

Goto Fail, Heartbleed, and Unit Testing Culture, May 2014

Requirement	Echo message from request
Assumption	User-supplied length is valid
Reality	Actual message may be empty
Outcome	Server returns arbitrary data
Impact	Countless HTTPS servers

Requirement: In April 2014, OpenSSL had to update its “heartbeat” feature, which echoed a message supplied by a user request.

Assumption: The code assumed that the user supplied message length matched the actual message length.

Reality: In fact, the message could be completely empty.

Outcome: In that case, the server would hand back as many bytes of its own memory as the user requested, including secret key data.

Impact: Countless HTTPS servers had to be patched. It’s unknown whether it was ever exploited.

My article “Goto Fail, Heartbleed, and Unit Testing Culture” explains how this bug could’ve been caught, or prevented, by a unit test. It describes how the absence of a rigorous testing culture allowed a fundamentally flawed assumption to endanger the privacy and safety of millions. It also shows how to challenge such fundamental assumptions and to prevent them from compromising complex systems through unit testing specifically.
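The flawed assumption can be challenged with a very small test. The following Python sketch is an illustration only, not OpenSSL's code: build_heartbeat_response is a hypothetical function, and because Python slicing can't read past the end of a buffer, it models only the length check and its tests, not the memory disclosure itself.

```python
from typing import Optional


def build_heartbeat_response(claimed_length: int,
                             payload: bytes) -> Optional[bytes]:
    """Echo the payload, or return None to discard a malformed request.

    The claimed length comes from the requester, so it is checked against
    the number of payload bytes actually received before anything is copied.
    """
    if claimed_length < 0 or claimed_length > len(payload):
        return None  # never trust a length larger than the data received
    return payload[:claimed_length]


def test_echoes_payload_when_length_matches():
    assert build_heartbeat_response(5, b"hello") == b"hello"


def test_discards_request_with_empty_payload_and_large_claimed_length():
    # The Heartbleed-style request: no payload, maximum claimed length.
    assert build_heartbeat_response(65535, b"") is None


test_echoes_payload_when_length_matches()
test_discards_request_with_empty_payload_and_large_claimed_length()
print("all heartbeat tests passed")
```

The second test encodes the question the original code never asked: what happens when the claimed length and the actual length disagree?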

Apple: Quality Culture Initiative

2018-present

Shortly after that, I was lured back into technology and eventually ended up at Apple in November 2018, which I left in November 2022. At Apple, I joined forces with a few others¹⁵ to start the Quality Culture Initiative, another volunteer group inspired by the Testing Grouplet.

Rapid growth, hiring the “best of the best,” build/test tools not scaling
When I joined Apple in 2018, the company was growing fast, and we knew we were “the best of the best.” However, our build and testing tools and infrastructure weren’t keeping up.
Widespread automated and manual testing, but…
There was a strong testing culture, but not around unit testing.
“Testing like a user would” often considered most important
With so much emphasis on the end user experience, many believed that “testing like a user would” was the most important kind of testing.
Tests often large, UI-driven, expensive, slow, flaky, and ineffective
As a result, most tests were user interface driven, requiring full application and system builds on real devices. Since writing smaller tests wasn’t often considered, this led to a proliferation of large, expensive, slow, unreliable, and ineffective tests, generating waste and risk.
“We’re the best” syndrome, deadline pressure
Rather than imposter syndrome, there was a strong sense that we were already the best.
(Mostly) smart people who hadn’t seen a different way
This led to a lot of (mostly) smart people suffering because not enough of them even knew that better methods of improving quality existed.

The End of the Rainbow

Too much of a good thing, way too soon

In the beginning, I made the mistake of thinking the Rainbow of Death could help us accelerate adoption. I kept trying to use it as an answer key. I’d expect these “smart people” to “get it,” shortcut the exploration phase, get straight to implementation, and shave years off the process. However, I eventually realized that it’s too complicated a device to apply at the beginning of the change process.¹⁶

Instigating Culture Change

Essential needs an internal community must support

So instead, we focused on these essential needs to simplify our initial efforts:¹⁷

Individual Skill Development
Team/Organizational Alignment
Quality Work/Results Visibility

Each part of the cycle gains momentum from the others,¹⁸ but it’s important to focus on completing one effort before launching the next.

Focus and Simplify

Build a solid foundation before launching multiple programs

Looking back, this is how the Testing Grouplet built up its efforts one step at a time, over time. We did try many things, some in parallel, but we tended to establish one major program at a time before focusing on establishing another.¹⁹

At Apple, I started off trying to use the Rainbow of Death to get too many projects started at once. We didn’t make much progress for about a year.²⁰ Once I realized my mistake and confessed it to the Quality Culture Initiative, everyone agreed that we needed to focus and simplify our efforts.

Skill Development: Complete training curriculum and volunteer training staff
First, we launched a complete training curriculum with an all-volunteer training staff.
Alignment/Visibility: Internal podcast focused on producing regular episodes
Our internal podcast team then got serious about publishing episodes more regularly.
Alignment/Visibility: Quality Quest in one org, then spread to others via QCI
While I was focused on those programs, another core QCI member established the QCI’s version of Test Certified, Quality Quest, in his organization. We then merged Quality Quest back into the QCI mainstream, allowing it to spread to other organizations.

After that, we began experimenting again with other projects, some sticking, some not so much. Whenever a project seemed to stall, we’d invoke our “focus and simplify” mantra and pour that focus into more productive areas.

Apple: Quality Culture Initiative results

QCI activity as of November 2022—internal results confidential

It’s too early for the QCI to declare victory, and specific results to date are confidential. However, I can broadly describe the state of the QCI’s efforts by the time I left Apple in November 2022.

Training: 16 courses, ~40 volunteer trainers, ~360 sessions, ~6100 check-ins, ~3200 unique individuals
Our training program was wildly successful, with sixteen courses and dozens of volunteer trainers helping thousands of attendees improve their coding and testing capabilities.
Internal podcast: 45 episodes and 500+ subscribers
Our podcast series gave a voice to people of various roles from various organizations, helping drive a rich software quality conversation across Apple.
Quality Quest roadmap: ~80 teams, ~20 volunteer guides
Our Quality Quest roadmap, directly inspired by Test Certified, started helping teams across Apple improve their quality practices and outcomes.
QCI Ambassadors: 6 organizations started, 6 on the way
QCI Ambassadors now use these resources to help their organizations apply QCI principles, practices, and programs to achieve their quality goals.
QCI Roadshow: over 50 presentations
The QCI Roadshow helps introduce QCI concepts and programs directly to groups across the company.
QCI Summit: ~50 recorded sessions, ~60 presenters, ~850 virtual attendees
Our QCI Summit event recruited presenters from across Apple to make their quality work and impact visible. We saw how QCI principles applied to operating systems and services, applications, frontends and backends, machine learning, internal IT, and development infrastructure.

What’s in a name?

What we realized three years after choosing it

One nice feature about the name “Quality Culture Initiative” that we didn’t realize for three years was how it encoded the total Software Quality solution:

Quality is the outcome we’re working to achieve, but as I’ll explain, achieving lasting improvements requires influencing the…
Culture. Culture, however, is the result of complex interactions between individuals over time. Any effective attempt at influencing culture rests upon systems thinking, followed by taking…
Initiative to secure widespread buy-in for systemic changes. Selling a vision for systemic improvement and supporting people in pursuit of that vision requires leadership.

There’s a lot to unpack when it comes to leading a culture change to make software quality visible.

If you end up itching for more information than I can provide during our time today…

For a lot more detail:
mike-bland.com/making-software-quality-visible

…follow this link to view the slides and script. Both have extensive footnotes that go into a lot more detail.

Skills Development, Alignment, and Visibility

The objective of individual skill development, team and organizational alignment, and visibility of quality work and its results is to help everyone make better choices.

Quality begins with individual choices.

Quality begins with the choices each of us make as individuals throughout our day.

Awareness of principles and practices improves our choices.

Awareness of sound quality principles and practices improves the quality of these choices.

Common language makes principles and practices visible—improving everyone’s choices.

Developing a common language makes these principles and practices visible, so we can show them to one another, helping raise everyone’s quality game.

Individual Skill Acquisition

Help individuals incorporate principles, practices, language

Therefore, helping individuals acquire new knowledge, language, and skills is essential to improving software quality.

Training, documentation, internal media, mentorship, sharing examples
We can offer training, documentation, and other internal media to spread awareness. We can also offer direct mentorship or share examples from our own experience to help others learn.

While QA testers, project managers, executives, and other stakeholders should be included in this process, my specialty is focusing on developers. Straightforward programming errors comprise the vast majority of quality issues; they’re also generally the most preventable and easily fixable. Helping developers write high quality code and tests is the fastest, cheapest, and most sustainable way to catch, resolve, and even prevent most programming errors.

Here I’ll summarize a few key principles and practices to help developers do exactly that.

Testable code/architecture is maintainable—tests add design pressure, enable continuous refactoring; use code coverage as a tool, not a goal
Designing code for testability, given proper guidance on principles and techniques, adds design pressure that yields higher quality code in general. Having good tests then enables constant improvements to code quality through continuous refactoring, instead of stopping the world for complex, risky overhauls or rewrites.²¹

Good tests also enable developers to use code coverage as a tool while refactoring, helping ensure new and improved code replaces the previous code.²²
Stop copy/pasting code; send several small code reviews, not one big one
Two common habits that contribute to worse code quality are duplicating code²³ and submitting large changes for review.²⁴ These habits make code difficult to read, test, review, and understand, which hides bugs and makes them difficult to find and fix after they’ve shipped. Helping people write testable code also helps people break these costly bad habits.
Tests should be designed to fail: naming and organization can clarify intent and cause of failure; use Arrange-Act-Assert (a.k.a. Given-When-Then)
The goal of testing isn’t to make sure tests always pass no matter what. The goal is to write tests that let us know, reliably and accurately, when our expectations of the code’s behavior differ from reality. Therefore, we should apply as much care to the design, naming, and organization of our tests as we do to our production code.²⁵

Merely showing people the immediately graspable Arrange-Act-Assert (or Given-When-Then) pattern can be a profound revelation that changes their perspective forever.
Interfaces/seams enable composition, dependency breaking w/ test doubles
Of course, many of us start out in legacy code bases with few tests, if any.²⁶ So we also need to teach how to make safe changes to existing code that enable us to begin improving code quality and adding tests. Michael Feathers’s Working Effectively with Legacy Code is the seminal tome on this subject, showing how to gently break dependencies to introduce seams. “Seams” are points at which we introduce abstract interfaces that enable test doubles to stand in for our dependencies, making tests faster and more reliable.²⁷
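A minimal Python sketch of two of these ideas together: a seam with a test double, exercised by tests laid out as Arrange-Act-Assert. Every name here (Clock, SystemClock, FixedClock, greeting) is hypothetical and chosen only for illustration.

```python
import datetime
from typing import Protocol


class Clock(Protocol):
    """The seam: an abstract interface the production code depends on."""

    def now_hour(self) -> int: ...


class SystemClock:
    """Production implementation backed by the real system time."""

    def now_hour(self) -> int:
        return datetime.datetime.now().hour


class FixedClock:
    """Test double: always reports the hour it was constructed with."""

    def __init__(self, hour: int) -> None:
        self._hour = hour

    def now_hour(self) -> int:
        return self._hour


def greeting(clock: Clock) -> str:
    """Code under test; it never touches the system time directly."""
    return "Good morning" if clock.now_hour() < 12 else "Good afternoon"


def test_greeting_says_good_morning_before_noon():
    # Arrange
    clock = FixedClock(hour=9)
    # Act
    result = greeting(clock)
    # Assert
    assert result == "Good morning"


def test_greeting_says_good_afternoon_from_noon_onward():
    # Arrange
    clock = FixedClock(hour=12)
    # Act
    result = greeting(clock)
    # Assert
    assert result == "Good afternoon"


test_greeting_says_good_morning_before_noon()
test_greeting_says_good_afternoon_from_noon_onward()
print("all greeting tests passed")
```

Because greeting depends on the Clock interface rather than on the system time, the tests are fast and give the same result at any time of day. The test names state the expectation, so a failure report explains itself.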

Speaking of interfaces, Scott Meyers, of Effective C++ fame, gave perhaps the best design advice of all for writing testable, maintainable, understandable code in general:

“Make interfaces easy to use correctly and hard to use incorrectly.”

—Scott Meyers, The Most Important Design Guideline?

To propose a slight update to make it more concrete:

“Make interfaces easy to use correctly and hard to use incorrectly—like an electrical outlet.”

—With apologies to Scott Meyers, The Most Important Design Guideline?

Image of a 120V 15A electrical outlet representing what Scott Meyers calls “The Most Important Design Guideline”: Make interfaces easy to use correctly and hard to use incorrectly. I took this picture of an electrical outlet myself.

Of course, it’s not impossible to misuse an electrical outlet, but it’s a common, wildly successful example that people use correctly most of the time.²⁸ Making software as easy to use or change correctly, and as hard to use or change incorrectly, may not always be possible, but we can always try.
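One way to apply the guideline in code is to let types and keyword-only arguments play the role of the outlet’s shape. The following is only a sketch under assumed names (`Unit`, `Temperature`, `set_thermostat` are all hypothetical): a caller cannot pass a bare number without saying which unit it is in, and cannot pass arguments without naming them.

```python
from dataclasses import dataclass
from enum import Enum


class Unit(Enum):
    CELSIUS = "C"
    FAHRENHEIT = "F"


@dataclass(frozen=True)
class Temperature:
    """A value that always carries its unit, so units can't be mixed up silently."""
    degrees: float
    unit: Unit

    def __post_init__(self) -> None:
        # Reject look-alike inputs such as the string "C" right away.
        if not isinstance(self.unit, Unit):
            raise TypeError("unit must be a Unit, e.g. Unit.CELSIUS")

    def to_celsius(self) -> "Temperature":
        if self.unit is Unit.CELSIUS:
            return self
        return Temperature((self.degrees - 32) * 5 / 9, Unit.CELSIUS)


def set_thermostat(*, target: Temperature) -> float:
    """Keyword-only parameter: callers must name what they are passing.

    Returns the target in degrees Celsius (a stand-in for real hardware control).
    """
    return target.to_celsius().degrees


celsius = set_thermostat(target=Temperature(212, Unit.FAHRENHEIT))  # 100.0
```

Calling `set_thermostat(72)` or `Temperature(20, "C")` raises a `TypeError` immediately instead of producing a subtly wrong setting later.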

Team/Organizational Alignment

Get everyone speaking the same language

Living up to that standard is a lot easier when the people you work with also consider it a priority.²⁹ Here’s how to create the cultural space necessary for people to successfully apply the new insights, skills, and language we’ve discussed.

Internal media, roadmap programs, presentations, advocates
Use internal media like blogs and newsletters to start a conversation around software quality and to start developing a common language. Roadmap programs create a framework for that conversation by outlining specific improvements teams can adopt. Team and organizational presentations can rely on the quality language and roadmap to inspire an audience and make the concepts more memorable. Software quality advocates can then use all these mechanisms to drive progress.
Understanding between devs, QA, project managers/owners, executives
Articulate how everyone plays a role in improving software quality, and get them all communicating with one another! An executive sponsor or project manager may not need to understand the fine details of dependency injection and test doubles. However, if they understand the Test Pyramid, they can hold developers and QA accountable for improving quality by implementing a balanced, reliable, efficient testing strategy.
Focus and simplify! (Don’t swallow the elephant—leave some for later!)
This is a lot of work that will take a long time. Rather than getting overwhelmed or spreading oneself too thin trying to swallow the entire elephant, focus and simplify by delivering one piece at a time.
Every earlier success lays a foundation and creates space for future effort
Delivering the first piece creates more space to deliver the second piece, then the third, and so on.
Be agile—make plans, but recalibrate often
Of course, this process need not be strictly linear. It’s important to be clear about priorities and delivering pieces over time, but make adjustments as everyone gains experience and the conversation unfolds.³⁰
Absorb influences like a musician/band—then create your own voice/style
Ultimately the process is a lot like helping one another grow as musicians. It’s not about everyone doing exactly as they’re told. Everyone should be absorbing ideas, trying them out, gaining experience with them, and ultimately making them their own, part of their individual style. Then everyone can share what they’ve learned and discovered, enriching all of us further by adding their own voice to the ongoing conversation.

Roadmap Programs

Guidelines, language, conversation framework, examples

Let’s examine the value of roadmap programs more closely. They can provide the language and conversational framework for the entire program, as well as other powerful features.

Define beginning, middle, and end—break mental barrier of where to start
One of the most important features a roadmap provides is helping teams focus on getting started. It can help overcome the mental barrier of not knowing where to begin by helping to visualize the beginning, middle, and end of the journey. For this reason, I recommend organizing roadmaps into three phases, or “levels,” with four or five steps each. This helps break “analysis paralysis” by narrowing the options at each stage, and providing a rough order in which to implement them.
Align dev, QA, project management, management, executives
A roadmap can help produce alignment across the various stakeholders in a project, making clear what will be done, by whom, and in what order. Shared language and collective visibility encourage common understanding, open communication, and accountability.
Recommend common solutions, but don’t force a prescription
A good roadmap won’t force specific solutions on every team, but will provide clear guidelines and concrete recommendations. Teams are free to find their own way to satisfy roadmap requirements, but most teams stand to benefit from recommendations based on others’ experience.
Give space for conversation and experience to shape the way
The point of a roadmap isn’t to guarantee a certain outcome, or to constrain variations or growth. It’s to help teams communicate their intentions, coordinate their efforts, and adjust as necessary based on what they learn together throughout the process.
Provide a framework to make effort and results visible, including Vital Signs
A roadmap helps teams focus, align, communicate, learn, and accomplish shared goals by making software quality work and its impact visible. It helps people talk about quality by giving it a local habitation and a name. Then it provides guidelines and recommendations on implementing Vital Signs that make quality efforts and outcomes as tangible as developing and shipping features.
Can help other teams learn by example and follow the same path
Finally, a roadmap that makes software quality work and its results visible to one team can also make it visible to others. This visibility can inspire other teams to follow the same roadmap and learn from one another’s example. Once a critical mass of teams adopts a common roadmap, though the details may differ from team to team, the broader organizational culture evolves.

To continue the musical metaphor, roadmaps should act mainly as lead sheets that outline a tune’s melody and chords, not as note for note transcriptions. They provide a structure for exploring a creative space in harmony with other players, but leave a lot of room for interpretation and creativity. At the same time, studying transcriptions and recordings to learn the details of what worked for others is important to developing one’s own creativity.

Test Certified Roadmap

Foundation for QCI’s Quality Quest

The Testing Grouplet’s Test Certified roadmap program was extremely effective at driving automated testing adoption and software quality improvements at Google. This image shows the program exactly as it was from about 2007 to 2011. It was small enough to fit comfortably, with further explanation, in a single Testing on the Toilet episode.

We designed the program to fit well with Google’s Objectives and Key Results (OKRs). It also sparked a very effective partnership with QA staff, who championed Test Certified as a means of collaborating more effectively with their client teams.³¹

The three levels are:

Set Up Measurements and Automation: These tasks are focused on setting up build and test infrastructure to provide visibility and control, and are fairly quick to accomplish.
Establish Policies (and early goals): With the infrastructure in place, the team can commit to a review, test, and submission policy to begin making improvements. Early goals, reachable within a month or two, provide just enough pressure to motivate a team-wide commitment to the policy.
Reach (long-term) Goals: On this foundation of infrastructure, policy, and early wins, the team is in a position to stretch for longer term goals. These goals typically take six months or more to achieve.

Quality Quest

…and Roadmap development

At Apple, I literally copied these criteria into a wiki page as the first draft of Quality Quest.³² I then asked QCI members what we’d need to change to fit Apple’s needs, and we landed upon:

More integration with QA/manual/system testing, with a focus on striking a good balance with smaller tests.
No specific percentages of test sizes, which were the most hotly debated part of Test Certified (other than the name). Instead, we specified that teams should set their own goals in Levels Two and Three.
“Vital Signs” as a Level Three component, to encourage conversation and collaboration based on a suite of quality signals designed by all project stakeholders.

These programs did take time to design, to try out with early adopters, to incorporate feedback into, and to start spreading further. However, the time it took was a feature, not a bug. We were securing buy-in, avoiding potential resistance from people feeling that we were coercing them to conform to requirements they neither agreed with nor understood.

Also, while Quality Quest’s structure and goals were identical to Test Certified’s, we took liberties to adapt the program to our current situation. Much like writing a software program or system, we took the same basic structure and concepts and adapted them to our specific needs. Or, we took a good song, made our own arrangement, and sang it in our own voice. Conversation, collaboration, patience, and persistence are essential to the process of developing an effective, sustainable, and successful in-house improvement program.

Quality Work and Results Visibility

Storytelling is essential to spreading language, leading change.

Great storytelling is essential to providing meaningful insight into quality outcomes and the work necessary to achieve them.³³ This can happen throughout the process, but sharing outcomes, methods, and lessons learned is critical to driving adoption of improved practices and making them stick.

Media, roadmaps, presentations, events
In fact, good stories can drive alignment via the same media we discussed earlier. Organizing a special event every so often can generate a critical mass of focus and energy towards sharing stories from across the company. Such events can raise the companywide software quality improvement effort to a new plateau. They also help prove that common principles and practices apply across projects, no matter the tech stack or domain, refuting the Snowflake Fallacy.
Make a strong point with a strong narrative arc
The key to telling a good story is adhering to a strong narrative arc.³⁴ Here are three essential elements:
- Show the results up front—share your Vital Signs!
  First, don’t bury the lede!³⁵ We’re not trying to hook people on solving a mystery, we’re trying to hook people on the value of what we’re about to share. This holds whether you’ve already achieved compelling outcomes or if you’re still in the middle of the story and haven’t yet achieved your goals. In the latter case, you can still paint a compelling picture of what you’re trying to achieve, and why.
  
  Either way, having meaningful Vital Signs in place can make telling this part of the story relatively straightforward.
- Describe the work done to achieve them, and why
  Next, tell them what you had to do (or are trying to do now) to achieve these outcomes and what you learned while doing it. Don’t just give a laundry list of details, however.
  - Practices need principles! The mindset is more portable than the details.
    Practices need principles.³⁶ Help people understand why you applied specific practices—show how they demonstrate the mindset³⁷ that’s ultimately necessary to improve software quality. Technical details can be useful to make the principles concrete, but they’re ultimately of secondary importance to having the right mindset regardless of the technology.
- Make a call to action to apply the information
  Finally, give people something to do with all the information you just shared. Tell them how they can follow up with you or others, via email or Slack or whatever. Provide links to documentation or other resources where they can learn more about how to apply the same tools and methods on their own.

Building a Software Quality Culture

Cultivating resources to support buy-in

Resources	Skills	Alignment	Visibility
Training	✅
Documentation	✅
Internal media e.g., blogs, newsletters	✅	✅	✅
Roadmap program		✅	✅
Vision/strategy presentations		✅	✅
Mentors/advocates	✅	✅	✅
Internal events	✅	✅	✅

Here we can see how different resources serve to fulfill one or more of the essential needs for organizational change. There’s no specific order in which to build up these resources—it’s up to you to decide where to focus at each stage in your journey.

Mapping the Testing Grouplet and Quality Culture Initiative activities onto this table reveals how the same basic resources apply across vastly different company cultures.

Google Testing Grouplet

2005-2010

Resources	Examples
Training	Noogler (New Googler) Training, Codelabs
Documentation	Internal wiki
Internal media e.g., blogs, newsletters	Testing on the Toilet
Roadmap program	Test Certified
Vision/strategy presentations	Google Web Server story
Mentors/advocates	Test Mercenaries
Internal events	Two Testing Fixits, Revolution Fixit, TAP Fixit

At Google, we provided introductory unit testing training and Codelab sessions to new employees, or “Nooglers.” We made extensive use of the internal wiki, and of course Testing on the Toilet was our breakthrough documentation hit. TotT helped people to participate in the Test Certified program, which was based on the experience of the Google Web Server team. Then we scaled up our efforts by building the Test Mercenaries team and hosting four companywide Fixits over the years.

Apple Quality Culture Initiative

2018-present

Resources	Examples
Training	16-course curriculum for dev, QA, Project Managers
Documentation	Confluence
Internal media e.g., blogs, newsletters	Quality Blog, internal podcast
Roadmap program	Quality Quest
Vision/strategy presentations	QCI Roadshow, official internal presentation series
Mentors/advocates	QCI Ambassadors
Internal events	QCI Summit

At Apple, we knew that posting flyers in Apple Park bathrooms wouldn’t fly, but our extensive Training curriculum was wildly successful. We also made extensive use of our internal Confluence wiki, maintained a Quality Blog, and had a ton of fun producing our own internal podcast. Quality Quest was directly inspired by Test Certified, but adapted by the QCI community to better serve Apple’s needs.³⁸ We promoted our resources via dozens of QCI Roadshow presentations for specific teams and groups, as well as a few official, high-visibility internal presentations. We recruited QCI Ambassadors from different organizations to help translate general QCI resources and principles to fit the needs of specific orgs. Finally, we organized a QCI Summit to promote software quality stories from across the company, demonstrating how the Quality Mindset applies regardless of domain.

This comparison raises an important point that I’ve made in response to a common question:

“What are the important differences between companies?”

People usually ask, “What are the differences between companies?” with the assumption or implication that the differences are of key importance. Reflecting upon this just before leaving Apple, I realized…

The superficial details may differ…

…of course there are obvious differences. Google’s internal culture was much more open by default, and people back in the day had twenty percent of their time to experiment internally. Apple’s internal culture isn’t quite as open, and people are held accountable to tight deadlines. Even so…

…but they’re more alike than different.

…the companies are more alike than they might first seem. Both are large organizations composed of the same stuff, namely humans striving for both individual and collective achievement. Much like code from different projects, at the molecular level, they’re more alike than they are different.

Over time, I’ve come to appreciate these similarities as being ultimately more important than the differences. The same essential issues emerge, and the same essential solutions apply, differing only in their surface appearances. So no matter what project you’re on, or what company you work for, everybody everywhere is dealing with the same core problems. Not even the biggest of companies is immune, or otherwise special or perfect.

The Test Pyramid and Vital Signs

Two important concepts for making software quality itself actually visible at a fundamental level are The Test Pyramid and Vital Signs.

First, let’s understand the specific problems we intend to solve by making software quality visible and improving it in an efficient, sustainable way.

Working back from the desired experience

Inspired by Steve Jobs Insult Response

In this famous Steve Jobs video, he explains the need to work backward from the customer experience, not forward from the technology. So let’s compare the experience we want ourselves and others to have with our software to the experience many of us may have today.

What we want	What we have
Delight	Suffering
Efficiency	Waste
Confidence	Risk
Clarity	Complexity

What we want
We want to experience Delight from using and working on high quality software,³⁹ which largely results from the Efficiency high quality software enables. Efficiency comes from the Confidence that the software is in good shape, which arises from the Clarity the developers have about system behavior.
What we have
However, we often experience Suffering from using or working on low quality software, reflecting a Waste of excess time and energy spent dealing with it. This Waste is the result of unmanaged Risk leading to lots of bugs and unplanned work. Bugs, unplanned work, Risk, and fear take over when the system’s Complexity makes it difficult for developers to fully understand the effect of new changes.

Difficulty in understanding changes produces drag, i.e., technical debt.

Difficulty in understanding how new changes could affect the system is the telltale sign of low internal quality, which drags down overall quality and productivity. The difference between actual and potential productivity, relative to internal quality, is technical debt.

This contributes to the common scenario of a crisis emerging…

Replace heroics with a Chain Reaction!

…that requires technical heroics and personal sacrifice to avert catastrophe. To get a handle on avoiding such situations, we need to create the conditions for a positive Chain Reaction.

By creating and maintaining the right conditions over time, we can achieve our desired outcomes without stress and heroics.

The main obstacle to replacing heroics with a Chain Reaction isn’t technology…

The challenge is belief—not technology

A little awareness goes a long way

…it’s an absence of awareness or belief that a better way exists.⁴⁰,⁴¹

Many of these problems have been solved for decades
Despite the fact that many quality and testing problems have been solved for decades…⁴²
Many just haven’t seen the solutions, or seen them done well…
…many still haven’t seen those solutions, or seen them done well.⁴³
The right way can seem easy and obvious—after someone shows you!
The good news is that these solutions can seem easy and obvious—after they’ve been clearly explained and demonstrated.⁴⁴
What does the right way look like?
So how do we get started showing people what the right way to improve software quality looks like?

The Test Pyramid

A balance of tests of different sizes for different purposes

We’ll start with the Test Pyramid model,⁴⁵,⁴⁶ which represents a balance of tests of different sizes for different purposes.

Realizing that tests can come in more than one size is often a major revelation to people who haven’t yet been exposed to the concept.⁴⁷ It’s not a perfect model—no model is—but it’s an effective tool for pulling people into a productive conversation about testing strategies for the first time.⁴⁸

(The same information as above, but in a scrollable HTML table:)

Size	Scope	Ownership	Code visibility	Dependencies	Control/ Reliability/ Independence	Resource usage/ Maint. cost	Speed/ Feedback loop	Confidence
Large (System, E2E)	Entire system	QA, some developers	Details not visible	All	Low	High	Slow	Entire system
Medium (Integration)	Components, services	Developers, some QA	Some details visible	As few as possible	Medium	Medium	Faster	Contract between components
Small (Unit)	Functions, classes	Developers	All details visible	Few to none	High	Low	Fastest	Low level details, individual changes

The Test Pyramid helps us understand how different kinds of tests give us confidence in different levels and properties of the system.⁴⁹ It can also help us break the habit of writing large, expensive, flaky tests by default.⁵⁰

Small tests are unit tests that validate only a few functions or classes at a time with very few dependencies, if any. They often use test doubles⁵¹ in place of production dependencies to control the environment, making the tests very fast, independent, reliable, and cheap to maintain. Their tight feedback loop⁵² enables developers to detect and repair problems very quickly that would be more difficult and expensive to detect with larger tests. They can also be run in local and virtualized environments and can be parallelized.
Medium tests are integration tests that validate contracts and interactions with external dependencies or larger internal components of the system. While not as fast or cheap as small tests, by focusing on only a few dependencies, developers or QA can still run them somewhat frequently. They detect specific integration problems and unexpected external changes that small tests can’t, and can do so more quickly and cheaply than large system tests. Paired with good internal design, these tests can ensure that test doubles used in small tests remain faithful to production behavior.⁵³
Large tests are full, end to end system tests, often driven through user interface automation or a REST API. They’re the slowest and most expensive tests to write, run, and maintain, and can be notoriously unreliable. For these reasons, writing large tests by default for everything is especially problematic. However, when well designed and balanced with smaller tests, they cover important use cases and user experience factors that aren’t covered by the smaller tests.
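As one possible illustration of how a medium test can keep a test double faithful to production behavior, the sketch below runs the same contract check against both. Everything here is hypothetical: `InMemoryStore` stands for the double used in small tests, and `FileStore` is a simple stand-in for a production dependency such as a database client.

```python
import json
import os
import tempfile
from typing import Optional


class InMemoryStore:
    """Test double used by small tests."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class FileStore:
    """Stand-in for the production implementation: persists to a JSON file."""

    def __init__(self, path: str) -> None:
        self._path = path

    def _load(self) -> dict[str, str]:
        if not os.path.exists(self._path):
            return {}
        with open(self._path, encoding="utf-8") as f:
            return json.load(f)

    def put(self, key: str, value: str) -> None:
        data = self._load()
        data[key] = value
        with open(self._path, "w", encoding="utf-8") as f:
            json.dump(data, f)

    def get(self, key: str) -> Optional[str]:
        return self._load().get(key)


def check_store_contract(store) -> None:
    """The contract that both the double and the real implementation must honor."""
    assert store.get("missing") is None
    store.put("k", "v1")
    assert store.get("k") == "v1"
    store.put("k", "v2")  # overwriting replaces the old value
    assert store.get("k") == "v2"


def test_in_memory_store_honors_contract():
    # Small test: no I/O at all.
    check_store_contract(InMemoryStore())


def test_file_store_honors_contract():
    # Medium test: touches the file system, so it is slower but more realistic.
    with tempfile.TemporaryDirectory() as tmp:
        check_store_contract(FileStore(os.path.join(tmp, "store.json")))
```

If the production implementation ever changes behavior, the medium test fails and signals that the double, and the small tests relying on it, need attention.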

Thoughtful, balanced strategy == Reliability, efficiency

Each test size validates different properties that would be difficult or impossible to validate using other kinds of tests. Adopting a balanced testing strategy that incorporates tests of all sizes⁵⁴ enables more reliable and efficient development and testing—and higher software quality, inside and out.

Inverted Test Pyramid

Many larger tests, few smaller tests

Of course, many projects have a testing strategy that resembles an inverted Test Pyramid, with too many larger tests and not enough smaller tests.

This leads to a number of common problems:

Tests tend to be larger, slower, less reliable
The tests are slower and less reliable than they would be if the suite relied more on smaller tests.
Broad scope makes failures difficult to diagnose
Because large tests execute so much code, it might not be easy to tell what caused a failure.
Greater context switching cost to diagnose/repair failure
That means developers have to interrupt their current work to spend significant time and effort diagnosing and fixing any failures.
Many new changes aren’t specifically tested because “time”
Since most of the tests are large and slow, developers have an incentive to skip writing or running them because they “don’t have time.”
People ignore entire signal due to flakiness…
Worst of all, since large tests are more prone to be flaky,⁵⁵ people will begin to ignore test failures in general. They won’t believe their changes cause any failures, since the tests were failing before—they might even be flagged as “known failures.”⁵⁶ And as I mention elsewhere…
…fostering the Normalization of Deviance
…the Space Shuttle Challenger’s O-rings suffered from “known failures” as well, cultivating the “Normalization of Deviance” that led to disaster.

Causes

Let’s go over some of the reasons behind this situation.

Features prioritized over internal quality/tech debt
People are often pressured to continue working on new features that are “good enough” instead of reducing technical debt. This may be especially true for organizations that set aggressive deadlines and/or demand frequent live demonstrations.⁵⁷
“Testing like a user would” is more important
Again, if “testing like a user would” is valued more than other kinds of testing, then most tests will be large and user interface-driven.
Reliance on more tools, QA, or infrastructure (Arms Race)
This also tends to instill the mindset that the testing strategy isn’t a problem, but that we always need more tools, infrastructure, or QA headcount. I call this the “Arms Race” mindset.
Landing more, larger changes at once because “time”
Because the existing development and testing process is slow and inefficient, individuals try to optimize their productivity by integrating large changes at once. These changes are unlikely to receive either sufficient testing or sufficient code review, increasing the risk of bugs slipping through. It also increases the chance of large test failures that aren’t understood. The team is inclined to tolerate these failures, because there isn’t “time” to go back and redo the change the right way.
Lack of exposure to good examples or effective advocates
As mentioned before, many people haven’t actually witnessed or experienced good testing practices before, and no one is advocating for them. This instills the belief that the current strategy and practices are the best we can come up with.
We tend to focus on what we directly control—and what management cares about! (Groupthink)
In such high stress situations, it’s human nature to focus on doing what seems directly within our control in order to cope. Alternatively, we tend to prioritize what our management cares about, since they have leverage over our livelihood and career development. It’s hard to break out of a bad situation when feeling cornered—and too easy to succumb to Groupthink without realizing it.

So how do we break out of this corner—or help others to do so?

Quality work can be hard to see. It’s hard to value what can’t be seen—or to do much of anything about it.

We have to overcome the fundamental challenge of helping people see what internal quality looks like. We have to help developers, QA, managers, and executives care about it and to resist the Normalization of Deviance and Groupthink. We need to better show our quality work to help one another improve internal quality and break free from the Arms Race mindset.

In other words, internal quality work and its impact is a lot like The Matrix…

“Unfortunately, no one can be told what the Matrix is. You have to see it for yourself.”

—Morpheus, The Matrix

One way to start showing people The Matrix is to get buy-in on a set of…

Vital Signs

…“Vital Signs.” Vital Signs are a collection of signals designed by a team to reflect quality and productivity and to rapidly diagnose and resolve problems.⁵⁸

Intent

Comprehensive and make sense to the team and all stakeholders.
They should be comprehensive and make sense at a high level to everyone involved in the project, regardless of role.
Not merely metrics, goals, or data
We’re not collecting them for the sake of saying we collect them, or to hit a goal one time and declare victory.⁵⁹
Information for repeated evaluation
We’re collecting them because we need to evaluate and understand the state of our quality and productivity over time.
Inform decisions whether or not to act in response
These evaluations will inform decisions regarding how to maintain the health of the system at any moment.

Common elements

Some common signals include:

Pass/fail rate of continuous integration system
The tests should almost always pass, but failures should be meaningful and fixed immediately.
Size, build and running time, and stability of small/medium/large test suites
The faster and more stable the tests, the fewer resources they consume, and the more valuable they are.
Size of changes submitted for code review and review completion times
Individual changes should be relatively small, and thus easier and faster to review.
Code coverage from small to medium-small test suites
Each small-ish test should cover only a few functions or classes, but the overall coverage of the suite should be as high as possible.⁶⁰
Passing use cases covered by medium-large to large and manual test suites
For larger tests, we’re concerned about whether higher level contracts, use cases, or experience factors are clearly defined and satisfied before shipping.
Number of outstanding software defects and Mean Time to Resolve
Tracking outstanding bugs is a very common and important Vital Sign. If you want to take it to the next level, you can also begin to track the Mean Time to Resolve⁶¹ these bugs. The lower the time, the healthier the system.

Most of these specific signals aim to reveal how slow or tight the feedback loops are throughout the development process.⁶² Even high code coverage from small tests implies that developers can make changes faster and more safely. Well scoped use cases can lead to more reliable, performant, and useful larger tests.
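Two of the signals above reduce to simple arithmetic. The sketch below assumes hypothetical data shapes (a list of pass/fail booleans for builds, and opened/resolved timestamp pairs for defects); a real team would pull these from its own CI and bug tracking systems.

```python
from datetime import datetime, timedelta


def pass_rate(build_results: list[bool]) -> float:
    """Fraction of continuous integration builds that passed."""
    if not build_results:
        raise ValueError("no builds recorded")
    return sum(build_results) / len(build_results)


def mean_time_to_resolve(defects: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - opened) across resolved defects."""
    if not defects:
        raise ValueError("no resolved defects recorded")
    total = sum((resolved - opened for opened, resolved in defects), timedelta())
    return total / len(defects)


# Made-up sample data for illustration only.
builds = [True, True, False, True]
defects = [
    (datetime(2023, 1, 2), datetime(2023, 1, 4)),  # resolved in 2 days
    (datetime(2023, 1, 3), datetime(2023, 1, 7)),  # resolved in 4 days
]

rate = pass_rate(builds)              # 0.75
mttr = mean_time_to_resolve(defects)  # timedelta(days=3)
```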

Other potentially meaningful signals

Some other potentially meaningful signals include…

Static analysis findings (e.g., complexity, nesting depth, function/class sizes)
Popular source control platforms, such as GitHub, can incorporate static analysis findings directly into code reviews as well. This encourages developers to address findings before they land in a static analysis platform report.⁶³
Dependency fan-out
Dependencies contribute to system and test complexity, which contribute to build and test times. Cutting unnecessary dependencies and better managing necessary ones can yield immediate, substantial savings.
Power, performance, latency
These user experience signals aren’t caught by traditional automated tests that evaluate logical correctness, but are important to monitor.
Anything else the team finds useful for its purposes
As long as it’s a clear signal that’s meaningful to the team, include it in the Vital Signs portfolio.
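Dependency fan-out is one of the easier signals to compute once a dependency graph is available. This sketch assumes a hypothetical graph expressed as a dictionary mapping each target to its direct dependencies, and it treats fan-out as the count of direct dependencies while also reporting the transitive set; teams may reasonably define the signal differently.

```python
def direct_fan_out(graph: dict[str, set[str]], target: str) -> int:
    """Number of dependencies the target names directly."""
    return len(graph.get(target, set()))


def transitive_deps(graph: dict[str, set[str]], target: str) -> set[str]:
    """Everything the target depends on, directly or indirectly."""
    seen: set[str] = set()
    stack = list(graph.get(target, set()))
    while stack:
        dep = stack.pop()
        if dep not in seen:  # also guards against dependency cycles
            seen.add(dep)
            stack.extend(graph.get(dep, set()))
    return seen


# Made-up build graph for illustration only.
graph = {
    "app": {"ui", "net"},
    "ui": {"log"},
    "net": {"log", "tls"},
    "log": set(),
    "tls": set(),
}

app_direct = direct_fan_out(graph, "app")     # 2
app_total = len(transitive_deps(graph, "app"))  # 4: ui, net, log, tls
```

Tracking the transitive count over time can show when one innocent-looking new dependency drags in many more.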

Use them much like production telemetry

Treat Vital Signs like you would any production telemetry that you might already have.

Keep them current and make sure the team pays attention to them.
Clearly define acceptable levels—then achieve and maintain them.
Identify and respond to anomalies before urgent issues arise.
Encourage continuous improvement—to increase productivity and resilience.
Use them to tell the story of the system’s health and team culture.
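Defining acceptable levels and spotting anomalies can start as a few lines of code. The sketch below is only an illustration: the signal names, limits, and readings are invented, and each team would choose its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """An acceptable level for one Vital Sign (hypothetical structure)."""
    name: str
    limit: float
    higher_is_better: bool

    def is_healthy(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.limit
        return value <= self.limit


def find_anomalies(readings: dict[str, float], thresholds: list[Threshold]) -> list[str]:
    """Describe each signal that is missing or outside its acceptable level."""
    anomalies = []
    for t in thresholds:
        value = readings.get(t.name)
        if value is None:
            # A signal nobody is collecting is itself worth flagging.
            anomalies.append(f"{t.name}: no current reading")
        elif not t.is_healthy(value):
            anomalies.append(f"{t.name}: {value} is outside the acceptable level {t.limit}")
    return anomalies


# Made-up levels and readings for illustration only.
thresholds = [
    Threshold("ci_pass_rate", 0.95, higher_is_better=True),
    Threshold("small_test_seconds", 60, higher_is_better=False),
    Threshold("open_defects", 20, higher_is_better=False),
]
readings = {"ci_pass_rate": 0.97, "small_test_seconds": 85, "open_defects": 12}

anomalies = find_anomalies(readings, thresholds)
# One anomaly: the small test suite takes 85 seconds against a limit of 60.
```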

Example usage: issues, potential causes (not exhaustive!)

Here are a few hypothetical examples of how Vital Signs can help your team identify and respond to issues.

Builds 100% passing, high unit test coverage, but high software defects
If your builds and code coverage are in good shape, but you’re still finding bugs…
- Maybe gaps in medium-to-large test coverage, poorly written unit tests
  …it could be that you need more larger tests. Or, it could be your unit tests aren’t as good as you think, executing code for coverage but not rigorously validating the results.
Low software defects, but schedule slipping anyway
If you don’t have many bugs, but productivity still seems to be dragging…
- Large changes, slow reviews, slow builds+tests, high dependency fan out
  …maybe people are still sending huge changes to one another for review. Or maybe your build and test times are too slow, possibly due to excess dependencies.
Good, stable, fast tests, few software defects, but poor app performance
Maybe builds and tests are fine, and there are few if any bugs, but the app isn’t passing performance benchmarks.
- Discover and optimize bottlenecks—easier with great testing already in place!
  In that case, your investment in quality practices has paid off! You can rigorously pursue optimizations, without the fear that you’ll unknowingly break behavior.

Getting started, one small step at a time

Here are a few guidelines for getting started collecting Vital Signs. First and foremost…

Don’t get hung up on having the perfect tool or automation first.
Do not get hung up on thinking you need special tools or automation at the beginning. You may need to put some kind of tool in place if you have no way to get a particular signal. But if you can, collect the information manually for now, instead of wasting time flying blind until someone else writes your dream tool.
Start small, collecting what you can with tools at hand, building up over time.
You also don’t need to collect everything right now. Start collecting what you can, and plan to collect more over time.
Focus on one goal at a time: lowest hanging fruit; biggest pain point; etc.
As for which Vital Signs to start with, that’s totally up to you and your team. You can start with the easiest signals, or the ones focused on your biggest pain points—it doesn’t matter. Decide on a priority and focus on that first.
Update a spreadsheet or table every week or so—manually, if necessary.
If you don’t have an automated collection and reporting system handy, then use a humble spreadsheet or wiki table. Spend a few minutes every week updating it.
Observe the early dynamics between the information and team practice.
Discuss these updates with your team, and see how it begins to shift the conversation—and the team’s behavior.
Then make a case for tool/infrastructure investment based on that evidence.
Once you’ve got evidence of the value of these signals, then you can justify and secure an investment in automation.⁶⁴
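
If you'd rather keep the "humble spreadsheet" as a plain file, a few lines of script are enough. The following is a minimal sketch, not a recommended tool: the column names are hypothetical, and the values are ones you'd still collect and type in by hand each week.

```python
import csv
from datetime import date
from pathlib import Path

# Hypothetical column names; pick whichever signals your team can collect today.
FIELDS = ["week_of", "build_pass_rate", "median_review_hours", "open_defects"]


def record_week(path: Path, week_of: date, **signals: float) -> None:
    """Append one week's manually collected values, writing a header if needed."""
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        # Signals you couldn't collect this week are simply left blank.
        writer.writerow({"week_of": week_of.isoformat(), **signals})


def load_history(path: Path) -> list[dict[str, str]]:
    """Read back every recorded week for discussion with the team."""
    with path.open(newline="") as f:
        return list(csv.DictReader(f))
```

For example, calling `record_week(Path("vital_signs.csv"), date(2023, 1, 9), build_pass_rate=97.5, open_defects=12)` once a week builds up a history you can open in any spreadsheet program. The file name here is an assumption; a shared wiki table would serve the same purpose.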

What Software Quality Is and Why It Matters

To advocate effectively for an investment in software quality, we need to define clearly what it is and why it’s so important.

Is High Quality Software Worth the Cost?

martinfowler.com/articles/is-quality-worth-cost.html

In May 2019, Martin Fowler published “Is High Quality Software Worth the Cost?,” a brief article describing the tradeoffs and benefits of software quality.

Quality Type	Users	Developers
External	Makes happy	Keeps productive
Internal	Keeps happy	Makes productive

He distinguished between:

External quality, which obviously makes users happy. This, in turn, keeps developers productive, since they don’t need to respond to problems reported by users. Then Martin argues that…
Internal quality helps keep users happy, by enabling developers to evolve the software easily and to resolve problems quickly. This is because high internal quality makes developers productive, since there’s less cruft and unnecessary complexity slowing them down from making changes.

Effects of quality on productivity over time

Martin also used this hypothetical graph, based on his experience, to illustrate the impact of quality tradeoffs over time.

With Low internal quality, progress is faster at the beginning, but begins to flatten out quickly.
With High internal quality, progress is slower at the beginning, but the investment pays off in greater productivity over time.
The break even point between the two approaches arrives within weeks, not months.
Though Martin’s original graph didn’t show this, the difference in productivity between low and high internal quality is one way to visualize technical debt.⁶⁵

“Fast, cheap, or good: pick two three”

High quality software is cheaper to produce

Martin’s conclusion is that higher quality makes software cheaper to produce in the long run—that the “cost” of high quality software is negative. “Fast, cheap, or good: pick two” doesn’t hold as the system evolves. It may make sense at first to sacrifice good to get a cheaper product to market quickly. But over time, investing in “good” is necessary to continue delivering a product quickly and at a competitive cost.

Internal quality aids understanding

High quality software is cheaper because it’s easier to work with

Internal quality essentially helps developers continue to understand the system as it changes over time:⁶⁶

Fosters productivity due to the clarity of the impact of changes
When they clearly understand the impact of their changes, they can maintain a rapid, productive pace.
Prevents foreseeable issues, limits recovery time from others
Understanding helps them prevent many foreseeable issues, and resolve any bugs quickly and effectively.
Provides a buffer for the unexpected, guards against cascading failures
These qualities help create a buffer for handling unexpected events,⁶⁷ while also guarding against cascading failures.
Your Admins/SREs will thank you! It helps them resolve prod issues faster.
Your system administrators or SREs will be very grateful for building such resilience into your system, as it helps their response times as well.
Counterexamples: Global supply chain shocks; Southwest Airlines snafu
For counterexamples, recall the global supply chain shocks resulting from the COVID-19 pandemic, or the December 2022 Southwest Airlines snafu. These systems worked very efficiently in the midst of normal operating conditions. However, their intolerance for deviations from those conditions rendered them vulnerable to cascading failures.
Quality, clarity, resilience are essential requirements of prod systems
Consequently, internal software quality, and the clarity and resilience it enables, are essential requirements of any production software system.

Focusing on internal software quality is good for business…because it’s the right thing to do.

As mentioned, Martin Fowler’s argument is that internal software quality is good for business—it’s such a compelling argument that I brought it up first. However, he prefers making only this economic argument for quality. He asserts that appeals to professionalism are moralistic and doomed, as they imply that quality comes at a cost.⁶⁸

I disagree that we should sidestep appeals to professionalism entirely, and that they’re incompatible with the economic argument. I think it’s worth exploring why professionalism matters, both because it is moral and because customers increasingly expect high quality software they can trust.

Quality without function is useless—but function without quality is dangerous.

Put more bluntly, high quality may be useless without sufficient functionality, but as we’ll see, functionality without quality can be dangerous. Professionalism and morals only appear to come at a cost today. They’re actually investments that avoid more devastating future costs when a lack of focus on quality begins to impact users.

Remember the examples of unit testable bugs I described from my personal experience.

Byte padding: USCG/USN Navigation
A byte padding mistake could’ve caused US Coast Guard and US Navy vessels to crash.
Goto Fail: macOS/iOS Security
An extra goto statement endangered secure communications on billions of devices.
Heartbleed: Secure Sockets Layer
Unchecked user input compromised secure sockets on the internet for years.

I caught the first bug, fortunately; but in the last two cases, a lack of internal quality and automated testing put many people at risk.

The point is that…

Quality Culture is ultimately Safety Culture

…a culture that values and invests in software quality is a Safety Culture. Society’s dependence on software to automate critical functions is only increasing. It’s our duty to uphold that public trust by cultivating a quality culture.

It’s ultimately what’s best for business and our own personal success as well.

Users don’t care about your profits.
They care if they can trust your product.

Users are the source of your profits, but they don’t care about your profits or your stock price. They care whether or not they can trust your product.

Why Software Quality Is Often Unappreciated and Sacrificed

We need to understand, if software quality is so important, why it’s so often unappreciated and sacrificed.

Automated (especially unit) testing

Why hasn’t it caught on everywhere yet?

Apple, in January 1992, identified the need to make time for training, documentation, code review—and unit testing!

At Apple, I found a document from January 1992 specifically identifying the need to make time for training, documentation, code review—and unit testing! That’s not just before Test-Driven Development and the Agile Manifesto, that’s before the World Wide Web!

There are a few reasons why unit testing in particular hasn’t caught on everywhere yet:⁶⁹

People think it’s obvious and easy—therefore lower value
Many developers think it’s obvious and easy, and therefore can’t provide much value.⁷⁰
Many still haven’t seen it done well—or may have seen it done poorly
Many others still haven’t seen it done well, or may have seen it done poorly, leading to the belief that it’s actually harmful.
There’s always a learning curve involved
For those actually open to the idea, there’s still a learning curve, which they may not have the time to climb.
Bad perf review if management doesn’t care about testing/internal quality
They may also fear spending time on testing and internal quality will result in a bad performance review if their management doesn’t care about it.⁷¹

Have you heard these ones before?

Common excuses for sacrificing unit testing and internal quality

Of course, people give their own reasons for not investing in testing and quality, including the following:

Tests don’t ship/internal quality isn’t visible to users (i.e., cost, not investment)
“We don’t ship tests, and users don’t care about internal quality.” Meaning, testing seems like a cost, not an investment.
“Testing like a user would” is the most important kind of testing
As mentioned before, “Testing like a user would” is considered most important, so investing in smaller tests and internal quality seems unnecessary.
Straw man: 100% code coverage is bullshit
The straw man that “writing tests to get 100% code coverage is bullshit.” This speaks to a fundamental ignorance about how to write good tests or to use coverage the right way.
Straw man: Testing is a religion (implying: I’m better than those people)
For some reason, technical people, especially programmers, like to pound their chests as being against so called testing “religion” and those who practice it. It’s a flimsy excuse for trying to score social points by virtue signaling in view of one’s perceived peers. Framing a potentially reasonable discussion of different testing ideas in such a way only serves to shut it down for a superficial, unprofessional ego boost.⁷²
“My code is too hard to test.” (The Snowflake Fallacy)
The common, evidence free “My code is too hard to test” assertion, which I call the Snowflake Fallacy.
“I don’t have time to test.”
Finally, “I don’t have time to test.” This could be a brush off, or a genuine indication that they don’t know how and can’t spare the time to learn—and management doesn’t care.

Business as Usual

All these reasons are why Business as Usual persists, as well as the Complexity, Risk, Waste, and Suffering that everyone’s used to. This then allows the Normalization of Deviance to take hold.

Normalization of Deviance

Coined by Diane Vaughan in The Challenger Launch Decision

The explosion of the Space Shuttle Challenger shortly after takeoff on January 28, 1986 exposed the potentially deadly consequences of common organizational failures.
Image from https://commons.wikimedia.org/wiki/File:Challenger_explosion.jpg. In the public domain from NASA.

Diane Vaughan introduced this term in her book about the Space Shuttle Challenger explosion in January 1986.⁷³ My paraphrased version of the definition is:

A gradual lowering of standards that becomes accepted, and even defended, as the cultural norm.

Space Shuttle Challenger Accident Report

history.nasa.gov/rogersrep/genindex.htm, Chapter VI, pp. 129-131

The key evidence of this phenomenon is articulated in chapter six of the Rogers commission report.⁷⁴

The O-rings didn’t just fail on the Challenger mission on 1986-01-28…
Many of us may know that the O-rings lost elasticity in the cold weather, allowing gasses to escape which led to the explosion.
…anomalies occurred in 17 of the 24 (70%) prior Space Shuttle missions…
However, you may not realize that NASA detected anomalies in O-ring performance in 17 of the previous 24 shuttle missions, a 70 percent failure rate.
…and in 14 of the previous 17 (82%) since 1984-02-03
Even scarier, anomalies were detected in 14 of the previous 17 missions since 1984-02-03, for an 82% failure rate.
Multiple layers of engineering, management, and safety programs failed

This wasn’t only one person’s fault—multiple layers of engineering, management, and safety programs failed.⁷⁵ However, Normalization of Deviance isn’t the end of the problem.

NASA: NoD leads to Groupthink

Terry Wilcutt and Hal Bell of NASA delivered their presentation The Cost of Silence: Normalization of Deviance and Groupthink in November 2014.⁷⁶ On the Normalization of Deviance, they noted that:

“There’s a natural human tendency to rationalize shortcuts under pressure, especially when nothing bad happens. The lack of bad outcomes can reinforce the ‘rightness’ of trusting past success instead of objectively assessing risk.”

—Terry Wilcutt and Hal Bell, The Cost of Silence: Normalization of Deviance and Groupthink

They go on to cite the definition of Groupthink from Irving Janis:

“[Groupthink is] a quick and easy way to refer to a mode of thinking that persons engage in when they are deeply involved in a cohesive in-group, when concurrence-seeking becomes so dominant that it tends to override critical thinking or realistic appraisal of alternative courses of action.”

—Irving Janis, Groupthink: psychological studies of policy decisions and fiascoes

NASA: Symptoms of Groupthink

They then describe the symptoms of Groupthink:⁷⁷

Illusion of invulnerability—because we’re the best!
Belief in Inherent Morality of the Group—we can do no wrong!
Collective Rationalization—it’s gonna be fine!
Out-Group Stereotypes—don’t be one of those people!
Self-Censorship—don’t rock the boat!
Illusion of Unanimity—everyone goes along to get along!
Direct Pressure on Dissenters—because everyone else agrees!
Self-Appointed Mindguards—decision makers exclude subject matter experts from the conversation.

Any of these sound familiar? Hopefully from past, not current, experiences.

A common result of Groupthink is the well known systems thinking phenomenon called “The Cobra Effect.”

The Cobra Effect

ourworld.unu.edu/en/systems-thinking-and-the-cobra-effect

Illustration of a cobra from the Our World article “Systems Thinking and the Cobra Effect” by Barry Newell and Christopher Doll, published 2015-09-16.
Photo: Biodiversity Heritage Library. Creative Commons BY 2.0 DEED (cropped).

Pay bounty for dead cobras
This comes from the true story of when the British administration in India offered people a bounty to help reduce the cobra population.
Cobras disappear, but still paying
This worked, but the British noticed they kept paying bounties when they didn’t see any more cobras around.
People were harvesting cobras
They realized people were raising cobras just to collect the bounty…
Ended bounty program
…so they ended the bounty program.
More cobras in streets than before!
People then threw their now useless cobras into the streets, making the problem worse than before.
Fixes that Fail: Simplistic solution, unforeseen outcomes, worse problem
This is an example of the “Fixes That Fail” archetype. This entails applying an overly simplistic solution to a complex problem, resulting in unforeseen outcomes that eventually make the problem worse.

The Arms Race

Systems thinking should replace brute force in the long run.

In software, I call this “The Arms Race.” This may sound familiar:

Investment to create capacity for existing practices and processes…
We invest people, tools, and infrastructure into expanding the capacity of our existing practices and processes.
Exhaustion of capacity leads to more people, tools, and infrastructure
Things are better for a while, but as the company and its projects and their complexity grow, that capacity’s eventually exhausted. This leads to further investment of people, tools, and infrastructure.
Then that capacity’s exhausted…
Then eventually that capacity’s exhausted…and the cycle continues.
AI-generated tests may help people get started in some cases…
There’s a lot of buzz lately about possibly using AI-generated tests to take the automated testing burden off of humans. There may perhaps be room for using AI-generated tests as a starting point.
…but beware of abdicating professional responsibility to unproven tools.
However, AI can’t read your mind to know all your expectations—all the requirements you’re trying to satisfy or all the assumptions you’re making.⁵ Even if it could, AI will never absolve you of your professional responsibility to ensure a high quality and trustworthy product.⁷⁸
In the end, we can’t win the arms race against growth and complexity.
We need to realize we can’t win the Arms Race against growth and complexity.

Rogers Report Volume 2: Appendix F

history.nasa.gov/rogersrep/v2appf.htm

Richard Feynman’s final statement on the Challenger disaster is a powerful reminder of our human limitations:⁷⁹

“For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.”

—Richard Feynman, Personal Observations on Reliability of Shuttle

As professionals, we must resist the Normalization of Deviance, Groupthink, and the Arms Race. We can’t allow them to become, or to remain, Business as Usual.

Building a Software Quality Culture

Achieving and maintaining high software quality by eliminating unnecessary complexity, risk, waste, and suffering often requires changing the culture of your organization.

Let’s define specifically what we mean by culture.⁸⁰ One possible definition is that:

Culture is the shared lifestyle of a team or organization.

This lifestyle is what you see people doing together day to day, and the way they do it. For our purposes, we need to understand the essence of lifestyle, where it comes from and what shapes it. So here’s an expanded definition of “culture”:

Culture is the emergent result of a shared mindset manifest through concrete behaviors.

In order to influence lifestyle, which is the result, we have to influence concrete behaviors. In order to influence those, we need to influence people’s mindset.

However, the absolute least effective way to influence mindsets is to…

Don’t shame people for doing Business as Usual. Help them recognize and change it.

…shame people for falling into the common traps we’ve learned about. They’re so common because it’s human nature to fall into them. Instead, we need to help everyone recognize these traps, focus on avoiding them, remain vigilant against them, and change Business as Usual together.

Challenging & Changing Business as Usual

Changing cultural norms requires understanding them first.

Challenging cultural norms supporting Business as Usual isn’t easy, and it’s honestly frightening. There’s actually a good reason for that:

Everything in a culture happens for a reason—challenge reasons thoughtfully!
The existing norms do exist for a reason. The question is whether that reason holds up today. So we need to try to understand those reasons first, then challenge them thoughtfully.⁸¹
People often feel invested in old ways and fear the cost of new ways.
This is because many people feel invested in their existing methods. They fear the cost and risk of changing those methods, even if they no longer provide the value they once did. This is common human nature…
We’re often asked for data to prove a different approach actually helps—before trying it…
…and is why some will ask for data or other proof that a change will be effective before trying it.
…while we throw time, money, people, and tools at existing processes.
At the same time, they’ll continue throwing resources into existing processes as they have for years.

The most productive way to approach such a challenge requires taking the time to gather enough information and build some trust.

Challenge: Haven’t we proven that the existing ways aren’t (totally) working?
You can then carefully question whether current methods are effective, or effective enough on their own, given substantial historical evidence to the contrary. On this basis, you may persuade some to reexamine the problem and try a different approach.

Notice that this challenge to the status quo is in the form of a thoughtful question.

The power of asking good questions

“What could we do differently to improve our software quality?”

The ultimate question is “What could we do differently to improve our software quality?”⁸² However…

Questions develop shared understanding of the culture before changing it
…there are many more questions necessary to help everyone understand why things are the way they are, what needs to change—and how.
Asking encourages thinking about a problem, possible solutions
Asking good questions includes people in the process of discovery and finding solutions, which develops their own knowledge and problem solving skills.
Good questions enable new information, perspectives, and ideas to emerge
Good questions enable people to share information, perspectives, and ideas that wouldn’t otherwise arise if they were only told what to think or do.
Asking what to do is more engaging than telling what to do—produces buy in
Taking the time to ask people questions pulls people into the change process, which increases their motivation to buy into any proposed changes that emerge.

This is summed up by a great Senegalese proverb shared by Bono in Chapter 20 of John Doerr’s Measure What Matters:

“If you want to cut a man’s hair, it is better if he is in the room.”

There is one catch, though.

If people don’t know where to begin, are stuck in old ways, or are under stress…direct near term guidance may be necessary.

Your audience may get stuck. They may currently have no idea about what to change or how—because they lack knowledge, experience, or imagination to consider approaches beyond the status quo. Or, as we’ll discuss shortly, they may be under incredible stress and unable to think clearly or creatively. In that case, you may need to provide more direct guidance, at least in the beginning.⁸³

Whatever their situation, the most effective way to influence mindsets is to help people solve their own problems by…

Sell—don’t tell!

…selling, not telling.

People don’t like being told to change their behaviors, because it’s like being told to change their minds. If you know anything about people, you know they hate changing their minds unless they are doing the changing, by their own choice.

This is why we’ve emphasized asking questions, raising awareness, and working together to make quality visible—instead of imposing process changes or technical solutions through force. People need to understand and buy into changes in order to embrace them fully. We can’t force them to make changes they don’t perceive as necessary or valuable if we want the change to be successful.

Of course, not everyone’s going to change their mindset at once—some may never come around at all. However, our ultimate goal should be to…

Make the right thing the easy thing!⁸⁴

As we continue working to improve quality and make it visible, it will get easier and easier to do both. Practices and their results will become more accessible, encouraging wider and wider adoption. Eventually, we want to make it harder not to do the right thing, because the right thing will happen by default.

This will be challenging, and take time. It’s important to identify the right people to engage directly in the beginning, when we’re starting the process, and who to put off until later.

Focus on the Early Majority/Total Product

Geoffrey A. Moore, Crossing the Chasm, 3rd Edition

This returns us to the Crossing the Chasm model. As Instigators leading people to improve software quality and make it visible, focus your energy on connecting with other Instigators and the early Early Majority. Don’t worry so much about the rest—focus on delivering the Total Product, and it will take care of the other groups.

The Rainbow of Death

mike-bland.com/the-rainbow-of-death

The Rainbow of Death can help you analyze your progress over time and discover where the gaps in your program are. However, don’t necessarily start with the Rainbow of Death, as it may be too overwhelming at the beginning.

Instigating Culture Change

Essential needs an internal community must support

Instead, have a conversation to decide where to begin within the Skill Acquisition, Alignment, and Visibility cycle. You can decide to tackle the highest priority issue, the most urgent issue, the quickest win you can get, etc. You can take any of the examples from Google and Apple that I’ve shared as a starting point, then tailor them to your specific needs. The important thing is to focus, simplify, and ship something to build momentum for the next step in the cycle.

Calls to Action

No matter what project you’re on, or what company you work for, you can do something about these problems.⁸⁵ The resources for dealing with them are just as available to you as they are to any other company, given the right mindset.

I’ve shared many concepts behind the changes you may need to make. These are important, but relatively straightforward to grasp. The harder part of the problem isn’t getting people to pay attention to these concepts and to understand them, but getting them to act on them. The most important skill you’ll need to make that happen…

Learn about leadership!

The biggest challenge isn’t technical—it’s changing the mindset.

…is leadership.

Many technical people think they’re above this fuzzy “people stuff,” that driving improvements is all about data and logic and meritocracy. But look where that’s gotten us as an industry with regard to software quality and avoidable damage done to society. The purely technological mindset, and its lack of appreciation for “people stuff,” is why good practices like those we’ve discussed so often fail to spread.

Leadership is also an eminently transferable skill, highly valuable and useful no matter where you find yourself during your career. It has nothing to do with the title you happen to hold, but with how you conduct yourself to achieve alignment with others. It’s a vast topic that you can study for life, but here are a few of my favorite starting points at the moment:

John “Add Value to People” Maxwell, The 5 Levels of Leadership
John Maxwell’s personal mission is to “add value to people,” which is my favorite short definition of “leadership.” His book The 5 Levels of Leadership clearly illustrates how leadership transcends title and decision making authoritah, and what’s required to realize outstanding leadership potential.
L. David Marquet, Leadership is Language
David Marquet’s Leadership is Language is a playbook highlighting how to replace Industrial Revolution era communication habits with more empowering and productive modern habits. It illustrates how to stop merely telling people what to think and do, and how to encourage everyone to grow as decision makers and leaders. The results can literally make the difference between life and death.
Liz Wiseman, Multipliers
In a similar vein, Liz Wiseman’s Multipliers illustrates the distinction between “Diminishers” that drain intelligence and energy from organizations and “Multipliers” that amplify people’s capabilities. It catalogs and contrasts the behaviors of each, encouraging explicit awareness of our own tendencies, weaknesses, and strengths.
Scott Miller and David Morey, The Leadership Campaign
Miller and Morey’s The Leadership Campaign is a guide to the dynamics of stepping forward to lead a movement, modeled directly on political campaigning. The focus is on clarity of messaging and organization, building momentum, and taking advantage of opportunities.
Michael Bungay Stanier, The Coaching Habit
Michael Bungay Stanier’s The Coaching Habit is another book focused on language, specifically when it comes to leading individuals to think through their own challenges. It talks about how to “tame your Advice Monster” and focus on helping people develop their own solutions and capabilities.
Ken Blanchard, et al., Leadership and the One-Minute Manager
Leadership and the One-Minute Manager describes Blanchard’s “Situational Leadership” model. This model describes the need to adapt one’s leadership style to each individual over time based on their current capabilities.

Now remember from Crossing the Chasm that it falls on the Instigators to lead adoption of technologies and practices across an organization. As an Instigator, one of the lessons you’ll learn is that, sometimes…

Instigator Theory

It’s easier to change the rest of the world than your own team.

…it’s easier to change the rest of the world than your own team. I call this phenomenon Instigator Theory. However, as frustrating as this is, and as long as it takes to overcome, the basic outline of what you need to do is straightforward:

Phase One: Connect with—or build—a community of fellow Instigators
First, find your people. Put your feelers out. Invite folks to coffee, then start organizing informal meetings, and send out open invitations. See who really cares about software quality and is willing to show up to do something about it.
Phase Two: Develop resources and do the work
Next, employ your leadership skills and challenge the community to develop resources for helping individuals acquire new skills and teams align on quality practices.
- Focus, simplify, and take your time in the beginning
  As mentioned before, there’s no need to rush, and take care not to spread yourselves too thin.
- Every earlier success lays a foundation and creates space for future effort
  In time, every win you deliver will draw more people into the community, which then creates the capacity to deliver the next win.
Phase Three: Share the results—make the work and its impact visible
As your community begins delivering resources, and people put those resources to good use, make all that work and its impact visible early and often. Radiate the good work you’re doing and its results into the environment as much as you can. And as part of generating that radiation…
- Recognize the value of one another’s contributions!
  …make sure to recognize the value that each member of the community adds to the effort! This is long term work that’s often thankless, as the focus for most of the organization remains on doing business as usual. Recognizing everyone’s value is a big part of keeping up morale and momentum, and makes that value visible to others in the organization as well.

Finally, I’d like to leave you with a concrete list of things you and your fellow Instigators can work to change.

Where we are	Where we’d like to go
Slow, unreliable, expensive processes	Fast, reliable, efficient feedback loops
Lots of duplicated, complex code	Well-factored, readable, testable code
Large, complex, monolithic code reviews	Small, digestible, easily reviewable changes
Large, complex, flaky test suites	Balanced, stable Test Pyramid-based suites
Expensive metrics people can’t act upon	Meaningful, useful Vital Signs taken seriously
Reinventing the wheel in wasteful silos	Sharing stories, language, useful examples
Complexity, risk, waste, and suffering	Clarity, confidence, efficiency, and delight
Testing (only) like a user would	Testing like a user’s life depends on it!

Let’s compare where many of us are today, without good quality practices in place, to where we’d like to get everyone to go.⁸⁶

Ultimately, we want to replace painful, expensive processes with fast, reliable, and efficient feedback loops. We can start to do that by…
…rejecting duplication and excess complexity in our code, and by writing readable, testable code instead.
We can reject large, monolithic code reviews hiding lots of bugs and insist upon smaller, more reviewable changes.
We can reduce the size, complexity, and unreliability of existing test suites by evolving towards a balanced, reliable suite based on Test Pyramid concepts.
We can throw out meaningless metrics that are expensive and painful to collect in favor of meaningful, actionable, and relatively cheap Vital Signs.
We can stop wasting resources on having teams wrestle with common quality problems separately, and help one another by sharing stories, language, and working examples.
Improving our software quality can minimize complexity, risk, waste, and suffering, and the increased understanding it affords will yield clarity, confidence, efficiency, and delight.
Once freed from the mental trap of testing only like a user would, we can begin testing like a user’s life depends on it.

Ultimately, creating great, high quality software shouldn’t require heroics, sacrifice, or endless pursuit of technologies or resources.

Making software quality visible will…
start a Chain Reaction that will…
minimize suffering—and ultimately…
Make the right thing the easy thing!

Ensuring that everyone can see what high software quality work looks like helps create the conditions for a positive Chain Reaction. As principles and practices spread, and priorities align around quality, we’ll see suffering subside as we keep making the right thing easier! Then maybe one day, it’ll be so damn easy that no one can help but do the right thing by default.

mike-bland.com/making-software-quality-visible

Thank you!

Acknowledgments

I appreciate all the folks who’ve contributed to this presentation!

Ono Vaticone, Microsoft
John Turek, Aetion
Isaac Truett, EAB
Chris Douglas, AARP
Jake Spracher
Oleksiy Shepetko, Microsoft
Alex Buccino, Squarespace

And my fellow QCI Instigators at Apple for your past wisdom—
you know who you are!
(And you know that I know who you are!)

History

2023-01-12: Presented to Aetion at the invitation of John Turek, a former Google colleague.

2023-01-17: Presented to Microsoft at the invitation of Ono Vaticone, a former Apple colleague and Quality Culture Initiative member.

2023-03-10: Presented an updated, abridged version at the DevOps Enterprise Forum in Portland, Oregon. This was the first time I incorporated more of my personal story in the beginning, using the Crossing the Chasm and Rainbow of Death models.

2023-05-23: Presented to EAB at the invitation of Isaac Truett and Brendan Mannix. I created new content to emphasize the corrosive effect that excessive, unmanaged complexity has on software quality. I added an example of how to manage complexity in the development process and in the application architecture based on my elistman project.

2023-08-17: Presented to Squarespace PTE Summit at the invitation of Alex Buccino. This version introduced extensive reorganizing of sections, along with reediting most of them to no longer assume a particular order.

TODOs

Here are some items I’m still thinking about adding to the script, most likely as footnotes:

Do Hard Things: The typical view of suffering as building “mental toughness” has the exact opposite effect.
Connect Kurt Lewin’s model of change (from The Rainbow of Death) to “focus and simplify” and the “quality culture cycle” as well.
Mention “glue work” and the need to discuss, prioritize, and recognize leadership generally.
Draw a parallel between market incentives for security as well as those for quality—which are often intertwined.
- Perspectives on the SolarWinds Incident
Add one or more references to the “Mousetrap” sequence in Hamlet:
- Instructions to the players to “hold a mirror, as ‘twere, up to nature.”
- The unsettling effect of seeing one’s own misdeed presented back to you in some form.

Footnotes

Joel Schwartzberg’s Get to the Point! Sharpen Your Message and Make Your Words Matter inspired me to articulate this clear, concise point up front. ↩
I expanded the introduction to define “software quality” and how it relates to complexity at the request of Brendan Mannix of EAB. I presented an edited version of this talk at EAB on 2023-05-23, organized by Brendan and Isaac Truett. ↩
David Marchese’s interview with Cal Newport for the New York Times on 2023-01-23, The Digital Workplace Is Designed to Bring You Down, bears mentioning here. Newport notes that with the rise of “knowledge work”, “we fell back to a proxy for productivity, which is visible activity.” Then:

“Visible activity as a proxy for productivity spiraled out of control and led to this culture of exhaustion, of I’m working all the time, I’m context shifting all over the place, most of my work feels performative, it’s not even that useful.”

He also noted Peter Drucker’s coining of the term “knowledge work” in 1959 and the consequences for management:

“So Drucker is saying that knowledge workers need to manage themselves. Managers just need to set them up to succeed. But then what do you manage? Visible activity as a proxy for productivity was the solution. We need something we can focus on day to day and feel that we’re having a role in pushing work: Let’s just manage visible activity. It’s this compromise that held the pieces together in an imperfect way, and then in the last 20 years, this centrifuge of digital-accelerated work blew it apart. The compromise is now failing.”

So there is a danger that trying to make work visible could dissolve into productivity theatre. At the same time, Newport unpacks his concept of “slow productivity,” the topic of his next book [emphasis mine]:

“So how do you actually work with your mind and create things of value? What I’ve identified is three principles: doing fewer things, working at a natural pace,⁹ but obsessing over quality. That trio of properties better hits the sweet spot of how we’re actually wired and produces valuable meaningful work, but it’s sustainable.”

⁹ Meaning one with more variability in intensity than the always-on pace to which we’ve become accustomed.

This presentation walks the line between making visible the aspects of our work that truly speak to software quality, and superficial displays of productivity. People often want to jump straight to solutions, and start generating performative “data” to prove their value. In doing so, they fail to grasp the underlying issues and end up continuing the negative cycle of increasing effort yielding decreasing quality.

We first need to help people get a handle on the issues and understand what we need to accomplish. This is why this talk makes the case for software quality and illustrates its obstacles before discussing solutions. It’s also why the solutions offered are rudimentary guidelines and techniques for inviting nuanced discussion and developing shared understanding that grows over time.

The punchline being, in the end, improving software quality is about leadership far more than it is about technology. Leadership requires helping people clearly see principles in action and getting results, so that they may learn from the example and achieve similar success. Hence, though making quality work visible may remain an imperfect practice involving trade-offs and compromises, it’s essential to improving software quality broadly across organizations. ↩
This was after getting a theatre degree from Christopher Newport University and quitting my band, The Prime Ministers of Audiophonics. I ended up working in the shipping department of Optima Graphics, just outside of St. Louis, before going back to CNU for computer science. After leading a successful protest to save the Computer Science and Environmental Science grad programs, I finished my CS degree, and that was that. ↩
In Automated Testing—Why Bother?, I define automated testing as: “The practice of writing programs to verify that our code and systems conform to expectations—i.e. that they fulfill requirements and make no incorrect assumptions.” ↩ ↩²
Also see my blog post “Coding and Testing at Google, 2006 vs. 2011.” ↩
Googlers: My Percent score was over 90% when I left Google in September 2011. ↩
The Crossing the Chasm model can be traced back to Everett Rogers’s Diffusion of Innovations model from 1962. That model differentiated the five populations, but lacked a “chasm.” The chasm was added by Lee James and Warren Schirtzinger of Regis McKenna Inc., where Moore also worked.

Articles that dig into the Chasm’s history include:
- Crossing the Chasm Summary
- Chasm Theory Development: The Complete History
The first article above presents a number of criticisms of the Crossing the Chasm model. Like criticisms of the Test Pyramid model, I think they split hairs and miss the point. Not because their points aren’t valid, but because they’re better presented as further refinements for consideration after grasping the concept, not criticisms of the model.

No model is perfect, but a good one is at least effective at bringing new people into the conversation. Once they’re in, and comfortable with the concepts and the language, we can point out nuances not captured by the model. But without the model, people may not gain access to the conversation to begin with. ↩
I’ve had people suggest that Laggards are actually the dominant population, comprising the actual majority. I remind them that it only seems that way—they’re the most vocal because they feel they have something to lose. Once both Majorities adopt an innovation, their voices lose power. ↩
I do have one story about engaging a Laggard that had a genuinely happy, “everybody wins” ending. During my first Test Mercenaries engagement, one member of the client team who was not a manager was clearly angling to become one. This person thought automated testing was a waste of time, and opposed what my Mercenary partner and I were there to do.

At some point, I proposed a chat with this person out on the second floor patio at the east end of Building 43. We were frank with one another about what our problems were—but then, quite unexpectedly, we found common ground. One of this person’s primary concerns had to do with overnight build breakages.

The Mountain View team would have working code (or, at least, code that successfully built) checked in at the end of the day. The team on the other side of the world would check in new changes overnight. The Mountain View team would pull the changes in the morning, only to find that the code failed to build. They spent time having to fix the breakages before getting to their own work, and the cycle continued.

At that point, I said one of the first things we could do would be to set up a Chris/Jay Continuous Build system. Then we’d enforce a policy that the build must remain green at all times, and any breakages must be fixed or rolled back immediately. These tasks happened to be part of the Test Certified program the Mercenaries were there to help the team adopt. It also happened that we could make immediate progress on these tasks without even requiring anyone to write tests at this point.

We were in full agreement, and went to work making it happen. I got the continuous build system working, the team adopted the “no breakages” policy, and the rest of the engagement continued successfully.

About a year or two later, I got an email from my erstwhile nemesis, who’d moved on to another project within the company. Much to my surprise, it was a thank you email—my nemesis had since become a strong automated testing advocate.

This kind of experience is the exception rather than the norm. My standard advice is to ignore Laggards and give them a wide berth. But sometimes you get unlucky such that you can’t get around them—but then, you might get lucky and find more in common than you’d expected. It’s amazing how shared objectives and shared success can bring about positive changes in people and relationships.

By the way, this event also inspired me to organize the Revolution Fixit in January 2008. I’ll cover that story in a later footnote. ↩
Albert Wong, former Googler and member of the U.S. Digital Service. I saw his original model in his presentation on his early work as a member of the USDS, working with Citizenship and Immigration Services. In my mind, I instantly saw it snapping into the Chasm—and helping me make sense of the Google Testing Grouplet’s story.

I asked Albert if I could borrow the model, and he agreed. I also asked if he minded me giving it a funny name, and he didn’t.

The multicolored span of the model reminds me of a rainbow, and my weird sense of humor inspired me to pair it with an incongruous concept. Hence, “The Rainbow of Death.”

Two years after I started using the model, I realized how the concept of “Death” actually fits. The model helps explain how the problem you want to solve may not be the problem you have to solve first. To achieve that insight, old ideas about the problem and what’s required to solve it have to die to make room for new ideas.

For example, the Testing Grouplet wanted to improve automated testing and software quality—but we had to figure out how to sell others on it. We eventually realized we needed to do more than train new hires once, host tech talks, and give out books. We kept doing those things, but we couldn’t only continue putting information out there in the hopes that people would use it. We realized we needed to get people more directly engaged—leading to Testing on the Toilet, Test Certified, the Test Mercenaries, and a series of Fixits. Our work also influenced build and testing infrastructure development, culminating in the launch of the Test Automation Platform.

More to come in a following footnote… ↩
This was how Testing Grouplet co-founder Bharat Mediratta described the choice we had to make about how to operate. ↩
The Revolution Fixit was the third Google-wide testing Fixit I organized, helping set up the TAP (Test Automation Platform) Fixit two years later. This event introduced Google’s now famous cloud based build and testing infrastructure to projects across the company. I named it after my favorite Beatles tune (tied with “I Am the Walrus”), leading to spectacularly Beatles-themed announcements, advertisements, prizes, etc.

My aforementioned experience setting up the continuous build system after engaging with a Laggard became the genesis for the Revolution. Because the standard tools at the time were so slow and painful to use, I tried using a couple of experimental tools in the build. SrcFS was an officially supported source control cache that hadn’t yet been released for general use. Forge was a distributed build and test execution system developed by someone outside of the tools team as a 20% project. Together, these tools immediately brought the full build and test cycle times from about 1h 15min to about 17min.

The only thing was, these new tools were far more strict. Each BUILD target had to fulfill two conditions in order to be able to use them:
1. All input artifacts had to be in source control. At the time, because the existing source control system was overloaded and slow, many projects maintained larger artifacts in NFS. While some artifacts might’ve been “version controlled” via naming conventions, there was no enforcement of this, automated or otherwise, rendering verifiably reproducible builds effectively impossible. (We also called these “non-hermetic” builds, as opposed to “hermetic” builds in which all inputs were properly version controlled.)
  
  In order for Forge to successfully build a project, all of its inputs had to be accessible via source control only, full stop. It couldn’t access NFS even if it wanted to.
2. The dependency graph had to be completely specified. Many projects at the time contained two common yet trivially fixable dependency problems:
  1. Undeclared data dependencies for build and test targets. This was sometimes related to not having all input artifacts in source control, but many times they already were.
    
    The fix was to add the necessary declarations to the affected build or test targets that should’ve been specified to begin with.
  2. C++ headers that were not #included directly in the files that needed them. My memory has grown a little fuzzy regarding exactly how this broke Forge builds, but I’ll try to recall it as best I can.
    
    Forge would scan each source file for the headers it needed to build and ship them to the remote build machine. However, I don’t think the scan was as complete as the C preprocessor, at least not at that time. Either that, or the SrcFS-aware Forge frontend restricted the search paths more so than the previous build frontend. At any rate, Forge builds worked perfectly well if all necessary headers were #included directly where they were needed. However, if a file only indirectly #included a header containing definitions used directly within that file, the Forge scan might not find it. Then, if any of the missing headers weren’t standard system or compiler headers available on the remote system, the build would break.
    
    The fix was to add the necessary #include declarations to the source files and any necessary package dependencies to the BUILD targets for those files. Again, these declarations and dependencies should’ve been specified to begin with.
  The officially supported tools at the time unintentionally allowed these dependency problems to pass through for years. Forge, however, had to be far more strict, since it had to ship everything it needed for building and execution to remote machines.
  
  Also, in both cases, the fixes to get the code building in Forge were trivial, by explicitly declaring existing dependencies.
Granted, there were other obstacles to using Forge for some builds and tests that took more time to resolve. However, failure to meet these basic conditions was the most common issue by far. Satisfying these conditions proved relatively quick and easy, and created more time and space to tackle the larger problems eventually.
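
To make the second condition concrete, here is a minimal sketch of the kind of strictness check described above. It is not how Forge worked internally; the function name, the target labels, and the idea of comparing a "declared" map against a "used" map are all assumptions made for illustration. It reports every input a target actually reads but never declares.

```python
def undeclared_dependencies(declared, used):
    """Return {target: sorted inputs that are used but never declared}.

    Both arguments map a target name to the set of inputs it declares
    or actually reads. This models the strictness rule only; it is a
    hypothetical helper, not any real build tool's API.
    """
    missing = {}
    for target, inputs in used.items():
        gap = set(inputs) - set(declared.get(target, ()))
        if gap:
            missing[target] = sorted(gap)
    return missing


# Hypothetical targets: what each one declares vs. what it really reads.
declared = {
    "//app:server": {"//lib:net"},
    "//app:server_test": {"//app:server"},
}
used = {
    "//app:server": {"//lib:net", "//lib:strings"},
    "//app:server_test": {"//app:server", "testdata/config.txt"},
}

print(undeclared_dependencies(declared, used))
# prints {'//app:server': ['//lib:strings'], '//app:server_test': ['testdata/config.txt']}
```

In this toy model, the fix for each reported gap matches the one described above: add the missing declaration to the target, and the report comes back empty.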

Since I’d already run two company wide Testing Fixits (in 2006 and 2007), I immediately saw an opportunity to lead another new Fixit. Hence, the Revolution Fixit, which rolled out SrcFS, Forge, and also Blaze (the original name for Bazel). Between these three, we could realize the dream goal of “five minute (or less) builds” for every project. Once projects checked their build inputs into source control, and properly declared their dependencies, then suddenly their build cycles dropped from hours to minutes.

Also, solving these dependency problems and speeding up build and test execution times suddenly made a lot of code much easier to test.

Faster builds plus two years of publishing Testing on the Toilet and promoting Test Certified dramatically weakened the “I don’t have time to test” excuse. People had gradually learned what the “right thing” was with regard to testing by that point, and finally had the power to do it.

One of the neatest things about the Revolution was that, for weeks afterwards, I would hear people talking about “Revolutionizing” their builds. Even though not every project participated fully on the Fixit day, within a year, every project had migrated to the new infrastructure. I compare the before and after effects in Coding and Testing at Google, 2006 vs. 2011.

In the two years after the Revolution:
- Other Testing Grouplet members organized a three month, interoffice Test Certified Challenge which greatly increased participation.
- The Build Tools team organized a Forgeability Fixit to try to resolve all remaining Forge blockers.
- Then I organized the Test Automation Platform Fixit, essentially concluding the Testing Grouplet’s primary mission after five years.
I hadn’t written in detail about either the Revolution or the TAP Fixit before I ran out of steam writing about Google in 2012. Time has passed, and many memories have faded, but one day I may yet try to share more of what I’m able.

Again, one can’t always expect such productive, high impact outcomes as a result of engaging with a Laggard. It’s even possible I might’ve stumbled upon the idea for the Revolution eventually even without that encounter. But having had the encounter, it demonstrates the virtue of finding common ground, versus bearing a grudge and working against one another out of spite. ↩
Thanks in large part to the current TotT coordinator, Andrew Trenk. ↩
This included Jake Spracher and Kirk Russell, who both left Apple before I did. The other folks are still at Apple, and therefore I’ll leave them anonymous for now. ↩
After hitting the wall for about the third time, at Apple, I eventually realized the Rainbow of Death wasn’t an answer key, but a blueprint. Yes, it can help trained experts understand what the finished structure looks like. However, it has to come together over time, with many adjustments along the way. You have to find and purchase a site, prepare the site, put in the framing, then the electrical and plumbing infrastructure, etc. You can’t have the bulldozers, construction workers, roofers, siders, painters, and interior decorators all start at the same time. And the assumption is they are all already knowledgeable in what they need to do, and they’re already bought into doing it.

Spreading adoption of good testing practices has its own order of dependencies—and you have to provide education and secure buy in as you go. Sharing the Rainbow of Death is fun and useful for existing Instigators, especially after completing the mission, showing how years of chaos converged into achievement. But it’s not the most effective tool for recruiting new Instigators and influencing the Early Majority. There really aren’t any shortcuts; it’s always going to take time.

In other words, my own idea about how to approach the problem using the Rainbow of Death needed to die, so new ideas could emerge. Specifically, I needed to set aside the complexity of the Rainbow of Death, and embrace the “focus and simplify” principle as a starting point instead. ↩
An Apple internal article used the example of Amundsen and Scott’s expeditions to the South Pole to illustrate the need to “focus and simplify.” Amundsen focused on getting there with the best sled dogs, succeeded on 1911-12-14, and survived. Scott tried a diversified approach, and did reach the South Pole on 1912-01-17, but he and his crew all died during the return trip.

Other articles outside Apple highlight other differences in the mindset and leadership styles between the two. Amundsen was adaptable to conditions beyond his control, learned from the wisdom of others, assembled the most skilled team possible, and paid attention to details. Scott didn’t heed the weather, was casual about team composition and details, and plowed ahead through sheer assertion of confidence. Like Feynman later warned, nature will not be fooled by public relations.
- The Leadership Lessons of the Race to the South Pole
- South to the Pole: Leadership Wins the Race
↩
This is similar to Jim Collins’s Flywheel Effect. ↩
To recap the earlier footnote: I wrote The Rainbow of Death to describe how the Testing Grouplet made focused, deliberate progress over time. Ironically, I then kept trying to use the model in several organizations to launch a bunch of efforts in parallel from the very start. Thankfully, I finally learned my lesson at Apple. ↩
Greg McKeown calls this “making a millimeter of progress in a million directions” in his book Essentialism: The Disciplined Pursuit of Less. ↩
Of course Martin Fowler is famous for popularizing the term “refactoring” thanks to his book Refactoring: Improving the Design of Existing Code. He defines the term specifically thus (on the refactoring.com page):

“Refactoring is a disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior.

“Its heart is a series of small behavior preserving transformations. Each transformation (called a ‘refactoring’) does little, but a sequence of these transformations can produce a significant restructuring. Since each refactoring is small, it’s less likely to go wrong. The system is kept fully working after each refactoring, reducing the chances that a system can get seriously broken during the restructuring.”

The spirit of this is encapsulated by a famous tweet from Kent Beck:

“for each desired change, make the change easy (warning: this may be hard), then make the easy change”

— https://twitter.com/kentbeck/status/250733358307500032

Also note on the refactoring.com page that Martin specifically asserts that “Refactoring is a part of day-to-day programming” and goes on to describe how. In the book, Martin gives this advice to those who still feel they need to ask for permission before refactoring anything:

“Of course, many managers and customers don’t have the technical awareness to know how code base health impacts productivity. In these cases, I give my most controversial advice: Don’t tell!

“Subversive? I don’t think so. Software developers are professionals. Our job is to build effective software as rapidly as we can. My experience is that refactoring is a big aid to building software quickly.”

↩
I learned about this specifically from Wolfgang Trumler, who I believe got it from Joshua Kerievsky’s book Refactoring to Patterns. ↩
Remember from earlier that Apple’s goto fail bug was hidden by six copies of the same algorithm in the same file. To see how unit testing discipline could’ve caught or prevented this by discouraging duplication, see my article “Finding More Than One Worm in the Apple.” This example also illustrates that the Don’t Repeat Yourself (DRY) principle isn’t a purely academic concern.

There’s a school of thought that suggests duplication is OK before landing on the correct abstraction. I consider this dangerous advice, because it’s so easily misunderstood and used to justify low standards. Programmers are notorious for taking shortcuts in code quality in order to move onto the next new thing to work on. They’re also notorious for using any available rationale to justify this behavior, and often disparaging more thoughtful approaches as “religion.” (Not that some can’t get carried away in the opposite direction—but it’s more common to find programmers attacking “religion” than programmers that are certifiable zealots.)

I understand the utility of duplicating bits of code in one’s private workspace while experimenting with a new change. However, I think the fear of the potential costs of premature abstraction is overblown. The far, far greater danger is that of “experimental” duplication getting shipped, leading to hesitation to change shipping code. Rather than the “hasty abstraction” getting baked in, dangerous duplication gets baked in instead.

After all, a premature abstraction should prove straightforward to reverse. Working with it should quickly reveal its shortcomings, which suggest refactoring it or breaking it apart in favor of duplicating its code for some reason. If it wasn’t premature, then making changes to the only copy is less time consuming and error prone than having to update multiple copies.
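
As a rough illustration of "making changes to the only copy," the sketch below keeps the "stop at the first failure" logic in a single helper rather than repeating it inline at every call site. This is not Apple's code or its actual fix; `check_steps`, `verify_handshake`, and the step functions are hypothetical names invented for the example.

```python
def check_steps(steps):
    """Run each step in order; return the first nonzero error code, else 0.

    Hypothetical helper: the one copy of the "bail out on first failure"
    logic that callers would otherwise duplicate inline.
    """
    for step in steps:
        err = step()
        if err != 0:
            return err
    return 0


def verify_handshake(update_hash, finalize_hash, check_signature):
    # Every caller shares the single implementation above, so a fix or a
    # test applied there covers all of them at once.
    return check_steps([update_hash, finalize_hash, check_signature])


def ok():
    return 0


def bad_signature():
    return -1


print(verify_handshake(ok, ok, ok))             # prints 0
print(verify_handshake(ok, ok, bad_signature))  # prints -1
```

With only one copy of the control flow, there is only one place for a stray early exit to hide, and one small function to cover with unit tests.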

Replacing duplication with a suitable abstraction after the fact should be easy, but it gives cover to potentially unnoticed bugs in the meantime. Again, goto fail illustrates how easy it is to miss bugs in duplicate code. Once you’ve seen the first copy, the rest tend to look the same, even if they’re not. Our brains are so eager to detect and match patterns that they trick us into skipping over critical details when we’re not careful. (I believe this is because we process duplicate code with “System 1” thinking instead of more expensive “System 2” thinking, per Thinking, Fast and Slow.) ↩
We all know a 50 line code change is generally much faster and easier to review than a 500 line change. (500 lines of new or changed behavior, that is—500 lines of search and replace or deleted code is different.) Encouraging smaller reviews encourages decomposing larger changes into a series of smaller ones that can be independently tested, reviewed, and merged. This enables more thorough reviews, faster and more stable tests, and higher long term code quality and maintainability.

Even so, some hold onto the dated belief that one should submit entire feature changes at once to avoid “dead code.” The thinking, I suppose, is that one risks introducing unused code if a larger change is introduced one piece at a time. The value judgment seems to be that unused code is a greater risk to quality than, say, poorly tested code.

This, however, increases the risk of checking in “deadly code” that contains a bug that could harm users in some way. This is because larger changes are generally more difficult to test and review thoroughly. Overcompensating for poor design sense, poor communication, poor code quality, and poor process by mandating ill advised all-at-once changes can’t overcome those issues. In fact, it all but guarantees their perpetuation. ↩
Of course, you’ll hear people make some variation of the excuse “It’s just test code” for writing sloppy tests. However, if the tests are there to ensure the quality and readiness of the production code, then the tests are part of our production toolchain. If a test fails, it should halt production releases until we’ve aligned the reality of the system’s behavior with our expectations (like Toyota’s andon cord). If a failure doesn’t warrant a halt in production, the test is a waste of resources (including precious developer attention) and should be removed. As such, our tests deserve as much respect and care as any other part of our value-creating product or infrastructure. ↩
In Working Effectively with Legacy Code, Michael Feathers defines “legacy code” thus:

“To me, legacy code is simply code without tests.”

—Preface, p. xvi

His rationale, from the same page:

“Code without tests is bad code. It doesn’t matter how well written it is; it doesn’t matter how pretty or object-oriented or how well-encapsulated it is. With tests, we can change the behavior of our code quickly and verifiably. Without them, we really don’t know if our code is getting better or worse.”

Of course, we can also change our code while preserving its behavior quickly and verifiably, for the purpose of refactoring. ↩
Feathers further explains that “seams” are where we can change the behavior of code without changing the code itself. There are three kinds of seams:
- Preprocessor seams use #define macros to rewrite the code in languages that use the C preprocessor.
- Link seams use a static or dynamic linker, or the runtime loader, to change how a program binary is built or run. Examples include manipulating the LD_LIBRARY_PATH or CLASSPATH environment variables (or their equivalents in other languages’ build and runtime environments).
- Polymorphic seams rely upon dependency injection to build an object graph at runtime. This allows the program itself to choose which implementations to include—such as test programs using test doubles to emulate production dependencies.
Polymorphic seams are the most common and most flexible kind, as well as the first one we reach for to write testable code. The term is essentially synonymous with “dependency injection.” Preprocessor and link seams aren’t as flexible, scalable, or easy to use, but can work if you have no reasonable opportunity to introduce polymorphic seams.
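
For readers who haven't seen one, here is a minimal sketch of a polymorphic seam. It is an illustration only, not an example taken from Feathers's book; the `Mailer` interface, `SmtpMailer`, `FakeMailer`, and `Greeter` are all hypothetical names. The `Greeter` receives its dependency from the caller, so a test can inject a test double without changing any of `Greeter`'s code.

```python
from typing import Protocol


class Mailer(Protocol):
    """Hypothetical interface that defines the seam."""

    def send(self, to: str, body: str) -> None: ...


class SmtpMailer:
    """Stand-in for a production implementation that would use a real server."""

    def send(self, to: str, body: str) -> None:
        raise NotImplementedError("production dependency, not used in tests")


class FakeMailer:
    """Test double that records messages instead of sending them."""

    def __init__(self) -> None:
        self.sent = []

    def send(self, to: str, body: str) -> None:
        self.sent.append((to, body))


class Greeter:
    """Receives its Mailer from the caller rather than creating one itself."""

    def __init__(self, mailer: Mailer) -> None:
        self._mailer = mailer

    def welcome(self, address: str) -> None:
        self._mailer.send(address, "Welcome aboard!")


# A test builds the object graph with the double; production code would
# pass an SmtpMailer instead.
fake = FakeMailer()
Greeter(fake).welcome("user@example.com")
print(fake.sent)  # prints [('user@example.com', 'Welcome aboard!')]
```

The behavior at the seam changes depending on which `Mailer` gets passed in, while `Greeter` itself stays untouched.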

Note that using any seam successfully depends on the quality of the interface that defines it. The upcoming Scott Meyers quote speaks to that.

See the footnotes from The Test Pyramid on test doubles and internal APIs for some details on the benefits of dependency injection. ↩
I first started using electrical outlets as an example in Automated Testing—Why Bother?:
First, we need to understand the fundamental building block of testable code: Abstractions, as defined by interfaces. We create abstractions every time we write a class interface, or a module interface, or an application programming interface. And these abstractions perform two powerful functions:
- They define a contract such that certain inputs will produce certain outputs and side-effects.
- They provide seams between system components that allow for isolation between components.
My favorite example of a powerful interface boundary is an electrical outlet. The shape of the outlet defines a contract between the power supplier and the power consumer, which remain thoroughly isolated from one another beyond the scope of that physical boundary.²¹ It’s easier to reason about both sides of the interface than if the consumer was wired directly into the source.

In software, problems arise when we fail to consider one or the other of these functions, when either the contract isn’t rigorously defined and understood, or when the interfaces don’t permit sufficient isolation between components. This often happens when we fail to design our interfaces intentionally.

In contrast, the more intentional our interfaces, the more natural our abstractions and seams. Automated testing obviously serves to validate the constraints of an interface contract; but the process of writing thorough, readable, reliable tests also encourages intentional interface design. “Testable” interfaces that enable us to exercise sufficient control in our tests tend to be “good” interfaces, and vice versa. In this way, testability forces a host of other tangible benefits, such as readability, composability, and extensibility.

Basically, “testable” code is often just easy to work with!

²¹ For a list of electrical plug and outlet specs used across the world, see: https://www.worldstandards.eu/electricity/plugs-and-sockets/
Put more concretely: the power source could be anything from coal to wind, hydro, solar, or a hamster wheel. The consumer could be a lamp, a computer, or a wall of Marshall stacks. The shape of the outlet should ensure the voltage and amperage match such that neither side cares what’s on the other; it all just works! A fault, failure, or other problem on one side won’t usually damage the other, either. This is especially true given common safety infrastructure such as surge protectors, fuses, and circuit breakers. Plus, you can use an electrical outlet tester as a test double to detect potential wiring issues.

It also greatly simplifies debugging (also sampled from my 2022-12-23 email with Alex Buccino):
- If a plugged in device stops working, but the lights are still on in your house/building, you can check a few things yourself. You can see if it’s unplugged, if a switch was flipped, if a fuse/breaker blew, or if the device itself is faulty. You can pinpoint and fix most of these issues quickly, with no need to worry about the electrical grid.
- However, if all the lights went off in your house at the same time, the problem’s beyond your control. Unless you work for the electric company, you should be able to trust that the company will send a crew to resolve the issue shortly.
- Were the device wired into the electrical system directly, however, your debugging and resolution would be more costly and risky. Also, the delineation of responsibility between yourself and the electric company might not be as clear.
The common electrical outlet is a remarkably robust interface that unleashes enormous productivity every day—imagine if software in general was even remotely as reliable! ↩
This is a paraphrase of a similar statement by my former colleague Max Goldstein. ↩
I’m using “little-a agile” here, though this description certainly applies to the “capital-A Agile” methodology. I’m also reminded of the adage popularized by Dwight D. Eisenhower, “Plans are worthless, but planning is everything.”

The Quote Investigator page for the Eisenhower quote has a great summary as well:

“The details of a plan which was designed years in advance are often incorrect, but the planning process demands the thorough exploration of options and contingencies. The knowledge gained during this probing is crucial to the selection of appropriate actions as future events unfold.”

↩
For details on how that came about, see the “Man on the Moon” section of my Test Mercenaries post. ↩
The name “Quality Quest” was borrowed from an earlier program called “Test Quest,” a gamified contest run over the course of a month. ↩
Recall that I defined “making software quality visible” as “providing meaningful insight into quality outcomes and the work necessary to achieve them.” ↩
I find the insights from The Story Grid to be very helpful when it comes to thinking through the process and mechanics of storytelling. Though it’s not all directly applicable to technical storytelling, some concepts definitely translate, such as “core need/value,” “controlling idea,” and “objects of desire.” For a concise overview, see 1-Page Book Plan: The Story Grid Foolscap. ↩
This advice is also congruent with Joel Schwartzberg’s Get to the Point! Sharpen Your Message and Make Your Words Matter. For more background on this particular phrase and the spelling of “lede,” see: Why Do We ‘Bury the Lede?’ The article’s apt subtitle is “We buried ‘lead’ so far down that we forgot how to spell it.” The introductory summary states:

“A lede is the introductory section in journalism and thus to bury the lede refers to hiding the most important and relevant pieces of a story within other distracting information. The spelling of lede is allegedly so as to not confuse it with lead (/led/) which referred to the strip of metal that would separate lines of type. Both spellings, however, can be found in instances of the phrase.”

↩
This phrase was my extemporaneous response to a Slack comment during a major QCI presentation inside Apple. The commenter asserted, roughly, that the presentation had been abstract to that point, and they were waiting to be told what to do, please. I responded, “Practices need principles. We’re getting there.”

More generally, the commenter’s apparent posture is a big part of why software quality issues continue to plague society. As a species, especially in the Internet era, we’re programmed to favor “System 1” thinking and to jump to the nearest shortcut.

If we’ve already embraced the right principles and mindset, or absorbed the best possible examples earlier in our career, this is less of a problem. In that case, it may be more efficient to go straight to the examples if we already “get it.” My former Test Mercenaries colleague Paul Hammant suggests this in his Tutorials vs. Reference Docs vs. Examples blog post:

“The more experienced developers get, the more likely they are to leave tutorial and api-doc as a way of gaining knowledge of a thing, and more toward examples.”

But if we haven’t been exposed to good practices and the principles behind them already, we need at least some deliberate context building first. (Sadly, this is still the most common case, apparently.) When we see a new practice that runs counter to all the examples we’ve seen before, we need a little preparation first. Otherwise we may reflexively dismiss a potentially valuable new practice as pointless nonsense, absent sufficient context and insight to understand the problem it solves.

BTW, earlier I referred to the commenter’s “apparent” posture, because I knew this person already “gets it.” I was a little surprised in the moment, but we worked out a fuller understanding later. Regardless, the commenter may not’ve found reviewing the principles personally useful, but many others were likely hearing them for the first time. And even for those who had heard them before, I find there’s still value in hearing different people play the same tune in their own unique voice. ↩
I picked up on using the term “mindset” deliberately and frequently after a chat with an executive who once helped me get hired. Once he said it, I knew that was a concept and a term that bore repeating early and often. (I was surprised I hadn’t thought to do so earlier!) After all, you can have all the knowledge and tools in the world and still be stuck, but with the right mindset, almost anything’s possible. ↩
Originally we tried to come up with Quality Quest levels without directly referencing the exact Test Certified requirements. When that didn’t really work out, I copied the Test Certified requirements from my Testing on the Toilet blog post into a Confluence page. Then I asked everyone “What do we need to change to make this work for Apple?” The version we came up with then proved far more successful. ↩
My former Google colleague Alex Buccino made a good point during a conversation on 2023-02-01 about what “delight” means to a certain class of programmers. He noted that it often entails building the software equivalent of a Rube Goldberg machine for the lulz—or witnessing one built by another. I agreed with him that, maybe for that class of programmers, we should just focus on “clarity” and “efficiency”—which necessarily excludes development of such contraptions. ↩
Unit testing existed before the World Wide Web, and the Test Pyramid has existed for years. (Evidence of the former comes later in this presentation; evidence of the latter is in a Test Pyramid footnote.) We organize individual lines of code via sequence, selection, iteration, functions, classes, modules, etc. At this molecular level, most code is essentially more alike than different. There are plenty of tools, frameworks, documentation, and working examples to help developers write their own automated tests. From this perspective, if one embraces the Quality Mindset, there is no code that’s “too hard to test,” modulo careful refactoring to make it testable.

This doesn’t mean that higher level, emergent properties of an application won’t prove complex and challenging to test, well beyond the scope of smaller tests. This is why we must spare human creativity and bandwidth for such challenges by automating as much lower level testing as possible. (We’re still a very long way from being able to trust artificial intelligence with this, if we ever will be. (Color me skeptical of that.) Even if we could trust AI with such tasks, as professionals and as humans, we’d remain completely responsible for the results, as we are today.)

As for code that’s “too trivial to test,” anyone who’s spent significant time testing their own code doesn’t believe such code actually exists. Maybe early prototypes and very small, straightforward programs that stand alone and aren’t on the critical production path are safe to leave untested. Pure setters that only assign a value with no other logic, and pure getters that only return a value, aren’t worth testing specifically. Beyond that, those of us who test our own code regularly know from experience how often “trivial” code causes tests to fail. In fact, “trivial” code can be amongst the most dangerous to leave untested; it receives less scrutiny because we assume it just works. Add enough “trivial” needles to a haystack of untested code, and you’ll eventually end up with a stack of needles.

“An absence of awareness [of] or belief [in]” these automated testing principles and practices is an impediment to positive change. This is a consequence of one of my favorite concepts from Saul Alinsky’s Rules for Radicals, which I paraphrased in Automated Testing—Why Bother?:

If people believe they lack the knowledge and power to solve a problem, they won’t even think of trying to solve it.

↩
This idea was also inspired by Immunity to Change by Robert Kegan and Lisa Lahey.

Actually…I’ve yet to start the book at the time I’m writing this sentence. I learned about it from Harvard expert on the worst thing about New Year’s resolutions—and how to beat it: ‘A profound loss of energy’ (CNBC, 2022-12-31). That article quotes Lahey’s four step process to “breaking our resistance to change”:
1. Identify your actual improvement goal, and what you’d need to do differently to achieve it.
2. Look at your current behaviors that work against your goal.
3. Identify your hidden competing commitments.
4. Identify big assumptions about how the world works that drive your resistance to change.
Assumptions, until identified, are essentially unspoken or unconscious beliefs. ↩
In the interview from an earlier footnote, The Digital Workplace Is Designed to Bring You Down, Cal Newport makes a relevant observation to this point:

“If we look through the history of the intersection of technology and commerce, we always see something similar, which is: When disruptive technology comes in, it takes a long time to figure out the best way to use it. There’s this case study⁵ from a Stanford economist about the introduction of the electric motor into the factory. He characterizes how long it takes before we figure out what was in hindsight the obvious way to use electric motors in factories, which was to put a small motor in every piece of equipment so I can control my equipment at exactly the level I want to use it. We didn’t do that for 20 or 30 years.”

⁵ Paul A. David’s “Computer and Dynamo: The Modern Productivity Paradox in a Not-Too-Distant Mirror,” published in 1989.

In other words, known solutions still take time to sink in and become so obvious and easy to use that they become common practice. So maybe we’re approaching the tipping point as I write this sentence on January 23, 2023, as unit testing is over 30 years old. ↩
In Software at Scale 53 - Testing Culture with Mike Bland, I discuss with Utsav Shah why effective testing practices haven’t yet caught on everywhere. It seems part of the human condition is that wisdom passed down through the ages still requires that individuals seek it out. Good examples and teachers can help, but those aren’t always accessible to everyone, at least not without some self-motivated effort to find them.

By way of analogy, I mentioned just having read the Bhagavad Gita. In it, the warrior Arjuna struggles with the prospect of going to war against his own family. The supreme being, Krishna, then convinces him that doing so is his duty—which is quite shocking by today’s standards. However, read as only a metaphor for profound internal conflict and doubt that was accessible to the audience of the day, the message is reassuring. But it takes a willfully open mind to derive such value.

On top of that, one of the main lessons is that one should feel attached to doing one’s work—but not to the outcomes. This is a pretty common theme, also running through The Daily Stoic, which I’d also recently finished. Other traditions, notably Buddhism and Taoism, also teach detachment from outcomes and other things beyond one’s control generally.

However, despite this message being developed in multiple ancient cultures and spreading throughout history, tradition, and literature, people still struggle with attachment to this day. The essence of such wisdom isn’t necessarily complicated, but it’s often obscured by other natural preoccupations of both individuals and cultures.

This doesn’t contradict Cal Newport’s observation above on the time it takes for organizations to assimilate new technologies. It perhaps helps explain, at least in part, why it takes so long. ↩
The Egg of Columbus parable is my favorite illustration of this principle.

As the story goes, when Columbus attended a party with Spanish noblemen after discovering the New World, they disparaged his accomplishment as obvious and inevitable. In response, Columbus challenged them to stand an egg on its end. When none of them could, he showed them how, by tapping and slightly breaking one end first.

The point is that some solutions only seem obvious, trivial, and inevitable after someone shows them to you—often directly to you, not related secondhand. Until then, it’s only natural to struggle with certain tasks and to believe them impossible. ↩
Thanks to Scott Boyd for reminding me to emphasize the Test Pyramid as a key component of the testing conversation. ↩
Reproducing my footnote from Automated Testing—Why Bother?: Nick Lesiecki drew the original testing pyramid in 2005. No idea if there was prior art, but he didn’t consult it.

The pyramid was later popularized by Mike Cohn in The Forgotten Layer of the Test Automation Pyramid (2009) and Succeeding with Agile: Software Development Using Scrum (2009). Not sure if Mike had seen the Noogler lecture slide or had independently conceived of the idea, but he definitely was a visitor at Google at the time I was there. ↩
The Testing Grouplet introduced the Small, Medium, Large nomenclature as an alternative to “unit,” “integration,” “system,” etc. This was because, at Google in 2005, a “unit” test was understood to be any test lasting less than five minutes. Anything longer was considered a “regression” test. By introducing new, more intuitive nomenclature, we inspired productive conversations by rigorously defining the criteria for each term, in terms of scope, dependencies, and resources.

The Bazel Test Encyclopedia and Bazel Common definitions use these terms to define maximum timeouts for tests labeled with each size. Neither document speaks to the specifics of scope or dependencies, but they do mention “assumed peak local resource usages.” ↩
Some have advocated for a different metaphor, like the “Testing Trophy” and so on, or for no metaphor at all. I understand the concern that the Test Pyramid may seem overly simplistic, or potentially misleading should people infer “one true test size proportion” from it. I also understand Martin Fowler’s concerns from On the Diverse And Fantastical Shapes of Testing, which essentially argues for using “Sociable vs. Solitary” tests. His preference rests upon the relative ambiguity of the terms “unit” and “integration” tests.

However, I feel this overcomplicates the issue while missing the point. Many people, even with years of experience in software, still think of testing as a monolithic practice. Many still consider it “common sense” that testing shouldn’t be done by the people writing the code. As mentioned earlier, many still think “testing like a user would” is “most important.” Such simplistic, unsophisticated perspectives tend to be resistant to nuance. People holding them need clearer guidance into a deeper understanding of the topic.

The Test Pyramid metaphor (with test sizes) is an accessible metaphor for such people, who just haven’t been exposed to a nonmonolithic perspective on testing. It honors the fact that we were all beginners once (and still are in areas to which we’ve not yet been exposed). Once people have grasped the essential principles from the Test Pyramid model, it becomes much easier to have a productive conversation about effective testing strategy. Then it becomes easier and more productive to discuss sociable vs. solitary testing, the right balance of test sizes for a specific project, etc. ↩
The “confidence” concept in the context of the Test Pyramid was hammered out by Nick Lesiecki, Patrick Doyle, and Dominic Cooney in 2009. ↩
Thanks to Oleksiy Shepetko for mentioning the maintenance cost aspect during my 2023-01-17 presentation to Ono Vaticone’s group at Microsoft. It wasn’t in the table at that time, and adding it afterward inspired this new, broad, comprehensive table layout. ↩
Test doubles are lightweight, controllable objects implementing the same interface as a production dependency. This enables the test author to isolate the code under test and control its environment very precisely via dependency injection.

“Dependency injection” is a fancy term for passing an object encapsulating a dependency as a constructor or function argument. Doing this instead of instantiating or accessing the dependency directly creates a seam, allowing a test double to stand in for the dependency. (Dependency injection frameworks exist for some languages, but while some may find them convenient, they’re not strictly necessary. I’ve never used any, and I inject dependencies like crazy.)

Some of what follows owes a debt to Martin Fowler’s Mocks Aren’t Stubs.

I also defined test doubles on slide 49 of Automated Testing—Why Bother?:

Test doubles are substitutes for more complex objects in an automated test. They are easier to set up, easier to control, and often make tests much faster thanks to the fact that they do not have the same dependencies as real production objects.

The various kinds of test doubles are:
- Dummy: A placeholder value with no bearing on the test other than enabling the code to compile
- Stub: An object programmed to return a hardcoded or trivially computed value
- Spy: A stub that can remember how many times it was called and with which arguments
- Mock: An object that can be programmed to validate expected calls in a specific order, as well as return specific values
- Fake: An object that fully simulates a production dependency using a less complicated and faster implementation (e.g., an in memory database or file system, a local HTTP server)
People often call all test doubles “mocks,” and packages making it easy to implement test doubles are often called “mocking libraries.” This is unfortunate, as mocks should be the last option one should choose.

Mocks can validate expected side effects (i.e., behaviors not reflected in return values or easily observable environmental changes) that other test doubles can’t. However, this binds them to implementation details that can render tests brittle in the face of implementation changes. Tests that overuse mocks in this way are often cited as a reason why people find test doubles and unit testing painful.
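To make the distinctions above a little more concrete, here’s a rough Python sketch of a stub, a spy, and a fake collaborating in one test. Every name in it (StubRates, SpyMailer, FakeUserStore, send_invoice) is made up for illustration. I’ve left mocks out on purpose, in keeping with the advice to reach for them last.

```python
class StubRates:
    """Stub: returns a hardcoded value regardless of input."""

    def rate(self, currency):
        return 2.0


class SpyMailer:
    """Spy: remembers each call and its arguments for later inspection."""

    def __init__(self):
        self.sent = []

    def send(self, to, body):
        self.sent.append((to, body))


class FakeUserStore:
    """Fake: a working, in-memory stand-in for a real user database."""

    def __init__(self):
        self._users = {}

    def save(self, user_id, email):
        self._users[user_id] = email

    def email_for(self, user_id):
        return self._users[user_id]


def send_invoice(user_id, amount_usd, currency, store, rates, mailer):
    """Hypothetical code under test; all dependencies are injected."""
    total = amount_usd * rates.rate(currency)
    mailer.send(store.email_for(user_id), f"You owe {total:.2f} {currency}")
    return total


store = FakeUserStore()
store.save(1, "ada@example.com")
mailer = SpyMailer()
total = send_invoice(1, 10.0, "EUR", store, StubRates(), mailer)
```

The test then inspects the return value and what the spy recorded, without asserting anything about the order or number of calls made to the stub or the fake. That keeps it from binding to implementation details.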

My favorite concrete, physical example of using a test double is using a practice amplifier to practice electric guitar:
- You can practice in relative quiet, using something as small as a Marshall micro amp, a Blackstar Fly 3, or a Mustang micro. You do this before playing with others or getting on stage to make sure you’ve got your own parts working. If anything sounds bad, you know it’s all your fault.
  
  This is analogous to writing small tests, with the practice amp as the test double. You’re figuring out immediately if your own performance meets expectations or needs fixing—without bothering anyone else.
- You can rehearse with your band using a larger, louder amplifier, like a Marshall DSL40CR, Fender Blues Junior IV, or Fender Mustang GTX100. Or perhaps you’d prefer a Marshall Studio Vintage 20 or a Paul Reed Smith HDRX 20. This enables you to work out issues with other players before getting on stage. If you’ve practiced your parts enough and something sounds bad at this point, you know something’s wrong with the band dynamic.
  
  This is analogous to writing medium tests, with the slightly larger amp still acting as a test double. You’re figuring out with your bandmates specific issues arising from working through the material together. You can start, stop, and repeat as often as necessary without burdening the audience.
- You and the band can then run through a soundcheck on stage, making sure everything sounds good together while plugging into your Marshall stacks. Everyone else will be using their production gear, the full sound system, and the lighting rig, in the actual performance space. If the band is well rehearsed but something sounds wrong at this level, you know it’s specific to the integration of the entire system.
  
  This is analogous to writing large tests. You’re using the real production dependencies and running the entire production system. However, this is still happening before the actual performance, giving you a chance to detect and resolve showstopper issues before the performance.
- Finally, you play in front of the audience. Things can still go wrong, and you’ll have to adapt in the moment and discuss afterwards how to prevent repeat issues. However, after all the practicing, rehearsals, and soundchecks, relatively few things could still go wrong, and those are likely unique to actual performance situations.
  
  This is analogous to shipping to production. You can’t expect perfection, and you may discover new issues not uncovered by previous practicing, rehearsing, and testing. However, you can focus on those relatively few remaining issues, since so many were prevented or resolved before this point.
Of course, there are more options than this. There’s nothing saying you couldn’t use any of these amplifiers in any other situation—you could use, say, the Fender Mustang GTX100 for everything. It can even plug directly into the mixing deck and emulate a mic’d cabinet. But hopefully the point of the analogy remains clear: The common interface gives you the freedom to swap implementations as you see fit.

The only question is, what kind of “test double” is a practice amplifier? Based on the definitions above, my money’s on calling it a “fake.” It’s a lighter weight implementation of the full production dependency, with essentially the same interface, modulo EQ and volume controls and exact vacuum tube reactivity. Even so, for most practical purposes when it comes to practicing, it’s close enough, and there’s no interface for preprogramming responses.

(I used images of Marshall stacks vs. a Marshall micro amp on slide 49 of Automated Testing—Why Bother?, but didn’t write them into the narrative.) ↩
Shoutout to Simon Stewart for being a vocal advocate of shorter feedback loops. See his Dopamine Driven Development presentation. ↩
Shoutout to Francisco Candalija for bringing contract and collaboration tests to my attention. He influenced how I now think and talk about medium/integration tests and my own “internal API” concept. (Some of the below I also described in an email to my former Google colleague Alex Buccino on 2022-12-23.)

Contract tests essentially answer the question: “Did something change that’s beyond my control, or did I screw something up?” while narrowing potential sources of error. They also help ensure our test doubles remain faithful to production behavior.

I like thinking of contract tests in this way rather than how Pact defines them, even though the Pact definition is very popular. Writing a contract test quickly using a special tool and calling it a day can provide a false sense of confidence. Such tests are prone to become brittle and flaky if one doesn’t consider how they support the overall architecture and testing strategy.

An “internal API” is a wrapper that’s kind of a superset of Proxy and Adapter from Design Patterns. It’s an interface you design within your project that translates an external (or complicated internal) dependency’s language and semantics into your own custom version. Using your own interface insulates the rest of your code from directly depending on the dependency’s interface.

One very common example is creating your own Database object that exposes your own “ideal” Database API to the rest of your app. This object encapsulates all SQL queries, external database API calls, logging, error handling, and retry mechanisms, etc. in a single location. This obviates the need to pepper these details throughout your own code.

What this means is:
- The internal API introduces a seam enabling you to write many more fast, stable, small tests for your application via dependency injection and test doubles. (Michael Feathers introduced the term “seam” in Working Effectively with Legacy Code.) This makes the code and the tests easier to write and to maintain, since all the tests no longer become integration tests by default.
- You do still need to test your API implementation against the real dependency—but now you have only one object to test using a medium/integration test. This would be your contract test.
- Any integration problems with a particular dependency are detected by one test, rather than triggering failures across the entire suite. This improves the signal to noise ratio while tightening the feedback loop, making it faster and easier to diagnose and repair the issue.
- The contract test makes sure any test doubles based on the same interface as the internal API wrapper are faithful to production. If a contract test fails in a way that invalidates your internal API, you’ll know to update your API and test doubles based on it.
- If you want to upgrade or even replace a dependency, you have one implementation to update, not multiple places throughout the code. This protects your system against revision or vendor lock in.
- In fact, you can add an entirely new class implementing the same interface and configure which implementation to use at runtime. This makes it easy and safe to try the old and new implementations without major surgery or risk.
For all these reasons, combining internal APIs with contract tests makes your test suite faster, more reliable, and easier to maintain.
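Here’s one way the pieces above could fit together, sketched in Python. The UserDb interface, both implementations, and check_user_db_contract are hypothetical names invented for this example. SQLite stands in for the “real” dependency only to keep the sketch self-contained; in practice the contract test would run against whatever the production dependency actually is.

```python
import sqlite3
from typing import Optional, Protocol


class UserDb(Protocol):
    """Hypothetical internal API: the only database interface the app sees."""

    def add_user(self, name: str, email: str) -> None: ...

    def email_for(self, name: str) -> Optional[str]: ...


class SqlUserDb:
    """Implementation that keeps all SQL in one place."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "CREATE TABLE IF NOT EXISTS users "
                "(name TEXT PRIMARY KEY, email TEXT)"
            )

    def add_user(self, name: str, email: str) -> None:
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO users (name, email) VALUES (?, ?)",
                (name, email),
            )

    def email_for(self, name: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT email FROM users WHERE name = ?", (name,)
        ).fetchone()
        return row[0] if row else None


class InMemoryUserDb:
    """Fake used by small tests of the rest of the application."""

    def __init__(self) -> None:
        self._users = {}

    def add_user(self, name: str, email: str) -> None:
        self._users[name] = email

    def email_for(self, name: str) -> Optional[str]:
        return self._users.get(name)


def check_user_db_contract(db: UserDb) -> None:
    """Contract test: every implementation must pass the same assertions."""
    assert db.email_for("ada") is None
    db.add_user("ada", "ada@example.com")
    assert db.email_for("ada") == "ada@example.com"
    db.add_user("ada", "lovelace@example.com")  # same name overwrites
    assert db.email_for("ada") == "lovelace@example.com"
```

Running check_user_db_contract against both SqlUserDb and InMemoryUserDb is what keeps the fake honest. If the two ever disagree, the contract test fails in exactly one place.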

A concrete example: Like many languages, Python provides a common DBAPI. This enables you to use a local, in memory database (typically SQLite) to fake (i.e. stand in for) a production database.

I did this not long ago for some Python code that threw a DBAPI error in production every few days, locking up our server fleet:
- Though we used Postgres in prod, I could simulate the same DBAPI error in a test on my desk by using the standard sqlite3 module.
- I reproduced the bug, in which the system didn’t abort a failed transaction due to a dropped connection, blocking further operations.
  
  I wouldn’t call the test “small” or “medium,” but “small-ish.” It was as small a contract test as you could get, and while it wasn’t super fast, it was quite quick.
- I fixed the bug—and the test—by introducing a Database abstraction that implemented a rollback/reconnect/retry mechanism. The relatively small size, low complexity, and quick speed of the test enabled me to iterate quickly on the solution.
  
  (I also set a one hour timeout on database connections. This alone might’ve resolved the problem, but it was worth adding the new abstraction that provably resolved the problem.)
- I shipped the fix—and bye bye production error! I kept monitoring the logs and never saw it happen after that.
This contract test enabled me to define an internal Database API based on the Python DBAPI. The DBAPI ensures that the Database API can be reused—and tested—with different databases that conform to its specifications. The rest of our code, now using the new Database object, could be tested more quickly using test doubles. So long as the contract test passes, the test doubles should remain faithful substitutes. And if we wanted to switch from Postgres to another production database, likely none of our code would’ve had to change.
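For illustration only, a rollback/reconnect/retry wrapper along these lines might look like the Python below. This is not the code from that incident. It’s a sketch under several assumptions: the Database class shape, the connect factory argument, and the choice of which exceptions to treat as a dropped connection are all mine. Closing the sqlite3 connection out from under the wrapper stands in for the dropped connection.

```python
import sqlite3


class Database:
    """Hypothetical wrapper that rolls back, reconnects, and retries."""

    def __init__(self, connect, retries=1):
        self._connect = connect  # zero-argument factory for DBAPI connections
        self._retries = retries
        self._conn = connect()

    def execute(self, sql, params=()):
        attempt = 0
        while True:
            try:
                cursor = self._conn.cursor()
                cursor.execute(sql, params)
                # description is None for statements that return no rows.
                rows = cursor.fetchall() if cursor.description else []
                self._conn.commit()
                return rows
            except (sqlite3.OperationalError, sqlite3.ProgrammingError):
                # Assumption: these stand in for "connection dropped."
                self._recover()
                if attempt >= self._retries:
                    raise
                attempt += 1

    def _recover(self):
        """Abandon the failed transaction and open a fresh connection."""
        try:
            self._conn.rollback()
            self._conn.close()
        except sqlite3.Error:
            pass  # the old connection is already unusable
        self._conn = self._connect()


connections = []


def connect():
    conn = sqlite3.connect(":memory:")
    connections.append(conn)
    return conn


db = Database(connect)
first = db.execute("SELECT 1")
connections[-1].close()  # simulate the dropped connection
second = db.execute("SELECT 41 + 1")  # succeeds on a new connection
```

A production version would catch the driver’s own DBAPI exception classes rather than sqlite3’s, and would likely want logging and a backoff between retries.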

The contract test did require some subtle setup and comments explaining it. Still, dealing with one such test and object under test beats the hell out of dealing with one or more large system tests. And it definitely beats pushing a “fix” and having no idea whether it stands a chance of holding up in production! ↩
I deliberately avoid saying which specific proportion of test sizes is appropriate. The shape of the Test Pyramid implies that one should generally try to write more small tests, fewer medium tests, and relatively few large tests. Even so, it’s up to the team to decide, through their own experience, what the proportions should be to achieve optimal balance for the project. The team should also continue to reevaluate that proportion continuously as the system evolves, to maintain the right balance.

I also have scar tissue regarding this issue thanks to Test Certified. Intending to be helpful, we suggested a rough balance of 70% small, 20% medium, and 10% large as a general target. It was meant to be a rule of thumb, and a starting point for conversation and goal setting—not “The One True Test Size Proportion.” But OMG, the debates over whether those were valid targets, and how they were to be measured, were interminable. (Are we measuring individual test functions? Test binaries/BUILD language targets like cc_test? Googlers, at least back then, were obsessed with defining precise, uniform measurements for their own sake.)

On the one hand, lively, respectful, constructive debate is a sign of a healthy, engaged, dynamic community. However, this particular debate—as well as the one over the name “Test Certified”—seemed to miss the point, amounting to a waste of time. We just wanted teams to think about the balance of tests they already had and needed to achieve, and to articulate how they measured it. It didn’t matter so much that everyone measured in the exact same way, and it certainly didn’t matter that they achieve the same test ratios. It only mattered that the balance was visible within each individual project—and to the community, to provide inspiration and learning examples.

Consequently, while designing Quality Quest at Apple, we refrained from suggesting any specific proportion of test sizes, even as a starting point. The language of that program instead emphasized the need for each team to decide upon, achieve, and maintain a visible balance. We were confident that creating the space for the conversation, while offering education on different test sizes (especially smaller tests), would lead to productive outcomes. ↩
“Flaky” means that a test will seem to pass or fail randomly without a change in its inputs or its environment. A test becomes flaky when it’s either validating behavior too specific for its scope, or isn’t adequately controlling all of its inputs or environment—or both. Common sources of flakiness include system clocks, external databases, or external services accessed via REST APIs.

A flaky test is worse than no test at all. It conditions developers to spend the time and resources to run a test only to ignore its results. Actually, it’s even worse—one flaky test can condition developers to ignore the entire test suite. That creates the conditions for more flakiness to creep in, and for more bugs to get through, despite all the time and resources consumed.

In other words, one flaky test that’s accepted as part of Business as Usual marks the first step towards the Normalization of Deviance.

There are three useful options for dealing with a flaky test:
1. If it’s a larger test trying to validate behavior too specific for its scope, relax its validation, replace it with a smaller test, or both.
2. If what it’s validating is correct for its scope, identify the input or environmental factor causing the failure and exert control over it. This is one of the reasons test doubles exist.
3. If you can’t figure out what’s wrong or fix it in a reasonable amount of time, disable or delete the test.
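As a minimal sketch of option 2, assume a hypothetical `is_expired` function whose only uncontrolled input is the system clock. The function name, the `now` parameter, and the dates are all made up for illustration. Passing the time in lets the test pin it to a fixed value, so the result no longer depends on when the test happens to run:

```python
import unittest
from datetime import datetime, timedelta, timezone


def is_expired(issued_at, ttl, now=None):
    """Return True once `ttl` has elapsed since `issued_at`.

    `now` is a hypothetical injection point: production callers omit it
    and get the system clock, while tests pass a fixed value instead.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - issued_at >= ttl


class IsExpiredTest(unittest.TestCase):
    ISSUED = datetime(2023, 1, 10, 12, 0, tzinfo=timezone.utc)
    TTL = timedelta(minutes=30)

    def test_not_expired_just_before_ttl(self):
        # The clock is pinned one second before the deadline.
        now = self.ISSUED + self.TTL - timedelta(seconds=1)
        self.assertFalse(is_expired(self.ISSUED, self.TTL, now=now))

    def test_expired_exactly_at_ttl(self):
        # The clock is pinned exactly at the deadline.
        now = self.ISSUED + self.TTL
        self.assertTrue(is_expired(self.ISSUED, self.TTL, now=now))
```

Running this with `python -m unittest` should give the same result on every run, because the test controls the one input that used to vary. The same idea applies to databases and REST services, where a test double stands in for the external dependency.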
Retrying flaky tests is NOT a viable remedy. It’s a microcosm of what I call in this presentation the “Arms Race” mindset. Think about it:
- Every time a flaky test fails, it’s consuming time and resources that could’ve been spent on more reliable tests.
- Even if a flaky test fails on every retry, people will still assume the test is unreliable, not their code, and will merge anyway.
- Increasing retries only consumes more resources while enabling people to continue ignoring the problem when they should either fix, disable, or delete the test.
- Bugs will still slip through, introduce risk, and create rework even after all the resources spent on retries.
↩
The last thing you want to do with a flaky or otherwise consistently failing test is mark it as a “known failure.” This will only consume time and resources to run the test and complicate any reporting on overall test results.

Remember what tests are supposed to be there for: To let you know automatically that the system isn’t behaving as expected. Ignoring or masking failures undermines this function and increases the risk of bugs—and possibly even catastrophic system failure.

Assume you know that a flaky or failing test needs to be fixed, not discarded. If you can’t afford to fix it now, and you can still afford to continue development regardless, then disable the test. This will save resources and preserve the integrity of the unambiguous pass/fail signal of the entire test suite. Fix it when you have time later, or when you have to make the time before shipping.
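Most test frameworks have a way to disable a test explicitly while leaving a visible record of why. Here is a minimal sketch using Python’s standard `unittest.skip` decorator. The test class, the test names, and the reason string are hypothetical:

```python
import unittest


class CheckoutFlowTest(unittest.TestCase):
    # The reason string should say why the test is disabled, so the
    # decision stays visible in every test report.
    @unittest.skip("Flaky: depends on wall-clock time; disabled pending fix")
    def test_order_times_out(self):
        self.fail("never executed while the test is disabled")

    def test_empty_cart_total_is_zero(self):
        # The rest of the suite keeps its unambiguous pass/fail signal.
        self.assertEqual(sum([]), 0)
```

The disabled test costs nothing to run and shows up in the report as skipped, along with its reason, rather than as a failure everyone has learned to ignore.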

Note I said “if you can still afford to continue development,” not “if you must continue development.” If you continue development without addressing problems you can’t afford to set aside, it will look like willful professional negligence should negative consequences manifest. It will reflect poorly on you, on your team, and on your company.

Also note I’m not saying all failures are necessarily worthy of stopping and fixing before continuing work. The danger I’m calling out is assuming most failures that aren’t quickly fixable are worth setting aside for the sake of new development by default. Such failures require a team discussion to determine the proper course of action—and the team must commit to a clear decision. The failure to have that conversation or to commit to that clear decision invites the Normalization of Deviance and potentially devastating risks. ↩
Frequent demos can be a very good thing—but not when making good demos is appreciated more than high internal software quality and sustainable development. ↩
I’ve seen comments on LinkedIn recently (as of 2023-09-18) alluding to severely misguided developer productivity measurement guidelines published by McKinsey. I don’t feel the need to go to the source or get embroiled in this controversy. However, it’s been well known to software practitioners for years that attempts to objectively measure software development productivity are fundamentally flawed. Whatever objective quantum of output anyone tries to measure, people will immediately work to game that metric to their benefit.

Consequently, when I speak about using Vital Signs to reflect productivity, I’m not talking about measuring developer productivity directly. The suite of Vital Signs can help a team ensure all the forces acting on the project and their observable consequences are in balance. Often these consequences aren’t in terms of direct output, but in terms of drag: bug counts; build and test running times and failure rates; etc. Once a team has landed on a good set of signals, any imbalance implies a potential negative impact on productivity. No one metric should dominate, making the entire suite practically impossible to game.

Also, by encouraging the team and any project managers and executive sponsors to design the suite together, Vital Signs aim to balance everyone’s practical concerns. Everyone has a legitimate perspective and set of incentives, but everyone needs to communicate clearly with one another to ensure proper team alignment and performance. When members of any particular role try unilaterally to impose measurements on the others, that adversarial relationship will produce exploitable gaps, producing failure and suffering.

As a concrete example, I can share some of my experience working on a web search team at Google. Our manager maintained a Google Sheets document with several sheets that we reviewed with our Site Reliability Engineering partners during our weekly status meetings. Every week followed roughly this pattern:
- Review all production alerts and incidents. Decide on what to do about each one, e.g., tune the alert, provision more CPU/RAM/disk resources or service instances, update code.
- Review last deployment of every system microservice, and determine if any need to be recompiled and rereleased to keep them current.
- Review list of features under development. Identify which are complete, which are still under development, and what to do if any are blocked.
- Propose new features or maintenance changes to plan for the upcoming week(s).
Each one of these items would either have or receive at least one owner. Many times an SRE member and a development team member would share ownership of items, particularly those to resolve or prevent production issues.

Though there wasn’t an executive sponsor directly involved per se, we were all working within the Google-wide Objectives and Key Results framework. All of our provisioning, reliability, and feature delivery tasks were chosen with our OKRs in mind. We didn’t explicitly discuss build and test issues in these meetings, either, because we were all constantly very responsive to our continuous build. Plus, everyone took turns on “operator duty,” during which one was production “operator” and “build cop” for the week. The operator responded to any production incidents in partnership with the designated SRE member for that week. They also ensured any continuous build failures or other issues were resolved ASAP, via rollbacks or other fixes.

The key point is that everyone was involved in discussing and deciding upon every issue, feature, and task. Communication was open and constant, and responsibility and decision making were widely shared, while ensuring each task had an owner. There was no hiding or gaming any specific metrics, because everything was perfectly visible and everyone was bought into ensuring constant balance. ↩
I’ve called this concept of collecting signals to inform decision making “Vital Signs” because I believe “data-driven decision making” has lost its meaning. As often happens with initially useful innovations, the term “data-driven decision making” has become a buzzword. It’s a sad consequence of a misquote of W. Edwards Deming, an early pioneer of data-driven decision making, who actually said:

“It is wrong to suppose that if you can’t measure it, you can’t manage it—a costly myth.”

—The New Economics, Chapter 2, “The Heavy Losses”

Over time, this got perverted to “If you can’t measure it, you can’t manage it.” (The perversion is likely because people know him as a data advocate, and are ignorant of the subtlety of his views.)

Many who vocally embrace data-driven decision making today tend to put on a performance rather than apply the principle in good faith. They tend to want to let the data do the deciding for them, absolving them of professional responsibility to thoughtfully evaluate opportunities and risks. It’s a ubiquitously accepted Cover Your Ass rationale, a shield offering protection from the expectation of ever having to take any meaningful action at all. It’s also a hammer used to beat down those who would take such action, especially new, experimental action lacking up-front evidence of its value. Even so, often “the data shows” that we should do nothing, or do something stupid or unethical. This holds even when other salient, if less quantifiable signals urge action, or a different course of action.

My favorite metaphor for the CYA function of “data-driven decision making” is “Watermelon Status,” a report that’s green on the outside, red on the inside. (A former colleague, who I believe would wish to remain anonymous, introduced me to this concept.) This is a phenomenon whereby people close to the actual project work report a “red” status, signifying trouble. However, layers of management edit and massage the “data” such that the status appears “green” to higher level management, signifying all is well. That’s what the “decision makers” want to hear, after all.

As such, allegiance to “data-driven decision making” tends to encourage Groupthink and to produce obstacles to meaningful change. In contrast, “Vital Signs” evokes a sense of care for a living system, and a sense of commitment to ensuring its continued health. It implies we can’t check a box to say we’ve collected the data and can take system quality and health for granted. We have to keep an eye on our system’s Vital Signs, and maintain responsibility for responding to them as required.

My visceral reaction arises from all the experiences I’ve had (using Crossing the Chasm terminology) with Late Majority members lacking courage and Laggards resisting change. I’ll grant that the Late Majority may err on the side of caution, and once they’re won over, they can become a force for good. But Laggards feel threatened by new ideas and try to use data, or the lack thereof, as a weapon. Then when you do produce data and other evidence, they want to move the goalposts.

The Early Majority is a different story altogether. I’ve had great experiences with Early Majority members who were willing to try a new approach to testing and quality, expecting to see results later. Once we made those results visible, it justified further investment. This is why it’s important to find and connect with the Early Majority first, and worry about the Late Majority later—and the Laggards never, really. ↩
I’m often asked if teams should always achieve 100% code coverage. My response is that one should strive for the highest code coverage possible. This could possibly be 100%, but I wouldn’t worry about going to extreme lengths to get it. It’s better to achieve and maintain 80% or 90% coverage than to spend disproportionate effort to cover the last 10% or 20%.

That said, it’s important to stop looking at code coverage as merely a goal—use it as a signal that conveys important information. Code coverage doesn’t show how well tested the code is, but how much of the code isn’t exercised by small(-ish) tests at all.

So it’s important to understand clearly what makes that last 10% or 20% difficult or impractical to cover—and to decide what to do about it. Is it dead code? Or is it a symptom of poor design—and is refactoring called for? Is there a significant risk to leaving that code uncovered? If not, why keep it?

Another benefit to maintaining high coverage is that it enables continuous refactoring. The Individual skill acquisition section expands on this. ↩
As the linked page explains, the “R” in “MTTR” can also stand for “Repair,” “Recovery,” or “Respond.” However, I like to suggest “Resolve,” because it includes response, repair, recovery, and a full follow through to understand the issue and prevent its recurrence. ↩
Shoutout again to Simon Stewart and his Dopamine Driven Development presentation. ↩
SonarQube is a popular static analysis platform, but I’m partial to Teamscale, as I happen to know several of the CQSE developers who own it. They’re really great at what they do, and are all around great people. They provide hands-on coaching and support to ensure customers are successful with the system, which they’re constantly improving based on feedback. I’ve seen them in action, and they deeply understand that it’s the tool’s job to provide insight that facilitates ongoing conversations.

(No, they’re not paying me to advertise. I just really like the product and the people behind it.)

I also like to half-jokingly say Teamscale is like an automated version of me doing your code review—except it scales way better. The more the tool automatically points out code smells and suggests where to refactor, the more efficient and more effective code reviews become. ↩
I can’t remember where I got the idea, but it’s arguably better to develop a process manually before automating it. In this way, you carefully identify the value in the process, and which parts of it would most benefit from automation. If you start with automation, you’re not starting from experience, and people may resent having to use tools that don’t fit their actual needs. This applies whether you’re building or buying automation tools and infrastructure.

Of course, if you have past experience and existing, available tools, you can hit the ground running more quickly. The point is that it’s wasteful to wait for automation to appear when you could benefit from a process improvement now, even if it’s manual. ↩
In a sidebar, he mentions that cruft and technical debt are basically the same, but that isn’t shown on his graph. ↩
In my talk Automated Testing—Why Bother?, I go into several more reasons why automated testing helps developers understand the system, particularly when responding to failures. These include better managing the focusing illusion, the Zeigarnik effect, the orienting response, and the OODA Loop. (I learned about all of these except for the OODA Loop from Dr. Robert Cialdini’s Pre-Suasion: A Revolutionary Way to Influence and Persuade.) ↩
This concept of a “buffer” comes from Greg McKeown’s Essentialism: The Disciplined Pursuit of Less. ↩
This seems somewhat ironic, since he invited me to publish Goto Fail, Heartbleed, and Unit Testing Culture on his website in 2014. It doesn’t focus only on the professionalism angle, but it emphasizes it heavily. He published the “Cost” article in 2019, reflecting an apparent evolution in his thinking.

I’m not criticizing Martin, or his argument—I’m rather grateful he came up with this brilliant angle, and explained it so thoughtfully and clearly. It’s incredibly helpful to move the conversation forward. I’m just not willing to abandon the “moralistic” appeal to professionalism, either. We need both.

In fact, I’d claim that a sense of professionalism necessarily precedes sound economic arguments in general. Raw economics doesn’t care about professionalism, but pragmatic professionals have to find a way to align the economics with their professional standards. That’s exactly what Martin did with this article.

Also, though he didn’t explicitly state this, it’s possible he meant “professionalism” in terms of “quality for its own sake” or “pride in one’s work.” Whereas the angle in my article, and in the slides to follow, is “professionalism” in terms of social responsibility, which also has an economic impact. I do believe in quality for its own sake and having pride in one’s work, but that’s not the appeal I tend to make, either.

All of this said, Robert Greene’s The 48 Laws of Power advises (emphasis mine):

Law 13: When asking for help, appeal to people’s self-interest, never to their mercy or gratitude

Though that title speaks specifically about gaining someone’s favor, the general principle of appealing to someone’s self-interest to motivate their behaviors holds. That said, the “Reversal” section at the end of the chapter on Law 13 states:

You must distinguish the differences among powerful people and figure out what makes them tick. When they ooze greed, do not appeal to their charity. When they want to look charitable and noble, do not appeal to their greed.

My interpretation of this principle in this context is: Don’t go all in on either the economic argument or appeals to professionalism. Use both; presented well, I think they reinforce one another.

So while I understand why Martin has taken the position he has, I’m slightly saddened by it. Or, if he’s responding to certain aspects of professionalism without distinguishing them from the others, I’m only sad that he was uncharacteristically unclear on that point. Morals aren’t the only concern, but neither should economics be. Alignment between them, rather than abandonment of one for the other, yields the best outcomes. ↩
Automated Testing—Why Bother? examines a few more reasons. It includes this quote from The Rainbow of Death:

People mostly had no experience with testing outside of the slowness and brittleness of the status quo, and were under constant delivery pressure while feeling intimidated by many of their peers. Who could blame them for not testing when they couldn’t afford the time to learn?

Economists call this “temporal discounting”. Basically, if someone’s presented with an option to push a feature now without tests, or prevent a problem in the future (that may or may not happen) through an investment in testing, they’ll tend to ship and hope for the best. Combined with the fact that the ever-slowing tools made it impossible to reach a state of flow, this combination of immediate pain and slow feedback in pursuit of a distant, unclear benefit made the “right” thing way harder than it needed to be.

↩
Shortly after joining one team, I presented to my teammates my vision for improved testing adoption across the company and what it would take. One of my teammates said to me in this meeting, “…but unit testing is easy!” Caught off guard, my immediate impulse—which I didn’t catch in time—was to laugh out loud at this statement. I immediately apologized and explained that, yes, it isn’t that hard once you get used to it—but many haven’t yet learned good basic techniques. (I summarized some of these in the “Individual Skill Acquisition” slides.)

Of course, my apology meant nothing—the damage was done. This teammate and I never ended up really seeing eye to eye. Per the Crossing the Chasm model covered elsewhere in this talk, I moved on rather than continuing to engage with this Laggard. ↩
Automated Testing—Why Bother? also mentions relevant reasons from the social psychology research of Dr. Robert Cialdini. From that talk, in reference to Cialdini’s Influence: The Psychology of Persuasion and Pre-Suasion: A Revolutionary Way to Influence and Persuade:
- Social proof: We follow established norms that we perceive in the behavior of others
- Authority (Authoritah): We permit others to set our priorities and do as we’re told
- Scarcity: We act out of the fear of a closing opportunity window
- Unity: We act in the perceived best interest of others with whom we have a relationship
So if you’re on a team where testing isn’t the norm, and your boss is expecting you to meet a deadline—especially if your feature is critical to the success of the project, and/or you know you have a promotion at stake—you aren’t likely to write automated tests if you haven’t written any before. Whether you feel like writing tests would leave you vulnerable to the wrath of your manager or that of your team, or you haven’t had any interest to begin with, these forces have the effect of reinforcing the status quo.
↩
I have to admit, this rant was inspired by coming across Tim Bray’s Testing in the Twenties. (I found it by way of Martin Fowler’s On the Diverse And Fantastical Shapes of Testing, which I cited in a Test Pyramid footnote.) I strongly agree with the article for the most part (especially the “Coverage data” section), but it shits the bed with the “No religion” comments. I even agree with the main points contained in those comments. However, setting them up in opposition to “religion,” “ideology,” “pedantic arm-waving,” “TDD/BDD faith,” etc., brings an unnecessarily negative emotional charge to the argument. It would be much stronger, and more effective, without them.

Note that Bray’s article is strongly in favor of developers writing effective automated tests. That said, painting people who talk about test doubles and practice TDD as belonging to an irrational tribe (while implying one’s own superiority) is harmful. I’m sorely disappointed that this otherwise magnificent barrel full of wine contains this spoonful of sewage. (A saying I got from the “A Spoonful of Sewage” chapter of Beautiful Code.) ↩
I first learned about this concept from an Apple internal essay on the topic. ↩
The full title of chapter six is Chapter VI: An Accident Rooted in History. The data comes from [129-131] Figure 2. O-Ring Anomalies Compared with Joint Temperature and Leak Check Pressure. It lists 25 Space Shuttle launches, ending with STS 51-L. It indicates O-ring anomalies (erosion or blow-by) in 17 of the 24 launches (71%) prior to STS 51-L. In the 17 missions prior, starting with STS 41-B on 1984-02-03, there were 14 anomalies (82%). ↩
From Chapter VII: The Silent Safety Program, excerpts from “Trend Data” [155-156]:

As previously noted, the history of problems with the Solid Rocket Booster O-ring took an abrupt turn in January, 1984, when an ominous trend began. Until that date, only one field joint O-ring anomaly had been found during the first nine flights of the Shuttle. Beginning with the tenth mission, however, and concluding with the twenty-fifth, the Challenger flight, more than half of the missions experienced field joint O-ring blow-by or erosion of some kind….

This striking change in performance should have been observed and perhaps traced to a root cause. No such trend analysis was conducted. While flight anomalies involving the O-rings received considerable attention at Morton Thiokol and at Marshall, the significance of the developing trend went unnoticed. The safety, reliability and quality assurance program, of course, exists to ensure that such trends are recognized when they occur….

Not recognizing and reporting this trend can only be described, in NASA terms, as a “quality escape,” a failure of the program to preclude an avoidable problem. If the program had functioned properly, the Challenger accident might have been avoided.

↩
The NASA Office of Safety & Mission Assurance site has other interesting artifacts, including:
- Significant Incidents & Close Calls in Human Spaceflight
- Lessons from Challenger, 2021-01-04
This latter artifact is a powerfully concise distillation of lessons from the Rogers report. A couple of excerpts:
Pre-Launch
- Launch day temperatures as low as 22°F at Kennedy Space Center.
- Thiokol engineers had concerns about launching due to the effect of low temperature on O-rings.
- NASA Program personnel pressured Thiokol to agree to the launch.
Lessons Learned
- We cannot become complacent.
- We cannot be silent when we see something we feel is unsafe.
- We must allow people to come forward with their concerns without fear of repercussion.
↩
If you check out the Wilcutt and Bell presentation, and follow the “Symptoms of Groupthink” Geocities link, do not click on anything on that page. It’s long since been hacked. ↩
A further thought: Trusting tools like compilers to faithfully translate high-level code to machine code, and to optimize it, is one thing. Compilers are largely deterministic and relatively well understood. AI models are quite another: a far more inscrutable, far less trustworthy instrument.

Another thought: In David Marquet’s short talk on “Greatness”, he explains what he calls “the two pillars of giving control:”
1. Technical Competence: Is it safe?
2. Organizational Clarity: Is it the right thing to do?
Maybe one day we’ll trust AI with the first question. I’m not so sure we’ll ever be able to trust it with the second. ↩
Feynman’s entire appendix is worth a read, but here’s another striking passage foreshadowing Wilcutt and Bell’s “lack of bad outcomes” assertion:

There are several references to flights that had gone before. The acceptance and success of these flights is taken as evidence of safety. But erosion and blow-by are not what the design expected. They are warnings that something is wrong. The equipment is not operating as expected, and therefore there is a danger that it can operate with even wider deviations in this unexpected and not thoroughly understood way. The fact that this danger did not lead to a catastrophe before is no guarantee that it will not the next time, unless it is completely understood. When playing Russian roulette the fact that the first shot got off safely is little comfort for the next.

↩
These next two statements defining “culture” are my paraphrase of a concept I discovered from an Apple internal essay. ↩
I learned this principle from an Apple internal essay. ↩
Like a lot of people, I learned many of these aspects of the power of asking good questions from Michael Bungay Stanier’s The Coaching Habit.

Also, thanks to Wolfgang Trumler for reminding me of the power of asking people what to do versus telling them what to do. ↩
This insight was inspired by discussion of the Situational Leadership II® model described in Ken Blanchard’s Leadership and the One Minute Manager. ↩
This has been my tagline for years. I think I originally used it in the text of The Rainbow of Death from March 2017. ↩
If you were paying attention to the description of the narrative arc—or at least to the agenda at the beginning—you should’ve seen this part coming! ↩
I was inspired to create this list after reading The Leadership Campaign chapter “Step 6: DEFINE Everything.” ↩