Mike Bland

Making Software Quality Visible

Copyright 2023, licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

This presentation, Making Software Quality Visible, is my calling card. It describes my approach to driving software quality improvement efforts, especially automated testing, throughout an organization—applying systems thinking and leadership to influence its culture.

If you’d be interested in my help, please review the slides and script below. If my approach aligns with your sense of what your organization needs, please reach out to me at mbland@acm.org. (Also see my Hire me! page for more information.)

Artifacts

This Google Drive folder contains a recently reimagined, much shorter version.

This is the latest, most complete version of the slides. Click Open in Keynote to view the presenter’s notes and download a copy of the Keynote file.

This Google Drive folder contains a copy of the original Keynote presentation file as well as a PDF.

Abstract

We’ll discuss why internal software quality matters, why it’s often unappreciated and sacrificed, and what we can do to improve it. More to the point, we’ll discuss the importance of instilling a quality culture to promote the proper mindset first. Only on this foundation will seeking better processes, better tools, better metrics, or AI-generated test cases yield the outcomes we can live with.

Introduction

I’m Mike Bland, I’m a programmer, and I’m going to talk about how making software quality visible

Software Quality must be visible to minimize suffering.

…will minimize suffering.1 By “suffering,” I mean: The common experience of software that’s painful—or even dangerous—to work on or to use. By “making software quality visible,” I mean: Providing meaningful insight into quality outcomes and the work necessary to achieve them.

Quality work can be hard to see. It’s hard to value what can’t be seen—or to do much of anything about it.

This is important because it’s often difficult to show quality work, or its impact on processes or products.2 How do we show the value of avoiding problems that didn’t happen? This makes it difficult to prioritize or justify investments in quality, since people rarely value work or results they can’t see. Plus, people can’t effectively solve problems they can’t accurately sense or understand.

Agenda

  • Google and Apple Examples
    I’ll share high-level examples of what I mean by making quality work visible to minimize suffering, based on my experience at Google and Apple.

  • What Software Quality Is and Why It Matters
    We’ll define internal software quality and explain why it’s just as essential as external.

  • Why Software Quality Is Often Unappreciated and Sacrificed
    We’ll examine several psychological and cultural factors that are detrimental to software quality.

  • The Test Pyramid and Vital Signs
    We’ll use the Test Pyramid model to specify the principles underlying a sound testing strategy. We’ll also discuss the negative effects that its opposite, the Inverted Test Pyramid, imposes upon its unwitting victims. I’ll then describe how to use “Vital Signs” to get a holistic view on software quality for a particular project.

  • Building a Software Quality Culture
    We’ll learn how to integrate the Quality Mindset into organizational culture, through individual skill development, team alignment, and making quality work visible.

  • Calls to Action
    We’ll finish with specific actions everyone can take to lead themselves and others towards making software quality work and its impact visible.

Google and Apple Examples

Now here are some very brief summaries of my experiences with making quality work and its impact visible.

Google: Testing Grouplet

2005-2010, mike-bland.com/the-rainbow-of-death

At Google I was part of the Testing Grouplet, volunteers dedicated to driving automated testing adoption. I tell the full story in my talk “The Rainbow of Death.” Here are a few highlights:3

  • Rapid growth, hiring the “best of the best,” build/test tools not scaling
    When I joined in 2005, the company was growing fast,4 and we knew we were “the best of the best.” However, our build and testing tools and infrastructure weren’t keeping up.

  • Lack of widespread, effective automated testing and continuous integration; frequent broken builds and “emergency pushes” (deployments)
    Developers weren’t writing nearly enough automated tests, the ones they wrote weren’t that good, and few projects used continuous integration. As a result, code frequently failed to compile, and errors that made it to production would frequently lead to “emergency pushes,” or deployments.

  • Training, GWS Story, Test Certified, Test Mercenaries, Testing on the Toilet
    Over time, the Testing Grouplet trained new hires, shared the Google Web Server story, and developed the Test Certified roadmap program. We built the Test Mercenaries internal consulting team, and our biggest hit was our Testing on the Toilet newsletter, appearing weekly in every company bathroom.

  • Two Testing Fixits, Revolution (Build Tools) Fixit, Test Auto. Platform Fixit
    We also ran several “Fixits,” companywide events to address “important but not urgent” issues. Our Fixits inspired people to write and fix tests, to adopt new build tools, and finally, to adopt the Test Automation Platform continuous integration system.

  • Better tests, practices, build and test times; fewer bugs, more productivity
    These efforts made quality work and its impact more visible than it had been. This helped people write better tests, adopt better testing practices and strategies, drastically improve build and test times, reduce bugs, and increase productivity. But perhaps the most visible result was scalability of the organization.

Google: Testing Grouplet results

2015, R. Potvin, Why Google Stores Bills. of LoC in a Single Repo

Rachel Potvin presented the following results in her presentation from @Scale 2015, “Why Google Stores Billions of Lines of Code in a Single Repository.” They may seem quaint to Googlers today, but they speak to the Testing Grouplet’s enduring impact five years after the TAP Fixit.

  • 15 million LoC in 250K files changed by humans per week
  • 15K commits by humans, 30K commits by automated systems per day
  • 800K/second peak file requests

Of course, the Testing Grouplet isn’t responsible for all of this; Rachel’s talk describes an entire ecosystem of tools and practices. Even so, she states very clearly that:

Also, it may amuse you to know that Testing on the Toilet, started in 2006, continues to this day!5

Apple: Quality Culture Initiative

2018-present

At Apple, I joined forces with a few others6 to start the Quality Culture Initiative, another volunteer group inspired by the Testing Grouplet.

  • Rapid growth, hiring the “best of the best,” build/test tools not scaling
    When I joined in 2018, the company was growing fast, and we knew we were “the best of the best.” However, our build and testing tools and infrastructure weren’t keeping up.

  • Widespread automated and manual testing, but…
    There was a strong testing culture, but not around unit testing.

  • “Testing like a user would” often considered most important
    With so much emphasis on the end user experience, many believed that “testing like a user would” was the most important kind of testing.

  • Tests often large, UI-driven, expensive, slow, flaky, and ineffective
    As a result, most tests were user interface driven, requiring full application and system builds on real devices. Since writing smaller tests wasn’t often considered, this led to a proliferation of large, expensive, slow, unreliable, and ineffective tests, generating waste and risk.

Apple: Quality Culture Initiative results

QCI activity as of November 2022—internal results confidential

It’s too early for the QCI to declare victory, and specific results to date are confidential. However, I can broadly describe the state of the QCI’s efforts by the time I left Apple in November 2022.

  • Training: 16 courses, ~40 volunteer trainers, ~360 sessions, ~6100 check-ins, ~3200 unique individuals
    We launched an ambitious, wildly successful, and thriving training program to spread good coding and testing practices.

  • Internal podcast: 45 episodes and 500+ subscribers
    Our internal podcast gave a voice to people of various roles and organizations across Apple to help drive the software quality conversation.

  • Quality Quest roadmap: ~80 teams, ~20 volunteer guides
    Our Quality Quest roadmap, directly inspired by Test Certified, is helping teams across Apple improve their quality practices and outcomes.

  • QCI Ambassadors: 6 organizations started, 6 on the way
    QCI Ambassadors help their organizations apply QCI principles, practices, and programs to achieve their quality goals.

  • QCI Roadshow: over 50 presentations
    The QCI Roadshow helped us introduce QCI concepts and programs directly to groups across the company.

  • QCI Summit: ~50 recorded sessions, ~60 presenters, ~850 virtual attendees
    Our QCI Summit event recruited presenters from across Apple to make their quality work and impact visible. We saw how QCI principles applied to operating systems and services, applications, frontends and backends, machine learning, internal IT, and development infrastructure.

What’s in a name?

What we realized three years after choosing it

One nice feature of the name "Quality Culture Initiative," which we didn’t realize for three years, was how it encoded the total Software Quality solution:

  • Quality is the outcome we’re working to achieve, but as I’ll explain, achieving lasting improvements requires influencing the…

  • Culture. Culture, however, is the result of complex interactions between individuals over time. Any effective attempt at influencing culture rests upon systems thinking, followed by taking…

  • Initiative to secure widespread buy-in for systemic changes. Selling a vision for systemic improvement and supporting people in pursuit of that vision requires leadership.

What Software Quality Is and Why It Matters

As leaders, we need to clearly define what we mean by software quality, and explain why it’s so important.

Is High Quality Software Worth the Cost?

martinfowler.com/articles/is-quality-worth-cost.html

In May 2019, Martin Fowler published “Is High Quality Software Worth the Cost?,” a brief article describing the tradeoffs and benefits of software quality.

Quality Type | Users       | Developers
External     | Makes happy | Keeps productive
Internal     | Keeps happy | Makes productive

He distinguished between:

  • External quality, which obviously makes users happy. This, in turn, keeps developers productive, since they don’t need to respond to problems reported by users. Then Martin argues that…

  • Internal quality helps keep users happy, by enabling developers to evolve the software easily and to resolve problems quickly. This is because high internal quality makes developers productive, since there’s less cruft and unnecessary complexity slowing them down from making changes.

Effects of quality on productivity over time

[Chart: cumulative functionality over time for low vs. high internal quality, adapted from Martin Fowler’s “Is High Quality Software Worth the Cost?”; the curves cross within weeks (not months), and the widening gap represents technical debt.]

Martin also used this hypothetical graph, based on his experience, to illustrate the impact of quality tradeoffs over time.

  • With Low internal quality, progress is faster at the beginning, but begins to flatten out quickly.

  • With High internal quality, progress is slower at the beginning, but the investment pays off in greater productivity over time.

  • The break-even point between the two approaches arrives within weeks, not months.

  • Though Martin’s original graph didn’t show this, the difference in productivity between low and high internal quality is one way to visualize technical debt.7

“Fast, cheap, or good: pick two” (or rather, pick three)

High quality software is cheaper to produce

Martin’s conclusion is that higher quality makes software cheaper to produce in the long run—that the “cost” of high quality software is negative. “Fast, cheap, or good: pick two” doesn’t hold as the system evolves. It may make sense at first to sacrifice good to get a cheaper product to market quickly. But over time, investing in “good” is necessary to continue delivering a product quickly and at a competitive cost.

Internal quality aids understanding

Clarity increases productivity and resilience, manages risk

Internal quality essentially helps developers continue to understand the system as it changes over time:8

  • Fosters productivity due to the clarity of the impact of changes
    When they clearly understand the impact of their changes, they can maintain a rapid, productive pace.

  • Prevents foreseeable issues, limits recovery time from others
    Understanding helps them prevent many foreseeable issues, and resolve any bugs quickly and effectively.

  • Provides a buffer for the unexpected, guards against cascading failures
    These qualities help create a buffer for handling unexpected events,9 while also guarding against cascading failures.

  • Your Admins/SREs will thank you! It helps them resolve prod issues faster.
    Your system administrators or SREs will be very grateful for building such resilience into your system, as it helps their response times as well.

  • Counterexamples: Global supply chain shocks; Southwest Airlines snafu
    For counterexamples, recall the global supply chain shocks resulting from the COVID-19 pandemic, or the December 2022 Southwest Airlines snafu. These systems worked very efficiently in the midst of normal operating conditions. However, their intolerance for deviations from those conditions rendered them vulnerable to cascading failures.

  • Quality, clarity, resilience are essential requirements of prod systems
    Consequently, internal software quality, and the clarity and resilience it enables, are essential requirements of any production software system.

Focusing on internal software quality is good for business…because it’s the right thing to do.

As mentioned, Martin Fowler’s argument is that internal software quality is good for business—it’s such a compelling argument that I brought it up first. However, he prefers making only this economic argument for quality. He asserts that appeals to professionalism are moralistic and doomed, as they imply that quality comes at a cost.10

I disagree that we should sidestep appeals to professionalism entirely, and that they’re incompatible with the economic argument. I think it’s worth exploring why professionalism matters, both because it is moral and because customers increasingly expect high quality software they can trust.

Quality without function is useless—but function without quality is dangerous.

Put more bluntly, high quality may be useless without sufficient functionality, but as we’ll see, functionality without quality can be dangerous. Let’s look at a few examples.

Northrop Grumman Mission Systems

Navigation for US Coast Guard vessels or US Navy submarines

My first professional programming project was a nautical chart renderer for a navigation system used by Coast Guard vessels and Navy nuclear submarines.

  • Requirement: Enumerate chart features
    One day our product owner sent us some code to enumerate nautical chart features from a file.

  • Assumption: In-memory size == on-disk size
    The code assumed each record was the same size in memory as it was on disk.

  • Reality: 21 bytes on disk, 24 in memory
    However, the records were 21 bytes on disk, but the in-memory structs were 24 bytes thanks to byte padding, as the sketch below illustrates.

  • Outcome: File size/24 == 12.5% data loss
    As a result, this code ignored one eighth of the chart features in the file.

  • Impact: Caught before shipping!
    Fortunately I caught this before it shipped to any nuclear submarines.
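To make the padding mismatch concrete, here’s a minimal C++ sketch; the field layout and names are invented for illustration, not taken from the actual chart code, but the arithmetic is the same:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical chart-feature record: 21 bytes of data packed end to end in the
// file, but padded in memory so the double members stay aligned.
struct ChartFeature {
    double   latitude;   //  8 bytes
    double   longitude;  //  8 bytes
    uint32_t kind;       //  4 bytes
    uint8_t  flags;      //  1 byte  -> 21 bytes of data; sizeof reports 24
};

int main() {
    const long fileSize = 21 * 1000;  // a file holding 1,000 packed records

    // Buggy assumption: in-memory size == on-disk size.
    const long assumed = static_cast<long>(fileSize / sizeof(ChartFeature));  // 875
    // Correct: divide by the known on-disk record size.
    const long actual  = fileSize / 21;                                       // 1,000

    std::printf("sizeof(ChartFeature) = %zu bytes\n", sizeof(ChartFeature));
    std::printf("records assumed: %ld, actual: %ld (%.1f%% silently dropped)\n",
                assumed, actual, 100.0 * (actual - assumed) / actual);
    return 0;
}
```

On typical platforms the compiler pads this struct to 24 bytes, so dividing the file size by sizeof() undercounts the records by exactly one eighth.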

Apple’s goto fail

Finding More Than One Worm in the Apple, CACM, July 2014

In February 2014, Apple had to update part of its Secure Transport component…

  • Requirement: Apply algorithm multiple times
    …which applied the same algorithm in six places.

  • Assumption: Short algorithms safe to copy
    The developers apparently assumed that this short, ten-line algorithm was safe to copy in its entirety, instead of making it a function.

  • Reality: Copies may not stay identical
    One problem with duplication is that the copies may not remain identical.

  • Outcome: One of six copies had a bug
    As it so happened, one of the six copies of this algorithm picked up an extra "goto" statement that short-circuited a security handshake.

  • Impact: Billions of devices
    Once it was discovered and patched, Apple had to push an emergency update to billions of devices. It’s unknown whether it was ever exploited.

My article “Finding More Than One Worm in the Apple” explains how this bug could’ve been caught, or prevented, by a unit test.
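For reference, here’s a condensed C++ sketch of the failure mode; the helper names are hypothetical stand-ins, not the actual Secure Transport functions:

```cpp
#include <cstdio>

using OSStatus = int;

static OSStatus updateHash()      { return 0; }   // stand-in: always succeeds
static OSStatus verifySignature() { return -1; }  // stand-in: would reject a forgery

static OSStatus verifyServerKeyExchange() {
    OSStatus err = 0;

    if ((err = updateHash()) != 0)
        goto fail;
        goto fail;   // stray duplicated line: always taken, with err == 0,
                     // so the signature check below is silently skipped...
    if ((err = verifySignature()) != 0)
        goto fail;

fail:
    return err;      // ...and the function reports success for a bad signature.
}

int main() {
    std::printf("handshake result: %d (0 means accepted)\n", verifyServerKeyExchange());
    return 0;
}
```

A unit test that feeds this function an invalid signature and asserts a nonzero error would have started failing the moment the duplicated line appeared.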

OpenSSL’s Heartbleed

Goto Fail, Heartbleed, and Unit Testing Culture, May 2014

In April 2014, OpenSSL had to update its “heartbeat” feature…

  • Requirement: Echo message from request
    …which echoed a message supplied by a user request.

  • Assumption: User-supplied length is valid
    The code assumed that the user-supplied message length matched the actual message length.

  • Reality: Actual message may be empty
    In fact, the message could be completely empty.

  • Outcome: Server returns arbitrary data
    In that case, the server would hand back however many bytes of its own memory that the user requested, including secret key data.

  • Impact: Countless HTTPS servers
    Countless HTTPS servers had to be patched. It’s unknown whether it was ever exploited.

My article “Goto Fail, Heartbleed, and Unit Testing Culture” explains how this bug could’ve been caught, or prevented, by a unit test.
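Here’s a simplified sketch of the pattern; it isn’t the actual OpenSSL code, and the names are invented, but the trust-the-claimed-length mistake is the same:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Simplified heartbeat record: two length bytes followed by the payload.
// (The real TLS record has more fields; the names here are invented.)

std::vector<uint8_t> heartbeatEchoBuggy(const uint8_t* record, std::size_t recordLen) {
    // Attacker-controlled claimed payload length, read straight from the request.
    const uint16_t claimed = static_cast<uint16_t>(record[0] << 8 | record[1]);
    const uint8_t* payload = record + 2;

    std::vector<uint8_t> response(claimed);
    // BUG: trusts `claimed` without comparing it to recordLen. If the payload is
    // shorter (or empty), this reads past the record into adjacent heap memory
    // and echoes it back: keys, passwords, whatever happens to be there.
    std::memcpy(response.data(), payload, claimed);
    return response;
}

std::vector<uint8_t> heartbeatEchoFixed(const uint8_t* record, std::size_t recordLen) {
    if (recordLen < 2) return {};
    const uint16_t claimed = static_cast<uint16_t>(record[0] << 8 | record[1]);
    // Fix: silently discard requests whose claimed length exceeds what was sent.
    if (static_cast<std::size_t>(claimed) + 2 > recordLen) return {};
    return std::vector<uint8_t>(record + 2, record + 2 + claimed);
}

int main() {
    // A malicious request: claims 16,384 payload bytes but carries none.
    const uint8_t record[2] = {0x40, 0x00};
    const auto echoed = heartbeatEchoFixed(record, sizeof(record));
    std::printf("fixed handler echoed %zu bytes\n", echoed.size());  // prints 0
    // heartbeatEchoBuggy(record, sizeof(record)) would copy 16,384 bytes of
    // whatever lies beyond `record` into the response.
    return 0;
}
```

A small test that sends an empty payload with a large claimed length, and asserts that nothing beyond the payload comes back, would have flagged the buggy handler immediately.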

The point is that…

Quality Culture is ultimately Safety Culture

…a culture that values and invests in software quality is a Safety Culture. Society’s dependence on software to automate critical functions is only increasing. It’s our duty to uphold that public trust by cultivating a quality culture.

Why Software Quality Is Often Unappreciated and Sacrificed

Our next step as leaders is to understand, if software quality is so important, why it’s so often unappreciated and sacrificed.

Automated (especially unit) testing

Why hasn’t it caught on everywhere yet?

  • Apple, in January 1992, identified the need to make time for training, documentation, code review—and unit testing!

At Apple, I found a document from January 1992 specifically identifying the need to make time for training, documentation, code review—and unit testing! That’s not just before Test-Driven Development and the Agile Manifesto, that’s before the World Wide Web!

There are a few reasons why unit testing in particular hasn’t caught on everywhere yet:11

  • People think it’s obvious and easy—therefore lower value
    Many developers think it’s obvious and easy, and therefore can’t provide much value.12

  • Many still haven’t seen it done well—or may have seen it done poorly
    Many others still haven’t seen it done well, or may have seen it done poorly, leading to the belief that it’s actually harmful.

  • There’s always a learning curve involved
    For those actually open to the idea, there’s still a learning curve, which they may not have the time to climb.

  • Bad perf review if management doesn’t care about testing/internal quality
    They may also fear spending time on testing and internal quality will result in a bad performance review if their management doesn’t care about it.13

Have you heard these ones before?

Common excuses for sacrificing unit testing and internal quality

Of course, people give their own reasons for not investing in testing and quality, including the following:

  • Tests don’t ship/internal quality isn’t visible to users (i.e., cost, not investment)
    “We don’t ship tests, and users don’t care about internal quality.” Meaning, testing seems like a cost, not an investment.

  • “Testing like a user would” is the most important kind of testing
    As mentioned before, “Testing like a user would” is considered most important, so investing in smaller tests and internal quality seems unnecessary.

  • Straw man: 100% code coverage is bullshit
    The straw man that “writing tests to get 100% code coverage is bullshit.” This speaks to a fundamental ignorance about how to write good tests or to use coverage the right way.

  • Straw man: Testing is a religion (implying: I’m better than those people)
    For some reason, technical people, especially programmers, like to pound their chests as being against so-called testing "religion" and those who practice it. It’s a flimsy excuse to score social points by virtue signaling in front of one’s perceived peers. Framing a potentially reasonable discussion of different testing ideas in such a way only serves to shut it down for a superficial, unprofessional ego boost.14

  • “My code is too hard to test.” (The Snowflake Fallacy)
    The common, evidence-free "My code is too hard to test" assertion, which I call the Snowflake Fallacy.

  • “I don’t have time to test.”
    Finally, “I don’t have time to test.” This could be a brush off, or a genuine indication that they don’t know how and can’t spare the time to learn—and management doesn’t care.

Business as Usual

All these reasons are why Business as Usual persists, as well as the Complexity, Risk, Waste, and Suffering that everyone’s used to. This then allows the Normalization of Deviance to take hold.

Normalization of Deviance (paraphrased)

Coined by Diane Vaughan in The Challenger Launch Decision

Diane Vaughan introduced this term in her book about the Space Shuttle Challenger explosion in January 1986.15 My paraphrased version of the definition is: A gradual lowering of standards that becomes accepted, and even defended, as the cultural norm.

Space Shuttle Challenger Accident Report

history.nasa.gov/rogersrep/genindex.htm, Chapter VI, pp. 129-131

The key evidence of this phenomenon is articulated in chapter six of the Rogers commission report.16

  • The O-rings didn’t just fail on the Challenger mission on 1986-01-28…
    Many of us may know that the O-rings lost elasticity in the cold weather, allowing gases to escape, which led to the explosion.

  • …anomalies occurred in 17 of the 24 (70%) prior Space Shuttle missions…
    However, you may not realize that NASA detected anomalies in O-ring performance in 17 of the previous 24 shuttle missions, a 70 percent failure rate.

  • …and in 14 of the previous 17 (82%) since 1984-02-03
    Even scarier, anomalies were detected in 14 of the previous 17 missions, for an 82% failure rate.

  • Multiple layers of engineering, management, and safety programs failed

This wasn’t only one person’s fault—multiple layers of engineering, management, and safety programs failed.17 However, Normalization of Deviance isn’t the end of the problem.

NASA: NoD leads to Groupthink

Terry Wilcutt and Hal Bell of NASA delivered their presentation The Cost of Silence: Normalization of Deviance and Groupthink in November 2014.18 On the Normalization of Deviance, they noted that:

“There’s a natural human tendency to rationalize shortcuts under pressure, especially when nothing bad happens. The lack of bad outcomes can reinforce the ‘rightness’ of trusting past success instead of objectively assessing risk.”

—Terry Wilcutt and Hal Bell, The Cost of Silence: Normalization of Deviance and Groupthink

They go on to cite the definition of Groupthink from Irving Janis:

“[Groupthink is] a quick and easy way to refer to a mode of thinking that persons engage in when they are deeply involved in a cohesive in-group, when concurrence-seeking becomes so dominant that it tends to override critical thinking or realistic appraisal of alternative courses of action.”

—Irving Janis, Groupthink: psychological studies of policy decisions and fiascoes

NASA: Symptoms of Groupthink

They then describe the symptoms of Groupthink:19

  1. Illusion of invulnerability—because we’re the best!
  2. Belief in Inherent Morality of the Group—we can do no wrong!
  3. Collective Rationalization—it’s gonna be fine!
  4. Out-Group Stereotypes—don’t be one of those people!
  5. Self-Censorship—don’t rock the boat!
  6. Illusion of Unanimity—everyone goes along to get along!
  7. Direct Pressure on Dissenters—because everyone else disagrees!
  8. Self-Appointed Mindguards—decision makers exclude subject matter experts from the conversation.

Any of these sound familiar? Hopefully from past, not current, experiences.

A common result of Groupthink is the well known systems thinking phenomenon called “The Cobra Effect.”

The Cobra Effect

ourworld.unu.edu/en/systems-thinking-and-the-cobra-effect

  • Pay bounty for dead cobras
    This comes from the true story of when the British administration in India offered people a bounty to help reduce the cobra population.

  • Cobras disappear, but still paying
    This worked, but the British noticed they kept paying bounties when they didn’t see any more cobras around.

  • People were harvesting cobras
    They realized people were raising cobras just to collect the bounty…

  • Ended bounty program
    …so they ended the bounty program.

  • More cobras in streets than before!
    People then threw their now useless cobras into the streets, making the problem worse than before.

  • Fixes that Fail: Simplistic solution, unforeseen outcomes, worse problem
    This is an example of the “Fixes That Fail” archetype. This entails applying an overly simplistic solution to a complex problem, resulting in unforeseen outcomes that eventually make the problem worse.

The Arms Race

Systems thinking should replace brute force in the long run.

In software, I call this “The Arms Race.” This may sound familiar:

  • Investment to create capacity for existing practices and processes…
    We invest people, tools, and infrastructure into expanding the capacity of our existing practices and processes.

  • Exhaustion of capacity leads to more people, tools, and infrastructure
    Things are better for a while, but as the company and its projects and their complexity grow, that capacity’s eventually exhausted. This leads to further investment of people, tools, and infrastructure.

  • Then that capacity’s exhausted…
    Then eventually that capacity’s exhausted…and the cycle continues.

  • AI-generated tests may help people get started in some cases…
    There’s a lot of buzz lately about using AI-generated tests to take the automated testing burden off of humans. There may be room for AI-generated tests as a starting point in some cases.

  • …but beware of abdicating professional responsibility to unproven tools.
    However, AI can’t read your mind to know all your expectations—all the requirements you’re trying to satisfy or all the assumptions you’re making.20 Even if it could, AI will never absolve you of your professional responsibility to ensure a high quality and trustworthy product.21

  • In the end, we can’t win the arms race against growth and complexity.
    We need to realize we can’t win the Arms Race against growth and complexity.

Rogers Report Volume 2: Appendix F

history.nasa.gov/rogersrep/v2appf.htm

Richard Feynman’s final statement on the Challenger disaster is a powerful reminder of our human limitations:22

“For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.”

—Richard Feynman, Personal Observations on Reliability of Shuttle

As professionals, we must resist the Normalization of Deviance, Groupthink, and the Arms Race. We can’t allow them to become, or to remain, Business as Usual.

Don’t shame people for doing Business as Usual. Help them recognize and change it.

But we can’t shame anyone for falling into these common traps, because it’s human nature to fall into them. We need to help everyone recognize them, avoid them, remain vigilant against them, and change Business as Usual together.

Challenging & Changing Business as Usual

Changing cultural norms requires understanding them first.

Challenging cultural norms supporting Business as Usual isn’t easy, and it’s honestly frightening. There’s actually a good reason for that:

  • Everything in a culture happens for a reason—challenge reasons thoughtfully!
    The existing norms do exist for a reason. The question is whether that reason holds up today. So we need to try to understand those reasons first, then challenge them thoughtfully.23

  • People often feel invested in old ways and fear the cost of new ways.
    This is because many people feel invested in their existing methods. They fear the cost and risk of changing those methods, even if they no longer provide the value they once did. This is common human nature…

  • We’re often asked for data to prove a different approach actually helps—before trying it…
    …and is why some will ask for data or other proof that a change will be effective before trying it.

  • …while we throw time, money, people, and tools at existing processes.
    At the same time, they’ll continue throwing resources into existing processes as they have for years.

The most productive way to approach such a challenge requires taking the time to gather enough information and build some trust.

  • Challenge: Haven’t we proven that the existing ways aren’t (totally) working?
    You can then carefully question whether current methods are effective, or effective enough on their own, given substantial historical evidence to the contrary. On this basis, you may persuade some to reexamine the problem and try a different approach.

Notice that this challenge to the status quo is in the form of a thoughtful question.

The power of asking good questions

“What could we do differently to improve our software quality?”

The ultimate question is “What could we do differently to improve our software quality?”24 However…

  • Questions develop shared understanding of the culture before changing it
    …there are many more questions necessary to help everyone understand why things are the way they are, what needs to change—and how.

  • Asking encourages thinking about a problem, possible solutions
    Asking good questions includes people in the process of discovery and finding solutions, which develops their own knowledge and problem solving skills.

  • Good questions enable new information, perspectives, and ideas to emerge
    Good questions enable people to share information, perspectives, and ideas that wouldn’t otherwise arise if they were only told what to think or do.

  • Asking what to do is more engaging than telling what to do—produces buy-in
    Taking the time to ask questions pulls people into the change process, which increases their motivation to buy into any proposed changes that emerge.

This is summed up by a great Senegalese proverb shared by Bono in Chapter 20 of John Doerr’s Measure What Matters:

  • “If you want to cut a man’s hair, it is better if he is in the room.”

There is one catch, though.

If people don’t know where to begin, are stuck in old ways, or are under stress…direct near term guidance may be necessary.

Your audience may get stuck. They may currently have no idea about what to change or how—because they lack knowledge, experience, or imagination to consider approaches beyond the status quo. Or, as we’ll discuss shortly, they may be under incredible stress and unable to think clearly or creatively.

In that case, you may need to provide more direct guidance, at least in the beginning.25 So here are some tools for providing that guidance.

The Test Pyramid and Vital Signs

Lasting improvements to software quality—and Business as Usual—begin with making quality work and its impact visible at a fundamental level. The Test Pyramid and Vital Signs are concepts accessible to any team committed to this principle. Applying these models is the first step towards ending the suffering caused by the Normalization of Deviance, Groupthink, and the Arms Race.

Before getting into the details, let’s understand the specific problems we need to solve.

Working back from the desired experience

Inspired by Steve Jobs Insult Response

In this famous Steve Jobs video, he explains the need to work backward from the customer experience, not forward from the technology. So let’s compare the experience we want ourselves and others to have with our software to the experience many of us may have today.

What we want | What we have
Delight      | Suffering
Efficiency   | Waste
Confidence   | Risk
Clarity      | Complexity
  • What we want
    We want to experience Delight from using and working on high quality software,26 which largely results from the Efficiency high quality software enables. Efficiency comes from the Confidence that the software is in good shape, which arises from the Clarity the developers have about system behavior.

  • What we have
    However, we often experience Suffering from using or working on low quality software, reflecting a Waste of excess time and energy spent dealing with it. This Waste is the result of unmanaged Risk leading to lots of bugs and unplanned work. Bugs, unplanned work, Risk, and fear take over when the system’s Complexity makes it difficult for developers to fully understand the effect of new changes.

Difficulty in understanding changes produces drag—i.e., Technical debt.

Difficulty in understanding how new changes could affect the system is the telltale sign of low internal quality, which drags down overall quality and productivity. Recall that the difference between actual and potential productivity, correlated with internal quality, is what we identified earlier as technical debt.

This contributes to the common scenario of a crisis emerging…

Replace heroics with a Chain Reaction!

…that requires technical heroics and personal sacrifice to avert catastrophe. To get a handle on avoiding such situations, we need to create the conditions for a positive Chain Reaction. By creating and maintaining the right conditions over time, we can achieve our desired outcomes without stress and heroics.

The main obstacle to replacing heroics with a Chain Reaction isn’t technology…

The challenge is belief—not technology

A little awareness goes a long way

…it’s an absence of awareness or belief that a better way exists.27,28

  • Many of these problems have been solved for decades
    Despite the fact that many quality and testing problems have been solved for decades…29

  • Many just haven’t seen the solutions, or seen them done well…
    …many still haven’t seen those solutions, or seen them done well.30

  • The right way can seem easy and obvious—after someone shows you!
    The good news is that these solutions can seem easy and obvious—after they’ve been clearly explained and demonstrated.31

  • What does the right way look like?
    So how do we get started showing people what the right way to improve software quality looks like?

The Test Pyramid

A balance of tests of different sizes for different purposes

We’ll start with the Test Pyramid model,32,33 which represents a balance of tests of different sizes for different purposes. Realizing that tests can come in more than one size is often a major revelation to people who haven’t yet been exposed to the concept.34 It’s not a perfect model—no model is—but it’s an effective tool for pulling people into a productive conversation about testing strategies for the first time.35

Size                 | Scope                | Ownership           | Code visibility      | Dependencies       | Control/Reliability/Independence | Resource usage/Maint. cost | Speed/Feedback loop | Confidence
Large (System, E2E)  | Entire system        | QA, some developers | Details not visible  | All                | Low                              | High                       | Slow                | Entire system
Medium (Integration) | Components, services | Developers, some QA | Some details visible | As few as possible | Medium                           | Medium                     | Faster              | Contract between components
Small (Unit)         | Functions, classes   | Developers          | All details visible  | Few to none        | High                             | Low                        | Fastest             | Low-level details, individual changes

The Test Pyramid helps us understand how different kinds of tests give us confidence in different levels and properties of the system.36 It can also help us break the habit of writing large, expensive, flaky tests by default.37

  • Small tests are unit tests that validate only a few functions or classes at a time with very few dependencies, if any. They often use test doubles38 in place of production dependencies to control the environment, making the tests very fast, independent, reliable, and cheap to maintain. Their tight feedback loop39 enables developers to detect and repair problems quickly that would be more difficult and expensive to catch with larger tests. They can also run in local and virtualized environments and can be parallelized. (A sketch of such a test follows this list.)

  • Medium tests are integration tests that validate contracts and interactions with external dependencies or larger internal components of the system. While not as fast or cheap as small tests, by focusing on only a few dependencies, developers or QA can still run them somewhat frequently. They detect specific integration problems and unexpected external changes that small tests can’t, and can do so more quickly and cheaply than large system tests. Paired with good internal design, these tests can ensure that test doubles used in small tests remain faithful to production behavior.40

  • Large tests are full, end-to-end system tests, often driven through user interface automation or a REST API. They’re the slowest and most expensive tests to write, run, and maintain, and can be notoriously unreliable. For these reasons, writing large tests by default for everything is especially problematic. However, when well designed and balanced with smaller tests, they cover important use cases and user experience factors that aren’t covered by the smaller tests.
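To make the small end of the pyramid concrete, here’s a minimal sketch of a small test; all the names are hypothetical, and the test double stands in for reading real chart data from disk:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// The seam: production code depends on this interface, not on file I/O directly.
struct FeatureStore {
    virtual ~FeatureStore() = default;
    virtual std::vector<std::string> featuresNear(double lat, double lon) const = 0;
};

// The unit under test.
class ChartRenderer {
public:
    explicit ChartRenderer(const FeatureStore& store) : store_(store) {}
    std::size_t renderNear(double lat, double lon) const {
        return store_.featuresNear(lat, lon).size();  // pretend this draws them
    }
private:
    const FeatureStore& store_;
};

// Test double: fast, deterministic, no files, devices, or network.
struct FakeFeatureStore : FeatureStore {
    std::vector<std::string> features;
    std::vector<std::string> featuresNear(double, double) const override {
        return features;
    }
};

int main() {
    // Arrange: a fake store with two known features.
    FakeFeatureStore fake;
    fake.features = {"buoy", "lighthouse"};
    const ChartRenderer renderer(fake);

    // Act: exercise just the unit under test.
    const std::size_t drawn = renderer.renderNear(47.6, -122.3);

    // Assert: the expectation is explicit, and a failure points straight at it.
    assert(drawn == 2);
    return 0;
}
```

Because it touches no files, devices, or networks, a test like this runs in milliseconds and fails only when the logic under test actually changes.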

Thoughtful, balanced strategy == Reliability, efficiency

Each test size validates different properties that would be difficult or impossible to validate using other kinds of tests. Adopting a balanced testing strategy that incorporates tests of all sizes41 enables more reliable and efficient development and testing—and higher software quality, inside and out.

Inverted Test Pyramid

Many larger tests, few smaller tests

Of course, many projects have a testing strategy that resembles an inverted Test Pyramid, with too many larger tests and not enough smaller tests. This leads to a number of common problems:

  • Tests tend to be larger, slower, less reliable
    The tests are slower and less reliable than they would be if the strategy relied more on smaller tests.

  • Broad scope makes failures difficult to diagnose
    Because large tests execute so much code, it might not be easy to tell what caused a failure.

  • Greater context switching cost to diagnose/repair failure
    That means developers have to interrupt their current work to spend significant time and effort diagnosing and fixing any failures.

  • Many new changes aren’t specifically tested because “time”
    Since most of the tests are large and slow, this incentivizes developers to possibly skip writing or running them because they “don’t have time.”

  • People ignore entire signal due to flakiness…
    Worst of all, since large tests are more prone to be flaky,42 people will begin to ignore test failures in general. They won’t believe their changes cause any failures, since the tests were failing before—they might even be flagged as “known failures.”43 And as we’ll recall…

  • …fostering the Normalization of Deviance
    …the Space Shuttle Challenger O-rings suffered from “known failures” as well, cultivating the Normalization of Deviance.

Causes

Let’s go over some of the reasons behind this situation, touching on some of the same reasons we covered before.

  • Features prioritized over internal quality/tech debt
    People are often pressured to continue working on new features that are “good enough” instead of reducing technical debt. This may be especially true for organizations that set aggressive deadlines and/or demand frequent live demonstrations.44

  • “Testing like a user would” is more important
    Again, if “testing like a user would” is valued more than other kinds of testing, then most tests will be large and user interface-driven.

  • Reliance on more tools, QA, or infrastructure (Arms Race)
    This also tends to instill the mindset that the testing strategy isn’t a problem, but that we always need more tools, infrastructure, or QA headcount. This is the Arms Race mindset we discussed earlier.

  • Landing more, larger changes at once because “time”
    Because the existing development and testing process is slow and inefficient, individuals try to optimize their productivity by integrating large changes at once. These changes are unlikely to receive either sufficient testing or sufficient code review, increasing the risk of bugs slipping through. It also increases the chance of large test failures that aren’t understood. The team is inclined to tolerate these failures, because there isn’t “time” to go back and redo the change the right way.

  • Lack of exposure to good examples or effective advocates
    As mentioned before, many people haven’t actually witnessed or experienced good testing practices before, and no one is advocating for them. This instills the belief that the current strategy and practices are the best we can come up with.

  • We tend to focus on what we directly control—and what management cares about! (Groupthink)
    In such high stress situations, it’s human nature to focus on doing what seems directly within our control in order to cope. Alternatively, we tend to prioritize what our management cares about, since they have leverage over our livelihood and career development. It’s hard to break out of a bad situation when feeling cornered—and too easy to succumb to Groupthink without realizing it.

So how do we break out of this corner—or help others to do so?

Quality work can be hard to see. It’s hard to value what can’t be seen—or to do much of anything about it.

We have to overcome the fundamental challenge of helping people see what internal quality looks like. We have to help developers, QA, managers, and executives care about it and to resist the Normalization of Deviance and Groupthink. We need to better show our quality work to help one another improve internal quality and break free from the Arms Race mindset.

In other words, internal quality work and its impact is a lot like The Matrix…

“Unfortunately, no one can be told what the Matrix is. You have to see it for yourself.”

—Morpheus, The Matrix

One way to start showing people The Matrix is to get buy-in on a set of…

Vital Signs

…“Vital Signs.” Vital Signs are…

A collection of signals designed by a team to reflect quality and productivity and to rapidly diagnose and resolve problems

Intent

  • Comprehensive and make sense to the team and all stakeholders.
    They should be comprehensive and make sense at a high level to everyone involved in the project, regardless of role.

  • Not merely metrics, goals, or data
    We’re not collecting them for the sake of saying we collect them, or to hit a goal one time and declare victory.45

  • Information for repeated evaluation
    We’re collecting them because we need to evaluate and understand the state of our quality and productivity over time.

  • Inform decisions whether or not to act in response
    These evaluations will inform decisions regarding how to maintain the health of the system at any moment.

Common elements

Some common signals include:

  • Pass/fail rate of continuous integration system
    The tests should almost always pass, but failures should be meaningful and fixed immediately.

  • Size, build and running time, and stability of small/medium/large test suites
    The faster and more stable the tests, the fewer resources they consume, and the more valuable they are.

  • Size of changes submitted for code review and review completion times
    Individual changes should be relatively small, and thus easier and faster to review.

  • Code coverage from small to medium-small test suites
    Each small-ish test should cover only a few functions or classes, but the overall coverage of the suite should be as high as possible.46

  • Passing use cases covered by medium-large to large and manual test suites
    For larger tests, we’re concerned about whether higher level contracts, use cases, or experience factors are clearly defined and satisfied before shipping.

  • Number of outstanding software defects and Mean Time to Resolve
    Tracking outstanding bugs is a very common and important Vital Sign. If you want to take it to the next level, you can also begin to track the Mean Time to Resolve47 these bugs. The lower the time, the healthier the system.
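For example, if five bugs resolved in a given month took 1, 2, 3, 6, and 18 days to close, the Mean Time to Resolve for that month would be (1 + 2 + 3 + 6 + 18) / 5 = 6 days.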

Other potentially meaningful signals

Some other potentially meaningful signals include…

  • Static analysis findings (e.g., complexity, nesting depth, function/class sizes)
    Popular source control platforms, such as GitHub, can incorporate static analysis findings directly into code reviews as well. This encourages developers to address findings before they land in a static analysis platform report.48

  • Dependency fan-out
    Dependencies contribute to system and test complexity, which contribute to build and test times. Cutting unnecessary dependencies and better managing necessary ones can yield immediate, substantial savings.

  • Power, performance, latency
    These user experience signals aren’t caught by traditional automated tests that evaluate logical correctness, but are important to monitor.

  • Anything else the team finds useful for its purposes
    As long as it’s a clear signal that’s meaningful to the team, include it in the Vital Signs portfolio.

Use them much like production telemetry

Treat Vital Signs like you would any production telemetry that you might already have.

  • Keep them current and make sure the team pays attention to them.

  • Clearly define acceptable levels—then achieve and maintain them.

  • Identify and respond to anomalies before urgent issues arise.

  • Encourage continuous improvement—to increase productivity and resilience.

  • Use them to tell the story of the system’s health and team culture.

To this last point, we’ll return to the importance of storytelling later.

Example usage: issues, potential causes (not exhaustive!)

Here are a few hypothetical examples of how Vital Signs can help your team identify and respond to issues.

  • Builds 100% passing, high unit test coverage, but high software defects
    If your builds and code coverage are in good shape, but you’re still finding bugs…

    • Maybe gaps in medium-to-large test coverage, poorly written unit tests
      …it could be that you need more larger tests. Or, it could be your unit tests aren’t as good as you think, executing code for coverage but not rigorously validating the results.
  • Low software defects, but schedule slipping anyway
    If you don’t have many bugs, but productivity still seems to be dragging…

    • Large changes, slow reviews, slow builds+tests, high dependency fan-out
      …maybe people are still sending huge changes to one another for review. Or maybe your build and test times are too slow, possibly due to excess dependencies.
  • Good, stable, fast tests, few software defects, but poor app performance
    Maybe builds and tests are fine, and there are few if any bugs, but the app isn’t passing performance benchmarks.

    • Discover and optimize bottlenecks—easier with great testing already in place!
      In that case, your investment in quality practices has paid off! You can rigorously pursue optimizations, without the fear that you’ll unknowingly break behavior.

Getting started, one small step at a time

Here are a few guidelines for getting started collecting Vital Signs. First and foremost…

  • Don’t get hung up on having the perfect tool or automation first.
    Do not get hung up on thinking you need special tools or automation at the beginning. You may need to put some kind of tool in place if you have no way to get a particular signal. But if you can, collect the information manually for now, instead of wasting time flying blind until someone else writes your dream tool.

  • Start small, collecting what you can with tools at hand, building up over time.
    You also don’t need to collect everything right now. Start collecting what you can, and plan to collect more over time.

  • Focus on one goal at a time: lowest hanging fruit; biggest pain point; etc.
    As for which Vital Signs to start with, that’s totally up to you and your team. You can start with the easiest signals, or the ones focused on your biggest pain points—it doesn’t matter. Decide on a priority and focus on that first.

  • Update a spreadsheet or table every week or so—manually, if necessary.
    If you don’t have an automated collection and reporting system handy, then use a humble spreadsheet or wiki table, like the example after this list. Spend a few minutes every week updating it.

  • Observe the early dynamics between the information and team practice.
    Discuss these updates with your team, and see how it begins to shift the conversation—and the team’s behavior.

  • Then make a case for tool/infrastructure investment based on that evidence.
    Once you’ve got evidence of the value of these signals, then you can justify and secure an investment in automation.49
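As a purely hypothetical illustration (every signal and number here is invented), a starter table might look like this:

Week   | CI pass rate | Tests (small/medium/large) | Small-suite coverage | Open defects | Mean Time to Resolve
Week 1 | 92%          | 410 / 56 / 24              | 61%                  | 37           | 9 days
Week 2 | 95%          | 447 / 58 / 24              | 64%                  | 33           | 8 days

Even something this humble makes week-over-week trends, and the conversation about them, visible to the whole team.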

Building a Software Quality Culture

By asking good questions, spreading awareness of the Test Pyramid, and making internal quality visible via Vital Signs, you’re shifting the culture of your team. The next step is to change the culture of your organization.

First, let’s define specifically what we mean by culture.50 One possible definition is that:

Culture is the shared lifestyle of a team or organization.

This lifestyle is what you see people doing together day to day, and the way they do it. For our purposes, we need to understand the essence of lifestyle, where it comes from and what shapes it. So here’s an expanded definition of “culture”:

Culture is the emergent result of a shared mindset manifest through concrete behaviors.

In order to influence lifestyle, which is the result, we have to influence concrete behaviors. In order to influence those, we need to influence people’s mindset. The most effective way to influence mindsets is to…

Sell—don’t tell!

People don’t like being told to change their behaviors, because it’s like being told to change their minds. If you know anything about people, you know they hate changing their minds unless they are doing the changing, by their own choice.

This is why we’ve emphasized asking questions, raising awareness, and working together to make quality visible—instead of imposing process changes or technical solutions through force. People need to understand and buy into changes in order to embrace them fully. We can’t force them to make changes they don’t perceive as necessary or valuable if we want the change to be successful.

Of course, not everyone’s going to change their mindset at once—some may never come around at all. However, our ultimate goal should be to…

Make the right thing the easy thing!51

As we continue working to improve quality and make it visible, it will get easier and easier to do both. Practices and their results will become more accessible, encouraging wider and wider adoption. Eventually, we want to make it harder not to do the right thing, because the right thing will happen by default.

This will be challenging and take time. It’s important to identify the right people to engage directly at the beginning of the process, and whom to put off until later.

Focus on the Early Majority/Total Product

Geoffrey A. Moore, Crossing the Chasm, 3rd Edition

[Diagram: the technology adoption life cycle, adapted from Geoffrey A. Moore, Crossing the Chasm, 3rd Edition: Innovators and Early Adopters (together, the Instigators), then The Chasm, then the Early Majority, Late Majority, and Laggards.]

The “Crossing the Chasm” model from Geoffrey Moore’s book of the same name can help us make that identification.52 There are many nuances to it, but at a high level, it illustrates how different segments of a population respond to a particular innovation.

  • Innovators and Early Adopters are like-minded seekers, enthusiasts and visionaries who together bring an innovation to the market and lead people to adopt it. I like to lump them together and call them Instigators.

  • The Early Majority are pragmatists who are open to the new innovation, but require that it be accessible and ready to use before adopting it.

  • The Late Majority are followers waiting to see whether or not the innovation works for the Early Majority before adopting it.

  • Laggards are the resisters who feel threatened by the innovation in some way and complain about it the most. They may potentially raise valid concerns, but often they only bluster to rationalize sticking with the status quo.53

The Instigators face the challenge of bringing an innovation across The Chasm separating them from the Early Majority, developing what Moore calls The Total Product. Developing the Total Product requires that the Instigators identify and fulfill several needs the Early Majority have in order to facilitate adoption.

As Instigators leading people to improve software quality and make it visible, focus your energy on connecting with other Instigators and the early Early Majority. Don’t worry so much about the rest—focus on delivering the Total Product, and it will take care of the other groups.

The Rainbow of Death

mike-bland.com/the-rainbow-of-death

[Diagram: The Rainbow of Death (mike-bland.com/the-rainbow-of-death): the needs Intervene, Validate, Inform, Inspire, Mentor, and Empower, spanning the journey from Dependent to Independent.]

This connection across the chasm isn’t part of the original Chasm model, but one I borrowed from a friend54 and called “The Rainbow of Death.” It helps illustrate those Early Majority needs the Instigators must satisfy. Doing so transforms the Early Majority from being dependent on the Instigators’ expertise into independent experts themselves.

My talk “The Rainbow of Death,” linked here, uses this model to tell the story of Google’s Testing Grouplet. It extracts order and clarity from five years of chaos, evolution, and eventually revolution.55 However, I’ve realized that while it’s a great storytelling device, it’s too complicated to apply at the beginning of the change process.56

Instigating Culture Change

Essential needs an internal community must support

Instead, once you’ve started forming an internal community of like-minded fellow Instigators, it’s best to focus on these essential needs and simplify your initial efforts:57

  • Individual Skill Acquisition
  • Team/Organizational Alignment
  • Quality Work/Results Visibility

Supporting each of these needs also helps support the other two, creating a virtuous cycle. No matter where you decide to focus as a starting point, the cycle gains momentum with every contribution of effort.58 However, it is important to focus on getting one effort fully in motion before trying to launch the next one.59

I’m going to give an overview of each of these needs and some ideas of how to address them. You can use these suggestions to determine a starting point, and to start percolating ideas for future efforts.

Individual Skill Acquisition

Help individuals incorporate principles, practices, language

Quality begins with the choices each of us make as individuals throughout our day. Awareness of sound quality principles and practices improves the quality of these choices. Developing a common language makes these principles and practices visible, so we can show them to one another, helping raise everyone’s quality game.

  • Training, documentation, internal media, mentorship, sharing examples
    We can offer training, documentation, and other internal media to spread awareness. We can also offer direct mentorship or share examples from our own experience to help others learn.

Now here’s a quick, high level summary of a few key principles and practices to help developers write better code and tests.

  • Testable code/architecture is maintainable—tests add design pressure, enable continuous refactoring; use code coverage as a tool, not a goal
    Designing code for testability, given proper guidance on principles and techniques, adds design pressure that yields higher quality code in general. Having good tests then enables constant improvements to code quality through continuous refactoring, instead of stopping the world for complex, risky overhauls or rewrites.60

    Good tests also enable developers to use code coverage as a tool while refactoring, helping ensure new and improved code replaces the previous code.61

  • Stop copy/pasting code; send several small code reviews, not one big one
    Two common habits that contribute to worse code quality are duplicating code62 and submitting large changes for review.63 Both habits make code difficult to read, test, review, and understand, which hides bugs and makes them hard to find and fix after they’ve shipped. Helping people write testable code also helps them break these costly bad habits.

  • Tests should be designed to fail: naming and organization can clarify intent and cause of failure; use Arrange-Act-Assert (a.k.a. Given-When-Then)
    The goal of testing isn’t to make sure tests always pass no matter what. The goal is to write tests that let us know, reliably and accurately, when our expectations of the code’s behavior differ from reality. Therefore, we should apply as much care to the design, naming, and organization of our tests as we do to our production code.64

    Merely showing people the immediately graspable Arrange-Act-Assert (or Given-When-Then) pattern can be a profound revelation that changes their perspective forever. (The sketch after this list shows the pattern in action.)

  • Interfaces/seams enable composition, dependency breaking w/ test doubles
    Of course, many of us start out in legacy code bases with few tests, if any.65 So we also need to teach how to make safe changes to existing code that enable us to begin improving code quality and adding tests. Michael Feathers’s Working Effectively with Legacy Code is the seminal tome on this subject, showing how to gently break dependencies to introduce seams. “Seams” are points at which we introduce abstract interfaces that enable test doubles to stand in for our dependencies, making tests faster and more reliable.66
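
To make a couple of these ideas concrete, here’s a minimal Python sketch. The names are invented purely for illustration, not drawn from any particular codebase. It shows a seam with dependency injection, a test double standing in for a slow external dependency, and a test named and organized using Arrange-Act-Assert:

```python
from dataclasses import dataclass

# The seam: an abstract interface our code depends on, instead of a
# concrete payment SDK or HTTP client.
class PaymentGateway:
    def charge(self, amount_cents: int) -> bool:
        raise NotImplementedError

@dataclass
class Order:
    total_cents: int
    paid: bool = False

class CheckoutService:
    # Dependency injection: the gateway is passed in, not instantiated here.
    def __init__(self, gateway: PaymentGateway):
        self._gateway = gateway

    def checkout(self, order: Order) -> bool:
        order.paid = self._gateway.charge(order.total_cents)
        return order.paid

# A test double (a stub) standing in for the real gateway.
class AlwaysApprovesGateway(PaymentGateway):
    def charge(self, amount_cents: int) -> bool:
        return True

# The name states the expected behavior, so a failure points straight at
# the violated expectation.
def test_checkout_marks_order_paid_when_charge_succeeds():
    # Arrange (Given)
    service = CheckoutService(gateway=AlwaysApprovesGateway())
    order = Order(total_cents=1299)
    # Act (When)
    result = service.checkout(order)
    # Assert (Then)
    assert result is True
    assert order.paid is True
```

The specifics matter less than the shape: the seam lets the test run in milliseconds with no network or external service, and the descriptive name plus the Arrange-Act-Assert structure make any failure easy to diagnose.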

Speaking of interfaces, Scott Meyers, of Effective C++ fame, gave perhaps the best design advice of all for writing testable, maintainable, understandable code in general:

“Make interfaces easy to use correctly and hard to use incorrectly.”

Scott Meyers, The Most Important Design Guideline?

To propose a slight update to make it more concrete:

“Make interfaces easy to use correctly and hard to use incorrectly—like an electrical outlet.”

—With apologies to Scott Meyers, The Most Important Design Guideline?

Of course, it’s not impossible to misuse an electrical outlet, but it’s a common, wildly successful example that people use correctly most of the time.67 Making software that easy to use or change correctly and as hard to do so incorrectly may not always be possible—but we can always try.
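
To sketch the spirit of that advice in code (a hypothetical API, purely for illustration), compare an interface that invites mistakes with one that makes the correct call the natural one:

```python
from enum import Enum

# Easy to misuse: positional flags and magic strings. At the call site,
# send_alert("disk full", True, False) gives no clue which flag is which,
# and a misspelled channel string fails late, if at all.
def send_alert(message, urgent, retry, channel="email"):
    ...

class Channel(Enum):
    EMAIL = "email"
    SMS = "sms"

# Harder to misuse: keyword-only arguments and an Enum make the call site
# read like the intent, and an invalid channel can't even be constructed.
def send_alert_v2(message: str, *, urgent: bool, channel: Channel) -> None:
    ...

# send_alert_v2("disk full", urgent=True, channel=Channel.SMS)
```

Neither version is foolproof, but like the outlet, the second makes correct usage the path of least resistance.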

Team/Organizational Alignment

Get everyone speaking the same language

Living up to that standard is a lot easier when the people you work with also consider it a priority.68 Here’s how to create the cultural space necessary for people to successfully apply the new insights, skills, and language we’ve discussed.

  • Internal media, roadmap programs, presentations, advocates
    Use internal media like blogs and newsletters to start a conversation around software quality and to start developing a common language. Roadmap programs create a framework for that conversation by outlining specific improvements teams can adopt. Team and organizational presentations can rely on the quality language and roadmap to inspire an audience and make the concepts more memorable. Software quality advocates can then use all these mechanisms to drive progress.

  • Understanding between devs, QA, project managers/owners, executives
    Articulate how everyone plays a role in improving software quality, and get them all communicating with one another! An executive sponsor or project manager may not need to understand the fine details of dependency injection and test doubles. However, if they understand the Test Pyramid, they can hold developers and QA accountable for improving quality by implementing a balanced, reliable, efficient testing strategy.

  • Focus and simplify! (Don’t swallow the elephant—leave some for later!)
    This is a lot of work that will take a long time. Rather than getting overwhelmed or spreading oneself too thin trying to swallow the entire elephant, focus and simplify by delivering one piece at a time.

  • Every earlier success lays a foundation and creates space for future effort
    Delivering the first piece creates more space to deliver the second piece, then the third, and so on.

  • Be agile—make plans, but recalibrate often
    Of course, this process need not be strictly linear. It’s important to be clear about priorities and delivering pieces over time, but make adjustments as everyone gains experience and the conversation unfolds.69

  • Absorb influences like a musician/band—then create your own voice/style
    Ultimately the process is a lot like helping one another grow as musicians. It’s not about everyone doing exactly as they’re told. Everyone should be absorbing ideas, trying them out, gaining experience with them, and ultimately making them their own, part of their individual style. Then everyone can share what they’ve learned and discovered, enriching all of us further by adding their own voice to the ongoing conversation.

Roadmap Programs

Guidelines, language, conversation framework, examples

Let’s examine the value of roadmap programs more closely. They can provide the language and conversational framework for the entire program, as well as other powerful features.

  • Define beginning, middle, and end—break mental barrier of where to start
    One of the most important features a roadmap provides is helping teams focus on getting started. It can overcome the mental barrier of not knowing where to begin by visualizing the beginning, middle, and end of the journey. This helps break “analysis paralysis” by narrowing the options at each stage, and providing a rough order in which to implement them.

  • Align dev, QA, project management, management, executives
    A roadmap can help produce alignment across the various stakeholders in a project, making clear what will be done, by whom, and in what order. Shared language and collective visibility encourage common understanding, open communication, and accountability.

  • Recommend common solutions, but don’t force a prescription
    A good roadmap won’t force specific solutions on every team, but will provide clear guidelines and concrete recommendations. Teams are free to find their own way to satisfy roadmap requirements, but most teams stand to benefit from recommendations based on others’ experience.

  • Give space for conversation and experience to shape the way
    The point of a roadmap isn’t to guarantee a certain outcome, or to constrain variations or growth. It’s to help teams communicate their intentions, coordinate their efforts, and adjust as necessary based on what they learn together throughout the process.

  • Provide a framework to make effort and results visible, including Vital Signs
    A roadmap helps teams focus, align, communicate, learn, and accomplish shared goals by making software quality work and its impact visible. It helps people talk about quality by giving it a local habitation and a name. Then it provides guidelines and recommendations on implementing Vital Signs that make quality efforts and outcomes as tangible as developing and shipping features.

  • Can help other teams learn by example and follow the same path
    Finally, a roadmap that makes software quality work and its results visible to one team can also make it visible to others. This visibility can inspire other teams to follow the same roadmap and learn from one another’s example. Once a critical mass of teams adopts a common roadmap, though the details may differ from team to team, the broader organizational culture evolves.

To continue the musical metaphor, roadmaps should act mainly as lead sheets that outline a tune’s melody and chords, not as note for note transcriptions. They provide a structure for exploring a creative space in harmony with other players, but leave a lot of room for interpretation and creativity. At the same time, studying transcriptions and recordings to learn the details of what worked for others is important to developing one’s own creativity.

Quality Work and Results Visibility

Storytelling is essential to spreading language, leading change.

Finally, great storytelling is essential to providing meaningful insight into quality outcomes and the work necessary to achieve them.70 This can happen throughout the process, but sharing outcomes, methods, and lessons learned is critical to driving adoption of improved practices and making them stick.

  • Media, roadmaps, presentations, events
    In fact, good stories can drive alignment via the same media we discussed earlier. Organizing a special event every so often can generate a critical mass of focus and energy towards sharing stories from across the company. Such events can raise the company wide software quality improvement effort to a new plateau. They also help prove that common principles and practices apply across projects, no matter the tech stack or domain, refuting the Snowflake Fallacy.

  • Make a strong point with a strong narrative arc
    The key to telling a good story is adhering to a strong narrative arc.71 Here are three essential elements:

    • Show the results up front—share your Vital Signs!
      First, don’t bury the lede!72 We’re not trying to hook people on solving a mystery, we’re trying to hook people on the value of what we’re about to share. This holds whether you’ve already achieved compelling outcomes or if you’re still in the middle of the story and haven’t yet achieved your goals. In the latter case, you can still paint a compelling picture of what you’re trying to achieve, and why.

      Either way, having meaningful Vital Signs in place can make telling this part of the story relatively straightforward.

    • Describe the work done to achieve them, and why
      Next, tell them what you had to do (or are trying to do now) to achieve these outcomes and what you learned while doing it. Don’t just give a laundry list of details, however.

      • Practices need principles! The mindset is more portable than the details.
        Practices need principles.73 Help people understand why you applied specific practices—show how they demonstrate the mindset74 that’s ultimately necessary to improve software quality. Technical details can be useful to make the principles concrete, but they’re ultimately of secondary importance to having the right mindset regardless of the technology.
    • Make a call to action to apply the information
      Finally, give people something to do with all the information you just shared. Tell them how they can follow up with you or others, via email or Slack or whatever. Provide links to documentation or other resources where they can learn more about how to apply the same tools and methods on their own.

Building a Software Quality Culture

Cultivating resources to support buy-in

Resources Skills Alignment Visibility
Training    
Documentation    
Internal media (e.g., blogs, newsletters)
Roadmap program  
Vision/strategy presentations  
Mentors/advocates
Internal events

Here we can see how different resources serve to fulfill one or more of the essential needs for organizational change. There’s no specific order in which to build up these resources—it’s up to you to decide where to focus at each stage in your journey.

Mapping the Testing Grouplet and Quality Culture Initiative activities onto this table reveals how the same basic resources apply across vastly different company cultures.

Google Testing Grouplet

2005-2010

Resources Examples
Training Noogler (New Googler) Training, Codelabs
Documentation Internal wiki
Internal media (e.g., blogs, newsletters) Testing on the Toilet
Roadmap program Test Certified
Vision/strategy presentations Google Web Server story
Mentors/advocates Test Mercenaries
Internal events Two Testing Fixits, Revolution Fixit, TAP Fixit

At Google, we provided introductory unit testing training and Codelab sessions to new employees, or “Nooglers.” We made extensive use of the internal wiki, and of course Testing on the Toilet was our breakthrough documentation hit. TotT helped people to participate in the Test Certified program, which was based on the experience of the Google Web Server team. Then we scaled up our efforts by building the Test Mercenaries team and hosting four companywide Fixits over the years.

Apple Quality Culture Initiative

2018-present

Resources Examples
Training 16-course curriculum for dev, QA, Project Managers
Documentation Confluence
Internal media (e.g., blogs, newsletters) Quality Blog, internal podcast
Roadmap program Quality Quest
Vision/strategy presentations QCI Roadshow, official internal presentation series
Mentors/advocates QCI Ambassadors
Internal events QCI Summit

At Apple, we knew that posting flyers in Apple Park bathrooms wouldn’t fly, but our extensive Training curriculum was wildly successful. We also made extensive use of our internal Confluence wiki, maintained a Quality Blog, and had a ton of fun producing our own internal podcast. Quality Quest was directly inspired by Test Certified, but adapted by the QCI community to better serve Apple’s needs.75 We promoted our resources via dozens of QCI Roadshow presentations for specific teams and groups, as well as a few official, high visibility internal presentations. We recruited QCI Ambassadors from different organizations to help translate general QCI resources and principles to fit the needs of specific orgs. Finally, we organized a QCI Summit to promote software quality stories from across the company, demonstrating how the Quality Mindset applies regardless of domain.

This comparison raises an important point that I’ve made in response to a common question:

“What are the [important] differences between companies?”

People often ask “What are the differences between companies?” with the assumption or implication that the differences are of key importance. Reflecting upon this just before leaving Apple, I realized…

The superficial details may differ…

…of course there are obvious differences. Google’s internal culture was much more open by default, and people back in the day had twenty percent of their time to experiment internally. Apple’s internal culture isn’t quite as open, and people are held accountable to tight deadlines. Even so…

…but they’re more alike than different.

…the companies are more alike than they might first seem. Both are large organizations composed of the same stuff, namely humans striving for both individual and collective achievement. Much like code from different projects, at the molecular level, they’re more alike than they are different.

Over time, I’ve come to appreciate these similarities as being ultimately more important than the differences. The same essential issues emerge, and the same essential solutions apply, differing only in their surface appearances. So no matter what project you’re on, or what company you work for, everybody everywhere is dealing with the same core problems. Not even the biggest of companies is immune, or otherwise special or perfect.

On the bright side, no matter what project you’re on, or what company you work for…

Calls to Action

…you can do something about these problems.76 The resources for dealing with them are just as available to you as they are to any other company, given the right mindset.

I’ve already listed several technical and organizational concepts comprising the substance of the changes you may need to make. These are important, but relatively straightforward to grasp. The harder part of the problem isn’t getting people to pay attention to these concepts and understand them; it’s getting them to act on them. The most important skill you’ll need to make that happen…

Learn about leadership!

The biggest challenge isn’t technical—it’s changing the mindset.

…is leadership.

Many technical people think they’re above this fuzzy “people stuff,” that driving improvements is all about data and logic and meritocracy. But look where that’s gotten us as an industry with regard to software quality and avoidable damage done to society. The purely technological mindset, and its lack of appreciation for “people stuff,” is why good practices like those we’ve discussed so often fail to spread.

Leadership is also an eminently transferable skill, highly valuable and useful no matter where you find yourself during your career. It has nothing to do with the title you happen to hold, but with how you conduct yourself to achieve alignment with others. It’s a vast topic that you can study for life, but here are a few of my favorite starting points at the moment:

  • John “Add Value to People” Maxwell, The 5 Levels of Leadership
    John Maxwell’s personal mission is to “add value to people,” which is my favorite short definition of “leadership.” His book The 5 Levels of Leadership clearly illustrates how leadership transcends title and decision making authoritah, and what’s required to realize outstanding leadership potential.

  • L. David Marquet, Leadership is Language
    David Marquet’s Leadership is Language is a playbook highlighting how to replace Industrial Revolution era communication habits with more empowering and productive modern habits. It illustrates how to stop merely telling people what to think and do, and how to encourage everyone to grow as decision makers and leaders. The results can literally make the difference between life and death.

  • Liz Wiseman, Multipliers
    In a similar vein, Liz Wiseman’s Multipliers illustrates the distinction between “Diminishers” that drain intelligence and energy from organizations and “Multipliers” that amplify people’s capabilities. It catalogs and contrasts the behaviors of each, encouraging explicit awareness of our own tendencies, weaknesses, and strengths.

  • Scott Miller and David Morey, The Leadership Campaign
    Miller and Morey’s The Leadership Campaign is a guide to the dynamics of stepping forward to lead a movement, modeled directly on political campaigning. The focus is on clarity of messaging and organization, building momentum, and taking advantage of opportunities.

  • Michael Bungay Stanier, The Coaching Habit
    Michael Bungay Stanier’s The Coaching Habit is another book focused on language, specifically when it comes to leading individuals to think through their own challenges. It talks about how to “tame your Advice Monster” and focus on helping people develop their own solutions and capabilities.

  • Ken Blanchard, et al., Leadership and the One-Minute Manager
    Leadership and the One-Minute Manager describes Blanchard’s “Situational Leadership” model. This model describes the need to adapt one’s leadership style to each individual over time based on their current capabilities.

Now remember from Crossing the Chasm that it falls on the Instigators to lead adoption of technologies and practices across an organization. As an Instigator, one of the lessons you’ll learn is that, sometimes…

Instigator Theory

It’s easier to change the rest of the world than your own team.

…it’s easier to change the rest of the world than your own team. I call this phenomenon Instigator Theory. However, as frustrating as this is, and as long as it takes to overcome, the basic outline of what you need to do is straightforward:

  • Phase One: Connect with—or build—a community of fellow Instigators
    First, find your people. Put your feelers out. Invite folks to coffee, then start organizing informal meetings, and send out open invitations. See who really cares about software quality and is willing to show up to do something about it.

  • Phase Two: Develop resources and do the work
    Next, employ your leadership skills and challenge the community to develop resources for helping individuals acquire new skills and teams align on quality practices.

    • Focus, simplify, and take your time in the beginning
      As mentioned before, there’s no need to rush, and take care not to spread yourselves too thin.

    • Every earlier success lays a foundation and creates space for future effort
      In time, every win you deliver will draw more people into the community, which then creates the capacity to deliver the next win.

  • Phase Three: Share the results—make the work and its impact visible
    As your community begins delivering resources, and people put those resources to good use, make all that work and its impact visible early and often. Radiate the good work you’re doing and its results into the environment as much as you can. And as part of generating that radiation…

    • Recognize the value of one another’s contributions!
      …make sure to recognize the value that each member of the community adds to the effort! This is long term work that’s often thankless, as the focus for most of the organization remains on doing business as usual. Recognizing everyone’s value is a big part of keeping up morale and momentum, and makes that value visible to others in the organization as well.

Finally, I’d like to leave you with a concrete list of things you and your fellow Instigators can work to change.

Where we are → Where we’d like to go
Slow, unreliable, expensive processes → Fast, reliable, efficient feedback loops
Lots of duplicated, complex code → Well-factored, readable, testable code
Large, complex, monolithic code reviews → Small, digestible, easily reviewable changes
Large, complex, flaky test suites → Balanced, stable Test Pyramid-based suites
Expensive metrics people can’t act upon → Meaningful, useful Vital Signs taken seriously
Reinventing the wheel in wasteful silos → Sharing stories, language, useful examples
Complexity, risk, waste, and suffering → Clarity, confidence, efficiency, and delight
Testing (only) like a user would → Testing like a user’s life depends on it!

Let’s compare where many of us are today, without good quality practices in place, to where we’d like everyone to go.77

  • Ultimately, we want to replace painful, expensive processes with fast, reliable, and efficient feedback loops. We can start to do that by…

  • …rejecting duplication and excess complexity in our code, and by writing readable, testable code instead.

  • We can reject large, monolithic code reviews hiding lots of bugs and insist upon smaller, more reviewable changes.

  • We can reduce the size, complexity, and unreliability of existing test suites by evolving towards a balanced, reliable suite based on Test Pyramid concepts.

  • We can throw out meaningless metrics that are expensive and painful to collect in favor of meaningful, actionable, and relatively cheap Vital Signs.

  • We can stop wasting resources on having teams wrestle with common quality problems separately, and help one another by sharing stories, language, and working examples.

  • Improving our software quality can minimize complexity, risk, waste, and suffering, and the increased understanding it affords will yield clarity, confidence, efficiency, and delight.

  • Once freed from the mental trap of testing only like a user would, we can begin testing like a user’s life depends on it.

Ultimately, creating great, high quality software shouldn’t require heroics, sacrifice, or endless pursuit of technologies or resources.

Making software quality visible will…
start a Chain Reaction that will…
minimize suffering—and ultimately…
Make the right thing the easy thing!

Ensuring that everyone can see what high-quality software work looks like helps create the conditions for a positive Chain Reaction. As principles and practices spread, and priorities align around quality, we’ll see suffering subside as we keep making the right thing easier! Then maybe one day, it’ll be so easy that everyone can’t help but do the right thing by default.

Thank you!

Acknowledgments

I appreciate all the folks who’ve contributed to this presentation!

Ono Vaticone, Microsoft
John Turek, Aetion
Isaac Truett, EAB
Chris Douglas, AARP
Jake Spracher
Oleksiy Shepetko, Microsoft
Alex Buccino, Squarespace

And my fellow QCI Instigators at Apple for your past wisdom—
you know who you are!
(And you know that I know who you are!)

History

2023-01-12: Presented to Aetion at the invitation of John Turek, a former Google colleague.

2023-01-17: Presented to Microsoft at the invitation of Ono Vaticone, a former Apple colleague and Quality Culture Initiative member.

TODOs

Here are some items I’m still thinking about adding to the script, most likely as footnotes:

Footnotes

  1. Joel Schwartzberg’s Get to the Point! Sharpen Your Message and Make Your Words Matter inspired me to articulate this clear, concise point up front. 

  2. David Marchese’s interview with Cal Newport for the New York Times on 2023-01-23, The Digital Workplace Is Designed to Bring You Down, bears mentioning here. Newport notes that with the rise of “knowledge work”, “we fell back to a proxy for productivity, which is visible activity.” Then:

    “Visible activity as a proxy for productivity spiraled out of control and led to this culture of exhaustion, of I’m working all the time, I’m context shifting all over the place, most of my work feels performative, it’s not even that useful.”

    He also noted Peter Drucker’s coining of the term “knowledge work” in 1959 and the consequences for management:

    “So Drucker is saying that knowledge workers need to manage themselves. Managers just need to set them up to succeed. But then what do you manage? Visible activity as a proxy for productivity was the solution. We need something we can focus on day to day and feel that we’re having a role in pushing work: Let’s just manage visible activity. It’s this compromise that held the pieces together in an imperfect way, and then in the last 20 years, this centrifuge of digital-accelerated work blew it apart. The compromise is now failing.”

    So there is a danger that trying to make work visible could dissolve into productivity theatre. At the same time, Newport unpacks his concept of “slow productivity,” the topic of his next book [emphasis mine]:

    “So how do you actually work with your mind and create things of value? What I’ve identified is three principles: doing fewer things, working at a natural pace,9 but obsessing over quality. That trio of properties better hits the sweet spot of how we’re actually wired and produces valuable meaningful work, but it’s sustainable.”

    9 Meaning one with more variability in intensity than the always-on pace to which we’ve become accustomed.

    This presentation walks the line between making visible the aspects of our work that truly speak to software quality, and superficial displays of productivity. People often want to jump straight to solutions, and start generating performative “data” to prove their value. In doing so, they fail to grasp the underlying issues and end up continuing the negative cycle of increasing effort yielding decreasing quality.

    We first need to help people get a handle on the issues and understand what we need to accomplish. This is why this talk makes the case for software quality and illustrates its obstacles before discussing solutions. It’s also why the solutions offered are rudimentary guidelines and techniques for inviting nuanced discussion and developing shared understanding that grows over time.

    The punchline being, in the end, improving software quality is about leadership far more than it is about technology. Leadership requires helping people clearly see principles in action and getting results, so that they may learn from the example and achieve similar success. Hence, though making quality work visible may remain an imperfect practice involving trade-offs and compromises, it’s essential to improving software quality broadly across organizations. 

  3. Also see my blog post “Coding and Testing at Google, 2006 vs. 2011.” 

  4. Googlers: My Percent score was over 90% when I left Google in September 2011

  5. Thanks in large part to the current TotT coordinator, Andrew Trenk.

  6. This included Jake Spracher and Kirk Russell, who both left Apple before I did. The other folks are still at Apple, and therefore I’ll leave them anonymous for now. 

  7. He mentions the fact that cruft and technical debt are basically the same in a sidebar, but it’s not on his graph. 

  8. In my talk Automated Testing—Why Bother?, I go into several more reasons why automated testing helps developers understand the system, particularly when responding to failures. These include better managing the focusing illusion, the Zeigarnik effect, the orienting response, and the OODA Loop. (I learned about all of these except for the OODA Loop from Dr. Robert Cialdini’s Pre-Suasion: A Revolutionary Way to Influence and Persuade.) 

  9. This concept of a “buffer” comes from Greg McKeown’s Essentialism: The Disciplined Pursuit of Less.

  10. This seems somewhat ironic, since he invited me to publish Goto Fail, Heartbleed, and Unit Testing Culture on his website in 2014. It doesn’t focus only on the professionalism angle, but it emphasizes it heavily. He published the “Cost” article in 2019, reflecting an apparent evolution in his thinking.

    I’m not criticizing Martin, or his argument—I’m rather grateful he came up with this brilliant angle, and explained it so thoughtfully and clearly. It’s incredibly helpful to move the conversation forward. I’m just not willing to abandon the “moralistic” appeal to professionalism, either. We need both.

    In fact, I’d claim that a sense of professionalism necessarily precedes sound economic arguments in general. Raw economics doesn’t care about professionalism, but pragmatic professionals have to find a way to align the economics with their professional standards. That’s exactly what Martin did with this article.

    Also, though he didn’t explicitly state this, it’s possible he meant “professionalism” in terms of “quality for its own sake” or “pride in one’s work.” Whereas the angle in my article, and in the slides to follow, is “professionalism” in terms of social responsibility, which also has an economic impact. I do believe in quality for its own sake and having pride in one’s work, but that’s not the appeal I tend to make, either.

    All of this said, Robert Greene’s The 48 Laws of Power advises (emphasis mine):

    Law 13: When asking for help, appeal to people’s self-interest, never to their mercy or gratitude

    Though that title speaks specifically about gaining someone’s favor, the general principle of appealing to someone’s self-interest to motivate their behaviors holds. That said, the “Reversal” section at the end of the chapter on Law 13 states:

    You must distinguish the differences among powerful people and figure out what makes them tick. When they ooze greed, do not appeal to their charity. When they want to look charitable and noble, do not appeal to their greed.

    My interpretation of this principle in this context is: Don’t go all in on either the economic argument or appeals to professionalism. Use both, and presented well, I think they serve to reinforce one another.

    So while I understand why Martin has taken the position he has, I’m slightly saddened by it. Or, if he’s responding to certain aspects of professionalism without distinguishing from the others, I’m only sad that he was uncharacteristically unclear on that point. Morals aren’t the only concern, but neither should economics be—alignment between them, rather than abandonment of one for the other, yields the best outcomes. 

  11. Automated Testing—Why Bother? examines a few more reasons. It includes this quote from The Rainbow of Death:

    People mostly had no experience with testing outside of the slowness and brittleness of the status quo, and were under constant delivery pressure while feeling intimidated by many of their peers. Who could blame them for not testing when they couldn’t afford the time to learn?

    Economists call this “temporal discounting”. Basically, if someone’s presented with an option to push a feature now without tests, or prevent a problem in the future (that may or may not happen) through an investment in testing, they’ll tend to ship and hope for the best. Combined with the fact that the ever-slowing tools made it impossible to reach a state of flow, this combination of immediate pain and slow feedback in pursuit of a distant, unclear benefit made the “right” thing way harder than it needed to be.

  12. Shortly after joining one team, I presented to my teammates my vision for improved testing adoption across the company and what it would take. One of my teammates said to me in this meeting, “…but unit testing is easy!” Caught off guard, my immediate impulse—which I didn’t catch in time—was to laugh out loud at this statement. I immediately apologized and explained that, yes, it isn’t that hard once you get used to it—but many haven’t yet learned good basic techniques. (I cover this a little later in this talk.)

    Of course, my apology meant nothing—the damage was done. This teammate and I never ended up really seeing eye to eye. Per the Crossing the Chasm model covered later in “Building a Software Quality Culture,” I moved on rather than continuing to engage with this Laggard. 

  13. Automated Testing—Why Bother? also mentions relevant reasons from the social psychology research of Dr. Robert Cialdini. From that talk, in reference to Cialdini’s Influence: The Psychology of Persuasion and Pre-Suasion: A Revolutionary Way to Influence and Persuade:

    • Social proof: We follow established norms that we perceive in the behavior of others
    • Authority (Authoritah): We permit others to set our priorities and do as we’re told
    • Scarcity: We act out of the fear of a closing opportunity window
    • Unity: We act in the perceived best interest of others with whom we have a relationship

    So if you’re on a team where testing isn’t the norm, and your boss is expecting you to meet a deadline—especially if your feature is critical to the success of the project, and/or you know you have a promotion at stake—you aren’t likely to write automated tests if you haven’t written any before. Whether you feel like writing tests would leave you vulnerable to the wrath of your manager or that of your team, or you haven’t had any interest to begin with, these forces have the effect of reinforcing the status quo.

  14. I have to admit, this rant was inspired by coming across Tim Bray’s Testing in the Twenties. (I found it by way of Martin Fowler’s On the Diverse And Fantastical Shapes of Testing, which I cite in a later Test Pyramid footnote.) I strongly agree with the article for the most part (especially the “Coverage data” section), but it shits the bed with the “No religion” comments. I even agree with the main points contained in those comments. However, setting them up in opposition to “religion,” “ideology,” “pedantic arm-waving,” “TDD/BDD faith,” etc., brings an unnecessarily negative emotional charge to the argument. It would be much stronger, and more effective, without them.

    Note that Bray’s article is strongly in favor of developers writing effective automated tests. That said, painting people who talk about test doubles and practice TDD as belonging to an irrational tribe (while implying one’s own superiority) is harmful. I’m sorely disappointed that this otherwise magnificent barrel full of wine contains this spoonful of sewage. (A saying I got from the “A Spoonful of Sewage” chapter of Beautiful Code.) 

  15. I first learned about this concept from an Apple internal essay on the topic. 

  16. The full title of chapter six is Chapter VI: An Accident Rooted in History. The data comes from [129-131] Figure 2. O-Ring Anomalies Compared with Joint Temperature and Leak Check Pressure. It lists 25 Space Shuttle launches, ending with STS 51-L. It indicates O-ring anomalies (erosion or blow-by) in 17 of the 24 launches (70%) prior to STS 51-L. In the 17 missions prior, starting with STS 41-B on 1984-02-03, there were 14 anomalies (82%). 

  17. From Chapter VII: The Silent Safety Program., excerpts from “Trend Data” [155-156]:

    As previously noted, the history of problems with the Solid Rocket Booster O-ring took an abrupt turn in January, 1984, when an ominous trend began. Until that date, only one field joint O-ring anomaly had been found during the first nine flights of the Shuttle. Beginning with the tenth mission, however, and concluding with the twenty-fifth, the Challenger flight, more than half of the missions experienced field joint O-ring blow-by or erosion of some kind….

    This striking change in performance should have been observed and perhaps traced to a root cause. No such trend analysis was conducted. While flight anomalies involving the O-rings received considerable attention at Morton Thiokol and at Marshall, the significance of the developing trend went unnoticed. The safety, reliability and quality assurance program, of course, exists to ensure that such trends are recognized when they occur….

    Not recognizing and reporting this trend can only be described, in NASA terms, as a “quality escape,” a failure of the program to preclude an avoidable problem. If the program had functioned properly, the Challenger accident might have been avoided.

  18. The NASA Office of Safety & Mission Assurance site has other interesting artifacts, including:

    This latter artifact is a powerfully concise distillation of lessons from the Rogers report. A couple of excerpts:

    Pre-Launch

    • Launch day temperatures as low as 22 °F at Kennedy Space Center.
    • Thiokol engineers had concerns about launching due to the effect of low temperature on O-rings.
    • NASA Program personnel pressured Thiokol to agree to the launch.

    Lessons Learned

    • We cannot become complacent.
    • We cannot be silent when we see something we feel is unsafe.
    • We must allow people to come forward with their concerns without fear of repercussion.

  19. If you check out the Wilcutt and Bell presentation, and follow the “Symptoms of Groupthink” Geocities link, do not click on anything on that page. It’s long since been hacked. 

  20. In Automated Testing—Why Bother?, I define automated testing as: “The practice of writing programs to verify that our code and systems conform to expectations—i.e. that they fulfill requirements and make no incorrect assumptions.” 

  21. A further thought: Trusting tools like compilers to faithfully translate high-level code to machine code, and to optimize it, is one thing. Compilers are largely deterministic and relatively well understood. AI models are quite another, far more inscrutable, far less trustworthy instrument.

    Another thought: In David Marquet’s short talk on “Greatness”, he explains what he calls “the two pillars of giving control:”

    1. Technical Competence: Is it safe?
    2. Organizational Clarity: Is it the right thing to do?

    Maybe one day we’ll trust AI with the first question. I’m not so sure we’ll ever be able to trust it with the second. 

  22. Feynman’s entire appendix is worth a read, but here’s another striking passage foreshadowing Wilcutt and Bell’s “lack of bad outcomes” assertion:

    There are several references to flights that had gone before. The acceptance and success of these flights is taken as evidence of safety. But erosion and blow-by are not what the design expected. They are warnings that something is wrong. The equipment is not operating as expected, and therefore there is a danger that it can operate with even wider deviations in this unexpected and not thoroughly understood way. The fact that this danger did not lead to a catastrophe before is no guarantee that it will not the next time, unless it is completely understood. When playing Russian roulette the fact that the first shot got off safely is little comfort for the next.

  23. I learned this principle from an Apple internal essay. 

  24. Like many, I learned many of these aspects of the power of asking good questions from Michael Bungay Stanier’s The Coaching Habit.

    Also, thanks to Wolfgang Trumler for reminding me of the power of asking people what to do versus telling them what to do. 

  25. This insight was inspired by discussion of the Situational Leadership II® model described in Ken Blanchard’s Leadership and the One Minute Manager

  26. My former Google colleague Alex Buccino made a good point during a conversation on 2023-02-01 about what “delight” means to a certain class of programmers. He noted that it often entails building the software equivalent of a Rube Goldberg machine for the lulz—or witnessing one built by another. I agreed with him that, maybe for that class of programmers, we should just focus on “clarity” and “efficiency”—which necessarily excludes development of such contraptions. 

  27. This speaks to one of my favorite concepts from Saul Alinsky’s Rules for Radicals, which I paraphrased in Automated Testing—Why Bother?:

    If people believe they lack the knowledge and power to solve a problem, they won’t even think of trying to solve it.

  28. This idea was also inspired by Immunity to Change by Robert Kegan and Lisa Lahey.

    Actually…I’ve yet to start the book at the time I’m writing this sentence. I learned about it from the article Harvard expert on the worst thing about New Year’s resolutions—and how to beat it: ‘A profound loss of energy’ (CNBC, 2022-12-31). That article quotes Lahey’s four-step process to “breaking our resistance to change”:

    1. Identify your actual improvement goal, and what you’d need to do differently to achieve it.
    2. Look at your current behaviors that work against your goal.
    3. Identify your hidden competing commitments.
    4. Identify big assumptions about how the world works that drive your resistance to change.

    Assumptions, until identified, are essentially unspoken or unconscious beliefs. 

  29. In the interview from an earlier footnote, The Digital Workplace Is Designed to Bring You Down, Cal Newport makes a relevant observation to this point:

    “If we look through the history of the intersection of technology and commerce, we always see something similar, which is: When disruptive technology comes in, it takes a long time to figure out the best way to use it. There’s this case study5 from a Stanford economist about the introduction of the electric motor into the factory. He characterizes how long it takes before we figure out what was in hindsight the obvious way to use electric motors in factories, which was to put a small motor in every piece of equipment so I can control my equipment at exactly the level I want to use it. We didn’t do that for 20 or 30 years.”

    5 Paul A. David’s “Computer and Dynamo: The Modern Productivity Paradox in a Not-Too-Distant Mirror,” published in 1989.

    In other words, known solutions still take time to sink in and become so obvious and easy to use that they become common practice. So maybe we’re approaching the tipping point as I write this sentence on January 23, 2023, as unit testing is over 30 years old. 

  30. In Software at Scale 53 - Testing Culture with Mike Bland, I discuss with Utsav Shah why effective testing practices haven’t yet caught on everywhere. It seems part of the human condition is that wisdom passed down through the ages still requires that individuals seek it out. Good examples and teachers can help, but those aren’t always accessible to everyone, at least not without some self-motivated effort to find them.

    By way of analogy, I mentioned just having read the Bhagavad Gita, which on the surface is mortifying by today’s standards. The warrior Arjuna doesn’t want to go to war against his own family, and the supreme being, Krishna, convinces him that it’s his duty. However, read as only a metaphor for profound internal conflict and doubt, which was accessible to the audience of the day, the message is reassuring. But it takes a willfully open mind to derive such value.

    On top of that, one of the main lessons is that one should feel attached to doing one’s work—but not to the outcomes. This is a pretty common theme, also running through The Daily Stoic, which I also recently finished. Other traditions, notably Buddhism and Taoism, also teach detachment from outcomes and other things beyond one’s control generally.

    However, despite this message being developed in multiple ancient cultures and spreading throughout history, tradition, and literature, people still struggle with attachment to this day. The essence of such wisdom isn’t necessarily complicated, but it’s often obscured by other natural preoccupations of both individuals and cultures.

    This doesn’t contradict Cal Newport’s observation above on the time it takes for organizations to assimilate new technologies. It perhaps helps explain, at least in part, why it takes so long. 

  31. The Egg of Columbus parable is my favorite illustration of this principle. 

  32. Thanks to Scott Boyd for reminding me to emphasize the Test Pyramid as a key component of the testing conversation. 

  33. Reproducing my footnote from Automated Testing—Why Bother?: Nick Lesiecki drew the original testing pyramid in 2005. No idea if there was prior art, but he didn’t consult it.

    The pyramid was later popularized by Mike Cohn in The Forgotten Layer of the Test Automation Pyramid (2009) and Succeeding with Agile: Software Development Using Scrum (2009). Not sure if Mike had seen the Noogler lecture slide or had independently conceived of the idea, but he definitely was a visitor at Google at the time I was there. 

  34. The Testing Grouplet introduced the Small, Medium, Large nomenclature as an alternative to “unit,” “integration,” “system,” etc. This was because, at Google in 2005, a “unit” test was understood to be any test lasting less than five minutes. Anything longer was considered a “regression” test. By introducing new, more intuitive nomenclature, we inspired productive conversations by rigorously defining the criteria for each term, in terms of scope, dependencies, and resources.

    The Bazel Test Encyclopedia and Bazel Common definitions use these terms to define maximum timeouts for tests labeled with each size. Neither document speaks to the specifics of scope or dependencies, but they do mention “assumed peak local resource usages.” 

  35. Some have advocated for a different metaphor, like the “Testing Trophy” and so on, or for no metaphor at all. I understand the concern that the Test Pyramid may seem overly simplistic, or potentially misleading should people infer “one true test size proportion” from it. I also understand Martin Fowler’s concerns from On the Diverse And Fantastical Shapes of Testing, which essentially argues for using “Sociable vs. Solitary” tests. His preference rests upon the relative ambiguity of the terms “unit” and “integration” tests.

    However, I feel this overcomplicates the issue while missing the point. Many people, even with years of experience in software, still think of testing as a monolithic practice. Many still consider it “common sense” that testing shouldn’t be done by the people writing the code. As mentioned earlier, many still think “testing like a user would” is “most important.” Such simplistic, unsophisticated perspectives tend to be resistant to nuance. People holding them need clearer guidance into a deeper understanding of the topic.

    The Test Pyramid metaphor (with test sizes) is accessible for such people, who just haven’t been exposed to a nonmonolithic perspective on testing. It honors the fact that we were all beginners once (and still are in areas to which we’ve not yet been exposed). Once people have grasped the essential principles from the Test Pyramid model, it becomes much easier to have a productive conversation about effective testing strategy. Then it becomes easier and more productive to discuss sociable vs. solitary testing, the right balance of test sizes for a specific project, etc. 

  36. The “confidence” concept in the context of the Test Pyramid was hammered out by Nick Lesiecki, Patrick Doyle, and Dominic Cooney in 2009. 

  37. Thanks to Oleksiy Shepetko for mentioning the maintenance cost aspect during my 2023-01-17 presentation to Ono Vaticone’s group at Microsoft. It wasn’t in the table at that time, and adding it afterward inspired this new, broad, comprehensive table layout. 

  38. Test doubles are lightweight, controllable objects implementing the same interface as a production dependency. This enables the test author to isolate the code under test and control its environment very precisely via dependency injection. (“Dependency injection” is a fancy term for passing an object encapsulating a dependency into code that uses it as a constructor or function argument. This replaces the need for the code instantiating or accessing the dependency directly.)

    I also defined them on slide 49 of Automated Testing—Why Bother?:

    Test doubles are substitutes for more complex objects in an automated test. They are easier to set up, easier to control, and often make tests much faster thanks to the fact that they do not have the same dependencies as real production objects.

    The various kinds of test doubles are:

    • Dummy: A placeholder value with no bearing on the test other than enabling the code to compile
    • Stub: An object programmed to return a hardcoded or trivially computed value
    • Spy: A stub that can remember how many times it was called and with which arguments
    • Mock: An object that can be programmed to validate expected calls in a specific order, as well as return specific values
    • Fake: An object that fully simulates a production dependency using a less complicated and faster implementation (e.g., in memory database or file system, local HTTP server)

    People often call all test doubles “mocks,” and packages making it easy to implement test doubles are often called “mocking libraries.” This is unfortunate, as mocks should be the last kind of test double one chooses.

    Mocks can validate expected side effects (i.e., behaviors not reflected in return values or easily observable environmental changes) that other test doubles can’t. However, this binds them to implementation details that can render tests brittle in the face of implementation changes. Tests that overuse mocks in this way are often cited as a reason why people find test doubles and unit testing painful.
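
    As a brief, hypothetical sketch of that trade-off (the names here are invented purely for the example):

```python
from unittest.mock import Mock

class UserStore:
    """Interface for a production data store."""
    def save(self, user_id: str, name: str) -> None:
        raise NotImplementedError

class FakeUserStore(UserStore):
    """A fake: an in-memory stand-in implementing the same interface."""
    def __init__(self):
        self.users = {}

    def save(self, user_id, name):
        self.users[user_id] = name

def register(store: UserStore, user_id: str, name: str) -> None:
    store.save(user_id, name.strip())

def test_register_saves_normalized_name_using_a_fake():
    store = FakeUserStore()
    register(store, "42", "  Ada  ")
    # Asserts on observable state; refactoring register() won't break this
    # test as long as the behavior stays correct.
    assert store.users["42"] == "Ada"

def test_register_saves_normalized_name_using_a_mock():
    store = Mock(spec=UserStore)
    register(store, "42", "  Ada  ")
    # Asserts on the exact interaction; changing how register() talks to
    # the store (say, batching writes) breaks this test even if the
    # observable behavior is still correct.
    store.save.assert_called_once_with("42", "Ada")
```

    Both tests pass today, but only the mock-based one is coupled to the current implementation details of register().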

    My favorite concrete, physical example of using a test double is using a practice amplifier to practice electric guitar:

    • You can practice in relative quiet, using something even as small as a Marshall micro amp, a Blackstar Fly 3, or a Mustang micro. You do this before playing with others or getting on stage to make sure you’ve got your own parts down. If anything sounds bad, you know it’s all your fault.

      This is analogous to writing small tests, with the practice amp as the test double. You’re figuring out in near real time if your own performance meets expectations or needs fixing—without bothering anyone else.

    • You can rehearse with your band with a larger, louder amplifier, like a Marshall DSL40CR or Fender Mustang GTX100. This enables you to work out issues with other players before getting on stage. If you’ve practiced your parts enough and something sounds bad at this point, you know something’s wrong with the band dynamic.

      This is analogous to writing medium tests, with the slightly larger amp still acting as a test double. You’re figuring out with your bandmates specific issues arising from working through the material together. You can start, stop, and repeat as often as necessary without burdening the audience.

    • You and the band can then run through a soundcheck on stage, making sure everything sounds good together while plugging into your Marshall stacks. Everyone else will be using their production gear, the full sound system, and the lighting rig, in the actual performance space. If the band is well rehearsed but something sounds wrong at this level, you know it’s specific to the integration of the entire system.

      This is analogous to writing large tests. You’re using the real production dependencies and running the entire production system. However, this is still happening before the actual performance, giving you a chance to detect and resolve showstopper issues before the performance.

    • Finally, you play in front of the audience. Things can still go wrong, and you’ll have to adapt in the moment and discuss afterwards how to prevent repeat issues. However, after all the practicing, rehearsals, and soundchecks, relatively few things could still go wrong, and are likely unique to actual performance situations.

      This is analogous to shipping to production. You can’t expect perfection, and you may discover new issues not uncovered by previous practicing, rehearsing, and testing. However, you can focus on those relatively few remaining issues, since so many were prevented or resolved before this point.

    Of course, there are more options than this. There’s nothing saying you couldn’t use any of these amplifiers in any other situation—you could use, say, the Fender Mustang GTX100 for everything. It can even plug directly into the mixing deck and emulate a mic’d cabinet. But hopefully the point of the analogy remains clear: The common interface gives you the freedom to swap implementations as you see fit.

    The only question is, what kind of “test double” is a practice amplifier? Based on the definitions above, my money’s on calling it a “fake.” It’s a lighter weight implementation of the full production dependency, with the exact same interface, but without an interface for preprogramming responses.

    (I used images of Marshall stacks vs. a Marshall micro amp on slide 49 of Automated Testing—Why Bother?, but didn’t write them into the narrative.) 

  39. Shoutout to Simon Stewart for being a vocal advocate of shorter feedback loops. See his Dopamine Driven Development presentation. 

  40. Shoutout to Francisco Candalija for bringing contract and collaboration tests to my attention. He influenced how I now think and talk about medium/integration tests and my own “internal API” concept. (Some of the below I also described in an email to my former Google colleague Alex Buccino on 2022-12-23.)

    Contract tests essentially answer the question: “Did something change that’s beyond my control, or did I screw something up?”

    I like thinking of contract tests in this way rather than how Pact defines them, even though the Pact definition is very popular. Writing a contract test quickly using a special tool and calling it a day can provide a false sense of confidence. Such tests are prone to become brittle and flaky if one doesn’t consider how they support the overall architecture and testing strategy.

    An “internal API” is a wrapper that’s kind of a superset of Proxy and Adapter from Design Patterns. It’s an interface you design within your project that translates an external (or complicated internal) dependency’s language and semantics into your own custom version. Using your own interface insulates the rest of your code from directly depending on the dependency’s interface.

    One very common example is creating your own Database object that exposes your own “ideal” Database API to the rest of your app. This object encapsulates all SQL queries, external database API calls, logging, error handling, and retry mechanisms, etc. in a single location. This obviates the need to pepper these details throughout your own code.

    What this means is:

    • The internal API introduces a seam enabling you to write many more fast, stable, small tests for your application via dependency injection and test doubles. (Michael Feathers introduced the term “seam” in Working Effectively with Legacy Code.) This makes the code and the tests easier to write and to maintain, since all the tests no longer become integration tests by default.
    • You do still need to test your API implementation against the real dependency—but now you have only one object to test using a medium/integration test. This would be your contract test.
    • Any integration problems with a particular dependency are detected by one test, rather than triggering failures across the entire suite. This improves the signal to noise ratio while tightening the feedback loop, making it faster and easier to diagnose and repair the issue.
    • The contract test makes sure any test doubles based on the same interface as the internal API wrapper are faithful to production. If a contract test fails in a way that invalidates your internal API, you’ll know to update your API and test doubles based on it.
    • If you want to upgrade or even replace a dependency, you have one implementation to update, not multiple places throughout the code. This protects your system against revision or vendor lock-in.
    • In fact, you can add an entirely new class implementing the same interface and configure which implementation to use at runtime. This makes it easy and safe to try the old and new implementations without major surgery or risk.

    For all these reasons, combining internal APIs with contract tests makes your test suite faster, more reliable, and easier to maintain.

    A concrete example: Like many languages, Python provides a common DBAPI. This enables you to use a local, in-memory database (typically SQLite) to fake (i.e., stand in for) a production database.

    I did this not long ago for some Python code that threw a DBAPI error in production every few days, locking up our server fleet:

    • Though we used Postgres in prod, I could simulate the same DBAPI error in a test on my desk by using the standard sqlite3 module.

    • I reproduced the bug, in which the system didn’t abort a failed transaction due to a dropped connection, blocking further operations.

      I wouldn’t call the test “small” or “medium,” but “small-ish.” It was as small a contract test as you could get, and while it wasn’t super fast, it was quite quick.

    • I fixed the bug—and the test—by introducing a Database abstraction that implemented a rollback/reconnect/retry mechanism. The relatively small size, low complexity, and quick speed of the test enabled me to iterate quickly on the solution.

      (I also set a one hour timeout on database connections. This alone might’ve resolved the problem, but it was worth adding the new abstraction that provably resolved the problem.)

    • I shipped the fix—and bye bye production error! I kept monitoring the logs and never saw it happen after that.

    This contract test enabled me to define an internal Database API based on the Python DBAPI. The DBAPI ensures that the Database API can be reused—and tested—with different databases that conform to its specifications. The rest of our code, now using the new Database object, could be tested more quickly using test doubles. So long as the contract test passes, the test doubles should remain faithful substitutes. And if we wanted to switch from Postgres to another production database, likely none of our code would’ve had to change.
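
    To make that concrete, here is a minimal sketch of what such a contract test might look like, reusing the hypothetical AppDatabase wrapper sketched above and sqlite3’s in-memory database. (This is an illustration of the shape of the test, not the actual test I wrote.)

      import sqlite3
      import unittest

      # AppDatabase is the hypothetical internal API wrapper sketched earlier
      # in this footnote.

      class AppDatabaseContractTest(unittest.TestCase):
          """Small-ish contract test: exercises the wrapper against a real
          DBAPI implementation (sqlite3) standing in for production."""

          def setUp(self):
              # One shared in-memory connection plays the role of "production."
              conn = sqlite3.connect(":memory:")
              conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
              conn.execute("INSERT INTO users VALUES (1, 'Ada')")
              self.db = AppDatabase(lambda: conn)

          def test_find_user_round_trip(self):
              self.assertEqual(self.db.find_user(1), [(1, "Ada")])

      if __name__ == "__main__":
          unittest.main()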

    The contract test did require some subtle setup and comments explaining it. Still, dealing with one such test and object under test beats the hell out of dealing with one or more large system tests. And it definitely beats pushing a “fix” and having no idea whether it stands a chance of holding up in production! 

  41. I deliberately avoid saying which specific proportion of test sizes is appropriate. The shape of the Test Pyramid implies that one should generally try to write more small tests, fewer medium tests, and relatively few large tests. Even so, it’s up to the team to decide, through their own experience, what the proportions should be to achieve optimal balance for the project. The team should also reevaluate that proportion continuously as the system evolves to maintain the right balance.

    I also have scar tissue regarding this issue thanks to Test Certified. Intending to be helpful, we suggested a rough balance of 70% small, 20% medium, and 10% large as a general target. It was meant to be a rule of thumb, and a starting point for conversation and goal setting—not “The One True Test Size Proportion.” But OMG, the debates over whether those were valid targets, and how they were to be measured, were interminable. (Are we measuring individual test functions? Test binaries/BUILD language targets like cc_test? Googlers, at least back then, were obsessed with defining precise, uniform measurements for their own sake.)

    On the one hand, lively, respectful, constructive debate is a sign of a healthy, engaged, dynamic community. However, this particular debate—as well as the one over the name “Test Certified”—seemed to miss the point, amounting to a waste of time. We just wanted teams to think about the balance of tests they already had and needed to achieve, and to articulate how they measured it. It didn’t matter so much that everyone measured in the exact same way, and it certainly didn’t matter that they achieved the same test ratios. It only mattered that the balance was visible within each individual project—and to the community, to provide inspiration and learning examples.

    Consequently, while designing Quality Quest at Apple, we refrained from suggesting any specific proportion of test sizes, even as a starting point. The language of that program instead emphasized the need for each team to decide upon, achieve, and maintain a visible balance. We were confident that creating the space for the conversation, while offering education on different test sizes (especially smaller tests), would lead to productive outcomes. 

  42. “Flaky” means that a test will seem to pass or fail randomly without a change in its inputs or its environment. A test becomes flaky when it’s either validating behavior too specific for its scope, or isn’t adequately controlling all of its inputs or environment—or both. Common sources of flakiness include system clocks, external databases, or external services accessed via REST APIs.
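
    For example, a test that reads the system clock directly may pass or fail depending on when it runs. Injecting the clock removes that source of flakiness entirely. A minimal sketch in Python, with hypothetical names:

      import datetime

      def is_expired(expiry, now=datetime.datetime.utcnow):
          # "now" is an injectable dependency; production code uses the real clock.
          return now() >= expiry

      def test_is_expired_controls_the_clock():
          # The test substitutes a fixed clock, so the result never depends on
          # when (or how slowly) the test happens to run.
          fixed_now = lambda: datetime.datetime(2023, 6, 1, 12, 0, 0)
          assert is_expired(datetime.datetime(2023, 5, 31), now=fixed_now)
          assert not is_expired(datetime.datetime(2023, 6, 2), now=fixed_now)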

    A flaky test is worse than no test at all. It conditions developers to spend the time and resources to run a test only to ignore its results. Actually, it’s even worse—one flaky test can condition developers to ignore the entire test suite. That creates the conditions for more flakiness to creep in, and for more bugs to get through, despite all the time and resources consumed.

    In other words, one flaky test that’s accepted as part of Business as Usual marks the first step towards the Normalization of Deviance.

    There are three useful options for dealing with a flaky test:

    1. If it’s a larger test trying to validate behavior too specific for its scope, relax its validation, replace it with a smaller test, or both.
    2. If what it’s validating is correct for its scope, identify the input or environmental factor causing the failure and exert control over it. This is one of the reasons test doubles exist.
    3. If you can’t figure out what’s wrong or fix it in a reasonable amount of time, disable or delete the test.

    Retrying flaky tests is NOT a viable remedy. It’s a microcosm of the Arms Race as a whole. Think about it:

    • Every time a flaky test fails, it’s consuming time and resources that could’ve been spent on more reliable tests.
    • Even if a flaky test fails on every retry, people will still assume the test is unreliable, not their code, and will merge anyway.
    • Increasing retries only consumes more resources while enabling people to continue ignoring the problem when they should either fix, disable, or delete the test.
    • Bugs will still slip through, introduce risk, and create rework even after all the resources spent on retries.

  43. The last thing you want to do with a flaky or otherwise consistently failing test is mark it as a “known failure.” This will only consume time and resources to run the test and complicate any reporting on overall test results.

    Remember what tests are supposed to be there for: To let you know automatically that the system isn’t behaving as expected. Ignoring or masking failures undermines this function and increases the risk of bugs—and possibly even catastrophic system failure.

    Assume you know that a flaky or failing test needs to be fixed, not discarded. If you can’t afford to fix it now, and you can still afford to continue development regardless, then disable the test. This will save resources and preserve the integrity of the unambiguous pass/fail signal of the entire test suite. Fix it when you have time later, or when you have to make the time before shipping.

    Note I said “if you can still afford to continue development,” not “if you must continue development.” If you continue development without addressing problems you can’t afford to set aside, it will look like willful professional negligence should negative consequences manifest. It will reflect poorly on you, on your team, and on your company.

    Also note I’m not saying all failures are necessarily worthy of stopping and fixing before continuing work. The danger I’m calling out is assuming most failures that aren’t quickly fixable are worth setting aside for the sake of new development by default. Such failures require a team discussion to determine the proper course of action—and the team must commit to a clear decision. The failure to have that conversation or to commit to that clear decision invites the Normalization of Deviance and potentially devastating risks. 

  44. Frequent demos can be a very good thing—but not when making good demos is appreciated more than high internal software quality and sustainable development. 

  45. I’ve called this concept of collecting signals to inform decision making “Vital Signs” because I believe “data-driven decision making” has lost its meaning. As often happens with initially useful innovations, the term “data-driven decision making” has become a buzzword. It’s a sad consequence of a misquote of W. Edwards Deming, an early pioneer of data-driven decision making, who actually said:

    “It is wrong to suppose that if you can’t measure it, you can’t manage it—a costly myth.”

    The New Economics, Chapter 2, “The Heavy Losses”

    Over time, this got perverted to “If you can’t measure it, you can’t manage it.” (The perversion is likely because people know him as a data advocate, and are ignorant of the subtlety of his views.)

    Many who vocally embrace data-driven decision making today tend to put on a performance rather than apply the principle in good faith. They tend to want to let the data do the deciding for them, absolving them of professional responsibility to thoughtfully evaluate opportunities and risks. It’s a ubiquitously accepted Cover Your Ass rationale, a shield offering protection from the expectation of ever having to take any meaningful action at all. It’s also a hammer used to beat down those who would take such action—especially new, experimental action lacking up-front evidence of its value. Even so, often “the data shows” that we should do nothing, or do something stupid or unethical. This holds even when other salient, if less quantifiable, signals urge action, or a different course of action.

    As such, allegiance to “data-driven decision making” tends to encourage Groupthink and to produce obstacles to meaningful change. By contrast, “Vital Signs” evokes a sense of care for a living system, and a sense of commitment to ensuring its continued health. It implies we can’t check a box to say we’ve collected the data and can take system quality and health for granted. We have to keep an eye on our system’s Vital Signs, and maintain responsibility for responding to them as required.

    My visceral reaction arises from all the experiences I’ve had (using Crossing the Chasm terminology) with Late Majority members lacking courage and Laggards resisting change. I’ll grant that the Late Majority may err on the side of caution, and once they’re won over, they can become a force for good. But Laggards feel threatened by new ideas and try to use data, or the lack thereof, as a weapon. Then when you do produce data and other evidence, they want to move the goalposts.

    The Early Majority is a different story altogether. I’ve had great experiences with Early Majority members who were willing to try a new approach to testing and quality, expecting to see results later. Once we made those results visible, it justified further investment. This is why it’s important to find and connect with the Early Majority first, and worry about the Late Majority later—and the Laggards never, really. 

  46. I’m often asked if teams should always achieve 100% code coverage. My response is that one should strive for the highest code coverage possible. This could possibly be 100%, but I wouldn’t worry about going to extreme lengths to get it. It’s better to achieve and maintain 80% or 90% coverage than to spend disproportionate effort to cover the last 10% or 20%.

    That said, it’s important to stop looking at code coverage as merely a goal—use it as a signal that conveys important information. Code coverage doesn’t show how well tested the code is, but how much of the code isn’t exercised by small(-ish) tests at all.

    So it’s important to understand clearly what makes that last 10% or 20% difficult or impractical to cover—and to decide what to do about it. Is it dead code? Or is it a symptom of poor design—and is refactoring called for? Is there a significant risk to leaving that code uncovered? If not, why keep it?

    Another benefit to maintaining high coverage is that it enables continuous refactoring. The Individual skill acquisition section expands on this. 

  47. As the linked page explains, the “R” in “MTTR” can also stand for “Repair,” “Recovery,” or “Respond.” However, I like to suggest “Resolve,” because it includes response, repair, recovery, and a full follow through to understand the issue and prevent its recurrence. 

  48. SonarQube is a popular static analysis platform, but I’m partial to Teamscale, as I happen to know several of the CQSE developers who own it. They’re really great at what they do, and are all around great people. They provide hands-on coaching and support to ensure customers are successful with the system, which they’re constantly improving based on feedback. I’ve seen them in action, and they deeply understand that it’s the tool’s job to provide insight that facilitates ongoing conversations.

    (No, they’re not paying me to advertise. I just really like the product and the people behind it.)

    I also like to half-jokingly say Teamscale is like an automated version of me doing your code review—except it scales way better. The more the tool automatically points out code smells and suggests where to refactor, the more efficient and more effective code reviews become. 

  49. I can’t remember where I got the idea, but it’s arguably better to develop a process manually before automating it. In this way, you carefully identify the value in the process, and which parts of it would most benefit from automation. If you start with automation, you’re not starting from experience, and people may resent having to use tools that don’t fit their actual needs. This applies whether you’re building or buying automation tools and infrastructure.

    Of course, if you have past experience and existing, available tools, you can hit the ground running more quickly. The point is that it’s wasteful to wait for automation to appear when you could benefit from a process improvement now, even if it’s manual. 

  50. These next two statements defining “culture” are my paraphrase of a concept I discovered from an Apple internal essay. 

  51. This has been my tagline for years. I think I originally used it in the text of The Rainbow of Death from March 2017. 

  52. The Crossing the Chasm model can be traced back to Everett Rogers’s Diffusion of innovations model from 1962. That model differentiated the five populations, but lacked a “chasm.” The chasm was added by Lee James and Warren Schirtzinger of Regis McKenna Inc., where Moore also worked.

    Articles that dig into the Chasm’s history include:

    The first article above presents a number of criticisms of the Crossing the Chasm model. I think that, like criticisms of the Test Pyramid model, they split hairs and miss the point. Not because their points aren’t valid, but because they’re better presented as further refinements for consideration after grasping the concept, not criticisms of the model.

    No model is perfect, but a good one is at least effective at bringing new people into the conversation. Once they’re in, and comfortable with the concepts and the language, we can point out nuances not captured by the model. But without the model, people may not gain access to the conversation to begin with. 

  53. I’ve had people suggest that Laggards are actually the dominant population, comprising the actual majority. I remind them that it only seems that way—they’re the most vocal because they feel they have something to lose. Once both Majorities adopt an innovation, their voices lose power. 

  54. Albert Wong, former Googler and member of the U.S. Digital Service. I saw his original model in his presentation on his early work as a member of the USDS, working with Citizenship and Immigration Services. In my mind, I instantly saw it snapping into the Chasm—and helping me make sense of the Google Testing Grouplet’s story.

    I asked Albert if I could borrow the model, and he agreed. I also asked if he minded me giving it a funny name, and he didn’t.

    The multicolored span of the model reminds me of a rainbow, and my weird sense of humor inspired me to pair it with an incongruous concept. Hence, “The Rainbow of Death.”

    Two years after I started using the model, I realized how the concept of “Death” actually fits. The model helps explain how the problem you want to solve may not be the problem you have to solve first. To achieve that insight, old ideas about the problem and what’s required to solve it have to die to make room for new ideas.

    For example, the Testing Grouplet wanted to improve automated testing and software quality—but we had to figure out how to sell others on it. We eventually realized we needed to do more than train new hires once, host tech talks, and give out books. We kept doing those things, but we couldn’t only continue putting information out there in the hopes that people would use it. We realized we needed to get people more directly engaged—leading to Testing on the Toilet, Test Certified, the Test Mercenaries, and a series of Fixits. Our work also influenced build and testing infrastructure development, culminating in the launch of the Test Automation Platform.

    More to come in a following footnote… 

  55. The “Revolution” was the third Google-wide testing Fixit I organized, helping set up the TAP (Test Automation Platform) Fixit two years later. This event introduced Google’s now famous cloud based build and testing infrastructure to projects across the company. I named it after my favorite Beatles tune (tied with “I Am the Walrus”), leading to spectacularly Beatles-themed announcements, advertisements, prizes, etc.

    One of the neatest things was that, for weeks afterwards, I would hear people talking about “Revolutionizing” their builds. Even though not every project participated fully on the Fixit day, within a year, every project had migrated to the new infrastructure. I compare the before and after effects in Coding and Testing at Google, 2006 vs. 2011.

    I never got around to blogging about either the Revolution or the TAP Fixit before I ran out of steam writing about Google in 2012. Time has passed, and many memories have faded, but I may yet try to share what I’m able one day. 

  56. After developing the Rainbow of Death, I kept trying to use it as an answer key. I’d show it to people and expect them to “get it,” shortcut the exploration phase, get straight to implementation, and shave years off the process.

    After hitting the wall for about the third time, at Apple, I eventually realized it wasn’t an answer key, but a blueprint. Yes, it can help trained experts understand what the finished structure looks like. However, it has to come together over time, with many adjustments along the way. You have to find and purchase a site, prepare the site, put in the framing, then the electrical and plumbing infrastructure, etc. You can’t have the bulldozers, construction workers, roofers, siders, painters, and interior decorators all start at the same time. And the assumption is they are all already knowledgeable in what they need to do, and they’re already bought into doing it.

    Spreading adoption of good automated testing practices has its own order of dependencies—and you have to provide education and secure buy in as you go. Sharing the Rainbow of Death is fun and useful for existing Instigators, especially after completing the mission, showing how years of chaos converged into achievement. But it’s not the most effective tool for recruiting new Instigators and influencing the Early Majority. There really aren’t any shortcuts; it’s always going to take time.

    In other words, my own idea about how to approach the problem using the Rainbow of Death needed to die, so new ideas could emerge. Specifically, I needed to set aside the complexity of the Rainbow of Death, and embrace the “focus and simplify” principle as a starting point instead. 

  57. An Apple internal article used the example of Amundsen and Scott’s expeditions to the South Pole to illustrate the need to “focus and simplify.” Amundsen focused on getting there with the best sled dogs, succeeded on 1911-12-14, and survived. Scott tried a diversified approach, and did reach the South Pole on 1912-01-17, but he and his crew all died during the return trip.

    Other articles outside Apple highlight other differences in the mindset and leadership styles between the two. Amundsen was adaptable to conditions beyond his control, learned from the wisdom of others, assembled the most skilled team possible, and paid attention to details. Scott didn’t heed the weather, was casual about team composition and details, and plowed ahead through sheer assertion of confidence. Like Feynman later warned, nature will not be fooled by public relations.

  58. This is similar to Jim Collins’s Flywheel Effect.

  59. The Rainbow of Death presentation uses the model to describe how the Testing Grouplet built up its efforts over time, one step at a time. We did try many things, some in parallel, but we tended to establish one major program at a time before focusing on establishing another. Ironically, I later tried to use the model in several organizations to launch a bunch of efforts in parallel from the very start.

    Thankfully I finally learned my lesson at Apple, and got the Quality Culture Initiative to focus and simplify. First, we got our training program fully completed, launched, scheduled, and staffed. Then the internal podcast team got serious about publishing episodes more regularly. While I was focused on those things, another core QCI member got Quality Quest on a strong footing in his organization. We then merged it back into the QCI mainstream, allowing it to spread to other organizations.

    After that, we began experimenting again with other projects, some sticking, some not so much. Whenever a project seemed to stall, we’d invoke our “focus and simplify” mantra and pour that focus into more productive areas. 

  60. Of course Martin Fowler is famous for popularizing the term “refactoring” thanks to his book Refactoring: Improving the Design of Existing Code. He defines the term specifically thus (on the refactoring.com page):

    “Refactoring is a disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior.

    “Its heart is a series of small behavior preserving transformations. Each transformation (called a ‘refactoring’) does little, but a sequence of these transformations can produce a significant restructuring. Since each refactoring is small, it’s less likely to go wrong. The system is kept fully working after each refactoring, reducing the chances that a system can get seriously broken during the restructuring.”

    The spirit of this is encapsulated by a famous tweet from Kent Beck:

    “for each desired change, make the change easy (warning: this may be hard), then make the easy change”

    https://twitter.com/kentbeck/status/250733358307500032

    Also note on the refactoring.com page that Martin specifically asserts that “Refactoring is a part of day-to-day programming” and goes on to describe how. In the book, Martin gives this advice to those who still feel they need to ask for permission before refactoring anything:

    “Of course, many managers and customers don’t have the technical awareness to know how code base health impacts productivity. In these cases, I give my most controversial advice: Don’t tell!

    “Subversive? I don’t think so. Software developers are professionals. Our job is to build effective software as rapidly as we can. My experience is that refactoring is a big aid to building software quickly.”

  61. I learned about this specifically from Wolfgang Trumler, who I believe got it from Joshua Kerievsky’s book Refactoring to Patterns.

  62. Remember from earlier that Apple’s goto fail bug was hidden by six copies of the same algorithm in the same file. To see how unit testing discipline could’ve caught or prevented this by discouraging duplication, see my article “Finding More Than One Worm in the Apple.” This example also illustrates that the Don’t Repeat Yourself (DRY) principle isn’t a purely academic concern.

    There’s a school of thought that suggests duplication is OK before landing on the correct abstraction. I consider this dangerous advice, because it’s so easily misunderstood and used to justify low standards. Programmers are notorious for taking shortcuts in code quality in order to move on to the next new thing to work on. They’re also notorious for using any available rationale to justify this behavior, and often disparaging more thoughtful approaches as “religion.” (Not that some can’t get carried away in the opposite direction—but it’s more common to find programmers attacking “religion” than programmers who are certifiable zealots.)

    I understand the utility of duplicating bits of code in one’s private workspace while experimenting with a new change. However, I think the fear of the potential costs of premature abstraction is overblown. The far, far greater danger is that of “experimental” duplication getting shipped, leading to hesitation to change shipping code. Instead of the “hasty abstraction” getting baked in, dangerous duplication does.

    After all, a premature abstraction should prove straightforward to reverse. Working with it should quickly reveal its shortcomings, suggesting either refactoring it or breaking it apart in favor of duplication where there’s good reason to. If it wasn’t premature, then making changes to the only copy is less time-consuming and error-prone than having to update multiple copies.

    Replacing duplication with a suitable abstraction after the fact should be easy, but the duplication gives cover to potentially unnoticed bugs in the meantime. Again, goto fail illustrates how easy it is to miss bugs in duplicate code. Once you’ve seen the first copy, the rest tend to look the same, even if they’re not. Our brains are so eager to detect and match patterns that they trick us into skipping over critical details when we’re not careful. (I believe this is because we process duplicate code with “System 1” thinking instead of more expensive “System 2” thinking, per Thinking, Fast and Slow.) 
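
    To illustrate with a trivial, entirely hypothetical example: duplicated checks invite exactly the kind of skimming that hides bugs, while a single extracted rule can be pinned down by one small test.

      from types import SimpleNamespace

      # Duplicated: the second copy silently dropped a condition, and it's easy
      # to skim past because it looks just like the first.
      def can_publish_article(user, article):
          return user.is_active and user.is_verified and article.is_complete

      def can_publish_comment(user, comment):
          return user.is_active and comment.is_complete  # is_verified went missing

      # Extracted: one copy of the rule, covered by one small test.
      def can_publish(user, item):
          return user.is_active and user.is_verified and item.is_complete

      def test_unverified_users_cannot_publish():
          user = SimpleNamespace(is_active=True, is_verified=False)
          assert not can_publish(user, SimpleNamespace(is_complete=True))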

  63. We all know a 50 line code change is generally much faster and easier to review than a 500 line change. (500 lines of new or changed behavior, that is—500 lines of search and replace or deleted code is different.) Encouraging smaller reviews encourages decomposing larger changes into a series of smaller ones that can be independently tested, reviewed, and merged. This enables more thorough reviews, faster and more stable tests, and higher long term code quality and maintainability.

    Even so, some hold onto the dated belief that one should submit entire feature changes at once to avoid “dead code.” The thinking, I suppose, is that one risks introducing unused code if a larger change is introduced one piece at a time. The value judgment seems to be that unused code is a greater risk to quality than, say, poorly tested code.

    This, however, increases the risk of checking in “deadly code” that contains a bug that could harm users in some way. This is because larger changes are generally more difficult to test and review thoroughly. Mandating ill-advised all-at-once changes to compensate for poor design sense, poor communication, poor code quality, and poor process can’t overcome those issues. In fact, it all but guarantees their perpetuation. 

  64. Of course, you’ll hear people make some variation of the excuse “It’s just test code” for writing sloppy tests. However, if the tests are there to ensure the quality and readiness of the production code, then the tests are part of our production toolchain. If a test fails, it should halt production releases until we’ve aligned the reality of the system’s behavior with our expectations (like Toyota’s andon cord). If a failure doesn’t warrant a halt in production, the test is a waste of resources (including precious developer attention) and should be removed. As such, our tests deserve as much respect and care as any other part of our value-creating product or infrastructure. 

  65. In Working Effectively with Legacy Code, Michael Feathers defines “legacy code” thus:

    “To me, legacy code is simply code without tests.”

    —Preface, p. xvi

    His rationale, from the same page:

    “Code without tests is bad code. It doesn’t matter how well written it is; it doesn’t matter how pretty or object-oriented or how well-encapsulated it is. With tests, we can change the behavior of our code quickly and verifiably. Without them, we really don’t know if our code is getting better or worse.”

    Of course, we can also change our code while preserving its behavior quickly and verifiably, for the purpose of refactoring.

  66. Feathers further explains that “seams” are where we can change the behavior of code without changing the code itself. There are three kinds of seams:

    • Preprocessor seams use #define macros to rewrite the code in languages that use the C preprocessor.
    • Link seams use a static or dynamic linker, or the runtime loader, to change how a program binary is built or run. Examples include manipulating the LD_LIBRARY_PATH or CLASSPATH environment variables (or their equivalents in other languages’ build and runtime environments).
    • Polymorphic seams rely upon dependency injection to build an object graph at runtime. This allows the program itself to choose which implementations to include—such as test programs using test doubles to emulate production dependencies.

    Polymorphic seams are the most common and most flexible kind, as well as the first one we reach for to write testable code. The term is essentially synonymous with “dependency injection.” Preprocessor and link seams aren’t as flexible, scalable, or easy to use, but can work if you have no reasonable opportunity to introduce polymorphic seams.
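
    A minimal sketch of a polymorphic seam in Python (all names hypothetical): the object under test receives its collaborator through its constructor, so a test can substitute a double without touching the production code.

      class SmtpMailer:
          def send(self, recipient, body):
              ...  # the real implementation would talk to an SMTP server

      class ReportSender:
          def __init__(self, mailer):
              self._mailer = mailer  # the seam: behavior varies with the injected object

          def send_report(self, recipient, figures):
              self._mailer.send(recipient, "Total: " + str(sum(figures)))

      class MailerSpy:
          """Test double that records calls instead of sending email."""
          def __init__(self):
              self.sent = []

          def send(self, recipient, body):
              self.sent.append((recipient, body))

      def test_send_report_totals_the_figures():
          spy = MailerSpy()
          ReportSender(spy).send_report("qa@example.com", [1, 2, 3])
          assert spy.sent == [("qa@example.com", "Total: 6")]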

    Note that using any seam successfully depends on the quality of the interface that defines it. The upcoming Scott Meyers quote speaks to that.

    See the earlier footnotes from The Test Pyramid on test doubles and internal APIs for some details on the benefits of dependency injection. 

  67. I first started using electrical outlets as an example in Automated Testing—Why Bother?:

    First, we need to understand the fundamental building block of testable code: Abstractions, as defined by interfaces. We create abstractions every time we write a class interface, or a module interface, or an application programming interface. And these abstractions perform two powerful functions:

    • They define a contract such that certain inputs will produce certain outputs and side-effects.
    • They provide seams between system components that allow for isolation between components.

    My favorite example of a powerful interface boundary is an electrical outlet. The shape of the outlet defines a contract between the power supplier and the power consumer, which remain thoroughly isolated from one another beyond the scope of that physical boundary.21 It’s easier to reason about both sides of the interface than if the consumer was wired directly into the source.

    In software, problems arise when we fail to consider one or the other of these functions, when either the contract isn’t rigorously defined and understood, or when the interfaces don’t permit sufficient isolation between components. This often happens when we fail to design our interfaces intentionally.

    In contrast, the more intentional our interfaces, the more natural our abstractions and seams. Automated testing obviously serves to validate the constraints of an interface contract; but the process of writing thorough, readable, reliable tests also encourages intentional interface design. “Testable” interfaces that enable us to exercise sufficient control in our tests tend to be “good” interfaces, and vice versa. In this way, testability forces a host of other tangible benefits, such as readability, composability, and extensibility.

    Basically, “testable” code is often just easy to work with!

    21 For a list of electrical plug and outlet specs used across the world, see: https://www.worldstandards.eu/electricity/plugs-and-sockets/

    Put more concretely: the power source could be anything from coal to wind, hydro, solar, or a hamster wheel. The consumer could be a lamp, a computer, or a wall of Marshall stacks. The shape of the outlet should ensure the voltage and amperage match such that neither side cares what’s on the other—it all just works! A fault, failure, or other problem on one side won’t usually damage the other, either. This is especially true given common safety infrastructure such as surge protectors, fuses, and circuit breakers. Plus, you can use an electrical outlet tester as a test double to detect potential wiring issues.

    It also greatly simplifies debugging (also sampled from my 2022-12-23 email with Alex Buccino):

    • If a plugged in device stops working, but the lights are still on in your house/building, you can check a few things yourself. You can see if it’s unplugged, if a switch was flipped, if a fuse/breaker blew, or if the device itself is faulty. You can pinpoint and fix most of these issues quickly, with no need to worry about the electrical grid.

    • However, if all the lights went off in your house at the same time, the problem’s beyond your control. Unless you work for the electric company, you should be able to trust that the company will send a crew to resolve the issue shortly.

    • Were the device wired into the electrical system directly, however, your debugging and resolution would be more costly and risky. Also, the delineation of responsibility between yourself and the electric company might not be as clear.

    The common electrical outlet is a remarkably robust interface that unleashes enormous productivity every day—imagine if software in general was even remotely as reliable! 

  68. This is a paraphrase of a similar statement by my former colleague Max Goldstein. 

  69. I’m using “little-a agile” here, though this description certainly applies to the “capital-A Agile” methodology. I’m also reminded of the adage popularized by Dwight D. Eisenhower, “Plans are worthless, but planning is everything.”

    The Quote Investigator page for the Eisenhower quote has a great summary as well:

    “The details of a plan which was designed years in advance are often incorrect, but the planning process demands the thorough exploration of options and contingencies. The knowledge gained during this probing is crucial to the selection of appropriate actions as future events unfold.”

  70. Recall that I defined “making software quality visible” as “providing meaningful insight into quality outcomes and the work necessary to achieve them.” 

  71. I find the insights from The Story Grid to be very helpful when it comes to thinking through the process and mechanics of storytelling. Though it’s not all directly applicable to technical storytelling, some concepts definitely translate, such as “core need/value,” “controlling idea,” and “objects of desire.” For a concise overview, see 1-Page Book Plan: The Story Grid Foolscap.

  72. This advice is also congruent with Joel Schwartzberg’s Get to the Point! Sharpen Your Message and Make Your Words Matter. For more background on this particular phrase and the spelling of “lede,” see: Why Do We ‘Bury the Lede?’ The article’s apt subtitle is “We buried ‘lead’ so far down that we forgot how to spell it.” The introductory summary states:

    “A lede is the introductory section in journalism and thus to bury the lede refers to hiding the most important and relevant pieces of a story within other distracting information. The spelling of lede is allegedly so as to not confuse it with lead (/led/) which referred to the strip of metal that would separate lines of type. Both spellings, however, can be found in instances of the phrase.”

  73. This phrase was my extemporaneous response to a Slack comment during a major presentation. The commenter asserted, roughly, that the presentation had been abstract to that point, and they were waiting to be told what to do, please. I responded, “Practices need principles. We’re getting there.”

    More generally, the commenter’s apparent posture is a big part of why software quality issues continue to plague society. As a species, especially in the Internet era, we’re programmed to favor “System 1” thinking and jump to the nearest available shortcut.

    If we’ve already embraced the right principles and mindset, or absorbed the best possible examples earlier in our career, this is less of a problem. In that case, it may be more efficient to go straight to the examples if we already “get it.” My former Test Mercenaries colleague Paul Hammant suggests this in his Tutorials vs. Reference Docs vs. Examples blog post:

    “The more experienced developers get, the more likely they are to leave tutorial and api-doc as a way of gaining knowledge of a thing, and more toward examples.”

    But if we haven’t been exposed to good practices and the principles behind them already, we need at least some deliberate context building first. (Sadly, this is still the most common case, apparently.) When we see a new practice that runs counter to all the examples we’ve seen before, we need a little preparation first. Otherwise we may reflexively dismiss a potentially valuable new practice as pointless nonsense, absent sufficient context and insight to understand the problem it solves.

    BTW, earlier I referred to the commenter’s “apparent” posture, because I knew this person already “gets it.” I was a little surprised in the moment, but we worked out a fuller understanding later. Regardless, the commenter may not’ve found reviewing the principles personally useful, but many others were likely hearing them for the first time. Or if they have, I find there’s still value in hearing different people play the same tune in their own unique voice. 

  74. I picked up on using the term “mindset” deliberately and frequently after a chat with an executive that once helped me get hired. Once he said it, I knew that was a concept and a term that bore repeating early and often. (I was surprised I hadn’t thought to do so earlier!) After all, you can have all the knowledge and tools in the world and still be stuck, but with the right mindset, almost anything’s possible. 

  75. Originally we tried to come up with Quality Quest levels without directly referencing the exact Test Certified requirements. When that didn’t really work out, I copied the Test Certified requirements from my Testing on the Toilet blog post into a Confluence page. Then I asked everyone “What do we need to change to make this work for Apple?” The version we came up with then proved far more successful. 

  76. If you were paying attention to the description of the narrative arc—or at least to the agenda at the beginning—you should’ve seen this part coming! 

  77. I was inspired to create this list after reading The Leadership Campaign chapter “Step 6: DEFINE Everything.”