Copyright 2023 Mike Bland, licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
This presentation, Making Software Quality Visible, is my calling card. It describes my approach to driving software quality improvement efforts, especially automated testing, throughout an organization—applying systems thinking and leadership to influence its culture.
If you’d be interested in my help, please review the slides and script below. If my approach aligns with your sense of what your organization needs, please reach out to me at firstname.lastname@example.org. (Also see my Hire me! page for more information.)
NOTE AS OF 2023-02-06
Though the deck is essentially complete, I’m still finishing writing out the full script. I’ll announce the finished artifact in a blog post.
- NOTE AS OF 2023-02-06
- Google and Apple Examples
- What Software Quality Is and Why It Matters
- Why Software Quality Is Often Unappreciated and Sacrificed
- The Test Pyramid and Vital Signs
- Building a Software Quality Culture
- DONE TO HERE
- Calls to Action
This is the latest, most complete version of the slides. Click Open in Keynote to view the presenter’s notes and download a copy of the Keynote file:
This Google Drive folder contains a copy of the Keynote presentation file as well as a PDF.
We’ll discuss why internal software quality matters, why it’s often unappreciated and sacrificed, and what we can do to improve it. More to the point, we’ll discuss the importance of instilling a quality culture to promote the proper mindset first. Only on this foundation will seeking better processes, better tools, better metrics, or AI-generated test cases yield outcomes we can live with.
I’m Mike Bland, I’m a programmer, and I’m going to talk about how making software quality visible…
Software Quality must be visible to minimize suffering.
…will minimize suffering.1 By “suffering,” I mean: The common experience of software that’s painful—or even dangerous—to work on or to use. By “making software quality visible,” I mean: Providing meaningful insight into quality outcomes and the work necessary to achieve them.
Quality work can be hard to see. It’s hard to value what can’t be seen—or to do much of anything about it.
This is important because it’s often difficult to show quality work, or its impact on processes or products.2 How do we show the value of avoiding problems that didn’t happen? This makes it difficult to prioritize or justify investments in quality, since people rarely value work or results they can’t see. Plus, people can’t effectively solve problems they can’t accurately sense or understand.
Google and Apple Examples
I’ll share high-level examples of what I mean by making quality work visible to minimize suffering, based on my experience at Google and Apple.
What Software Quality Is and Why It Matters
We’ll define internal software quality and explain why it’s just as essential as external.
Why Software Quality Is Often Unappreciated and Sacrificed
We’ll examine several psychological and cultural factors that are detrimental to software quality.
The Test Pyramid and Vital Signs
We’ll use the Test Pyramid model to specify the principles underlying a sound testing strategy. We’ll also discuss the negative effects that its opposite, the Inverted Test Pyramid, imposes upon its unwitting victims. I’ll then describe how to use “Vital Signs” to get a holistic view of software quality for a particular project.
Building a Software Quality Culture
We’ll learn how to integrate the Quality Mindset into organizational culture, through individual skill development, team alignment, and making quality work visible.
Calls to Action
We’ll finish with specific actions everyone can take to lead themselves and others towards making software quality work and its impact visible.
Google and Apple Examples
Now here are some very brief summaries of my experiences with making quality work and its impact visible.
Google: Testing Grouplet
Rapid growth, hiring the “best of the best,” build/test tools not scaling
When I joined in 2005, the company was growing fast,4 and we knew we were “the best of the best.” However, our build and testing tools and infrastructure weren’t keeping up.
Lack of widespread, effective automated testing and continuous integration; frequent broken builds and “emergency pushes” (deployments)
Developers weren’t writing nearly enough automated tests, the ones they wrote weren’t that good, and few projects used continuous integration. As a result, code frequently failed to compile, and errors that made it to production would frequently lead to “emergency pushes,” or deployments.
Training, GWS Story, Test Certified, Test Mercenaries, Testing on the Toilet
Over time, the Testing Grouplet trained new hires, shared the Google Web Server story, and developed the Test Certified roadmap program. We built the Test Mercenaries internal consulting team, and our biggest hit was our Testing on the Toilet newsletter, appearing weekly in every company bathroom.
Two Testing Fixits, Revolution (Build Tools) Fixit, Test Auto. Platform Fixit
We also ran several “Fixits,” companywide events to address “important but not urgent” issues. Our Fixits inspired people to write and fix tests, to adopt new build tools, and finally, to adopt the Test Automation Platform continuous integration system.
Better tests, practices, build and test times; fewer bugs, more productivity
These efforts made quality work and its impact more visible than it had been. This helped people write better tests, adopt better testing practices and strategies, drastically improve build and test times, reduce bugs, and increase productivity. But perhaps the most visible result was scalability of the organization.
Google: Testing Grouplet results
2015, R. Potvin, Why Google Stores Bills. of LoC in a Single Repo
Rachel Potvin presented the following results in her presentation from @Scale 2015, “Why Google Stores Billions of Lines of Code in a Single Repository.” They may seem quaint to Googlers today, but they speak to the Testing Grouplet’s enduring impact five years after the TAP Fixit.
- 15 million LoC in 250K files changed by humans per week
- 15K commits by humans, 30K commits by automated systems per day
- 800K/second peak file requests
Of course, the Testing Grouplet isn’t responsible for all of this; Rachel’s talk describes an entire ecosystem of tools and practices. Even so, she states very clearly that:
- “TAP is our automated test infrastructure, without which this model would completely fall apart.” (13m:36s)
Also, it may amuse you to know that Testing on the Toilet, started in 2006, continues to this day!5
Apple: Quality Culture Initiative
Rapid growth, hiring the “best of the best,” build/test tools not scaling
When I joined in 2018, the company was growing fast, and we knew we were “the best of the best.” However, our build and testing tools and infrastructure weren’t keeping up.
Widespread automated and manual testing, but…
There was a strong testing culture, but not around unit testing.
“Testing like a user would” often considered most important
With so much emphasis on the end user experience, many believed that “testing like a user would” was the most important kind of testing.
Tests often large, UI-driven, expensive, slow, flaky, and ineffective
As a result, most tests were user interface driven, requiring full application and system builds on real devices. Since writing smaller tests wasn’t often considered, this led to a proliferation of large, expensive, slow, unreliable, and ineffective tests, generating waste and risk.
Apple: Quality Culture Initiative results
QCI activity as of November 2022—internal results confidential
It’s too early for the QCI to declare victory, and specific results to date are confidential. However, I can broadly describe the state of the QCI’s efforts by the time I left Apple in November 2022.
Training: 16 courses, ~40 volunteer trainers, ~360 sessions, ~6100 check-ins, ~3200 unique individuals
We launched an ambitious, wildly successful, and thriving training program to spread good coding and testing practices.
Internal podcast: 45 episodes and 500+ subscribers
Our internal podcast gave a voice to people of various roles and organizations across Apple to help drive the software quality conversation.
Quality Quest roadmap: ~80 teams, ~20 volunteer guides
Our Quality Quest roadmap, directly inspired by Test Certified, is helping teams across Apple improve their quality practices and outcomes.
QCI Ambassadors: 6 organizations started, 6 on the way
QCI Ambassadors help their organizations apply QCI principles, practices, and programs to achieve their quality goals.
QCI Roadshow: over 50 presentations
The QCI Roadshow helped us introduce QCI concepts and programs directly to groups across the company.
QCI Summit: ~50 recorded sessions, ~60 presenters, ~850 virtual attendees
Our QCI Summit event recruited presenters from across Apple to make their quality work and impact visible. We saw how QCI principles applied to operating systems and services, applications, frontends and backends, machine learning, internal IT, and development infrastructure.
What’s in a name?
What we realized three years after choosing it
One nice feature of the name “Quality Culture Initiative” that we didn’t realize for three years was how it encoded the total Software Quality solution:
Quality is the outcome we’re working to achieve, but as I’ll explain, achieving lasting improvements requires influencing the…
Culture. Culture, however, is the result of complex interactions between individuals over time. Any effective attempt at influencing culture rests upon systems thinking, followed by taking…
Initiative to secure widespread buy in for systemic changes. Selling a vision for systemic improvement and supporting people in pursuit of that vision requires leadership.
What Software Quality Is and Why It Matters
As leaders, we need to clearly define what we mean by software quality, and explain why it’s so important.
Is High Quality Software Worth the Cost?
In this article, Martin Fowler distinguished between:
External quality, which obviously makes users happy. This, in turn, keeps developers productive, since they don’t need to respond to problems reported by users. Then Martin argues that…
Internal quality helps keep users happy, by enabling developers to evolve the software easily and to resolve problems quickly. This is because high internal quality makes developers productive, since there’s less cruft and unnecessary complexity slowing them down from making changes.
Effects of quality on productivity over time
Martin also used this hypothetical graph, based on his experience, to illustrate the impact of quality tradeoffs over time.
With Low internal quality, progress is faster at the beginning, but begins to flatten out quickly.
With High internal quality, progress is slower at the beginning, but the investment pays off in greater productivity over time.
The break-even point between the two approaches arrives within weeks, not months.
Though Martin’s original graph didn’t show this, the difference in productivity between low and high internal quality is one way to visualize technical debt.7
“Fast, cheap, or good: pick two”
High quality software is cheaper to produce
Martin’s conclusion is that higher quality makes software cheaper to produce in the long run—that the “cost” of high quality software is negative. “Fast, cheap, or good: pick two” doesn’t hold as the system evolves. It may make sense at first to sacrifice good to get a cheaper product to market quickly. But over time, investing in “good” is necessary to continue delivering a product quickly and at a competitive cost.
Internal quality aids understanding
Clarity increases productivity and resilience, manages risk
Internal quality essentially helps developers continue to understand the system as it changes over time:8
Fosters productivity due to the clarity of the impact of changes
When they clearly understand the impact of their changes, they can maintain a rapid, productive pace.
Prevents foreseeable issues, limits recovery time from others
Understanding helps them prevent many foreseeable issues, and resolve any bugs quickly and effectively.
Provides a buffer for the unexpected, guards against cascading failures
These qualities help create a buffer for handling unexpected events,9 while also guarding against cascading failures.
Your Admins/SREs will thank you! It helps them resolve prod issues faster.
Your system administrators or SREs will be very grateful for building such resilience into your system, as it helps their response times as well.
Counterexamples: Global supply chain shocks; Southwest Airlines snafu
For counterexamples, recall the global supply chain shocks resulting from the COVID-19 pandemic, or the December 2022 Southwest Airlines snafu. These systems worked very efficiently in the midst of normal operating conditions. However, their intolerance for deviations from those conditions rendered them vulnerable to cascading failures.
Quality, clarity, resilience are essential requirements of prod systems
Consequently, internal software quality, and the clarity and resilience it enables, are essential requirements of any production software system.
Focusing on internal software quality is good for business…because it’s the right thing to do.
As mentioned, Martin Fowler’s argument is that internal software quality is good for business—it’s such a compelling argument that I brought it up first. However, he prefers making only this economic argument for quality. He asserts that appeals to professionalism are moralistic and doomed, as they imply that quality comes at a cost.10
I disagree that we should sidestep appeals to professionalism entirely, and that they’re incompatible with the economic argument. I think it’s worth exploring why professionalism matters, both because it is moral and because customers increasingly expect high quality software they can trust.
Quality without function is useless—but function without quality is dangerous.
Put more bluntly, high quality may be useless without sufficient functionality, but as we’ll see, functionality without quality can be dangerous. Let’s look at a few examples.
Northrop Grumman Mission Systems
Navigation for US Coast Guard vessels or US Navy submarines
My first professional programming project was a nautical chart renderer for a navigation system used by Coast Guard vessels and Navy nuclear submarines.
Requirement: Enumerate chart features
One day our product owner sent us some code to enumerate nautical chart features from a file.
Assumption: In memory size == on disk size
The code assumed each record was the same size in memory as it was on disk.
Reality: 21 bytes on disk, 24 in memory
However, the records were 21 bytes on disk, but the in memory structs were 24 bytes, thanks to byte padding.
Outcome: File size/24 == 12.5% data loss
As a result, this code ignored one eighth of the chart features in the file.
Impact: Caught before shipping!
Fortunately I caught this before it shipped to any nuclear submarines.
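To make the arithmetic concrete, here is a minimal Python sketch of the bug’s shape. The field names and record layout are invented for illustration; only the sizes (21 bytes on disk, 24 in memory) come from the story above.

```python
import ctypes

class ChartFeature(ctypes.Structure):
    # Hypothetical fields chosen so the raw data totals 21 bytes;
    # the real record layout isn't public.
    _fields_ = [
        ("lat", ctypes.c_double),         # 8 bytes
        ("lon", ctypes.c_double),         # 8 bytes
        ("feature_id", ctypes.c_uint32),  # 4 bytes
        ("kind", ctypes.c_uint8),         # 1 byte -> 21 bytes of data
    ]

ON_DISK_SIZE = 21                             # stored without padding in the file
IN_MEMORY_SIZE = ctypes.sizeof(ChartFeature)  # 24: padded to 8-byte alignment

num_records = 1000
file_size = num_records * ON_DISK_SIZE        # 21,000 bytes

# The buggy enumeration divided the file size by the *in-memory* struct size:
buggy_count = file_size // IN_MEMORY_SIZE      # 875 records, not 1000
fraction_lost = 1 - buggy_count / num_records  # 0.125, i.e. one eighth
```

A unit test asserting the enumerated count against a small fixture file with a known number of records would have caught this immediately.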
Apple’s goto fail
Finding More Than One Worm in the Apple, CACM, July 2014
In February 2014, Apple had to update part of its Secure Transport component…
Requirement: Apply algorithm multiple times
…which applied the same algorithm in six places.
Assumption: Short algorithms safe to copy
The developers apparently assumed that this short, ten line algorithm was safe to copy in its entirety, instead of making it a function.
Reality: Copies may not stay identical
One problem with duplication is that the copies may not remain identical.
Outcome: One of six copies had a bug
As it so happened, one of the six copies of this algorithm picked up an extra “goto” statement that short-circuited a security handshake.
Impact: Billions of devices
Once it was discovered, Apple had to push an emergency update to billions of devices. It’s unknown whether the bug was ever exploited.
My article “Finding More Than One Worm in the Apple” explains how this bug could’ve been caught, or prevented, by a unit test.
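The shape of the bug can be transliterated into a hedged Python sketch. The real code is C inside Secure Transport; the function and parameter names below are invented for illustration.

```python
def buggy_handshake_check(update_ok: bool, final_ok: bool) -> bool:
    """Returns True if the (simulated) verification 'succeeds'."""
    err = 0
    if not update_ok:
        err = 1
    # The stray duplicated exit, like the second "goto fail": taken
    # unconditionally, with err usually still 0 ("success").
    return err == 0
    if not final_ok:  # unreachable, like the skipped final verification step
        err = 1
    return err == 0

# The bug in one assertion: verification "passes" even though the
# final check should have failed.
assert buggy_handshake_check(True, False) is True
```

Extracting the repeated algorithm into a single function, plus one unit test exercising its failure path, would have prevented both the divergent copies and the dead code.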
In April 2014, OpenSSL had to update its “heartbeat” feature…
Requirement: Echo message from request
…which echoed a message supplied by a user request.
Assumption: User-supplied length is valid
The code assumed that the user supplied message length matched the actual message length.
Reality: Actual message may be empty
In fact, the message could be completely empty.
Outcome: Server returns arbitrary data
In that case, the server would hand back however many bytes of its own memory that the user requested, including secret key data.
Impact: Countless HTTPS servers
Countless HTTPS servers had to be patched. It’s unknown whether it was ever exploited.
My article “Goto Fail, Heartbleed, and Unit Testing Culture” explains how this bug could’ve been caught, or prevented, by a unit test.
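Here is a hedged Python sketch of the same shape. The real code is C inside OpenSSL; the memory contents, offsets, and names below are invented for illustration. (Python slicing silently truncates at the buffer’s end, whereas the C code read past the request into adjacent heap memory, which is the point of the bug.)

```python
# Simulated server memory: the request buffer sits next to other heap
# data, including secret material. Contents are entirely invented.
MEMORY = bytearray(b"REQ:hi" + b"| other heap data | SECRET_KEY=hunter2 |")

def buggy_heartbeat(msg_offset: int, claimed_len: int) -> bytes:
    # BUG: echoes claimed_len bytes starting at the message, without
    # checking claimed_len against the actual message length.
    return bytes(MEMORY[msg_offset:msg_offset + claimed_len])

# Honest request: message "hi" at offset 4, claimed length 2.
assert buggy_heartbeat(4, 2) == b"hi"

# Malicious request: tiny message, huge claimed length. The server
# echoes adjacent memory, secrets included.
leak = buggy_heartbeat(4, 100)
assert b"SECRET_KEY=hunter2" in leak
```

The real fix added a bounds check rejecting requests whose claimed length exceeds the actual payload; a unit test sending a short message with a long claimed length fails against the buggy version immediately.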
The point is that…
Quality Culture is ultimately Safety Culture
…a culture that values and invests in software quality is a Safety Culture. Society’s dependence on software to automate critical functions is only increasing. It’s our duty to uphold that public trust by cultivating a quality culture.
Why Software Quality Is Often Unappreciated and Sacrificed
Our next step as leaders is to understand, if software quality is so important, why it’s so often unappreciated and sacrificed.
Automated (especially unit) testing
Why hasn’t it caught on everywhere yet?
- Apple, in January 1992, identified the need to make time for training, documentation, code review—and unit testing!
At Apple, I found a document from January 1992 specifically identifying the need to make time for training, documentation, code review—and unit testing! That’s not just before Test-Driven Development and the Agile Manifesto, that’s before the World Wide Web!
There are a few reasons why unit testing in particular hasn’t caught on everywhere yet:11
People think it’s obvious and easy—therefore lower value
Many developers think it’s obvious and easy, and therefore can’t provide much value.12
Many still haven’t seen it done well—or may have seen it done poorly
Many others still haven’t seen it done well, or may have seen it done poorly, leading to the belief that it’s actually harmful.
There’s always a learning curve involved
For those actually open to the idea, there’s still a learning curve, which they may not have the time to climb.
Bad perf review if management doesn’t care about testing/internal quality
They may also fear spending time on testing and internal quality will result in a bad performance review if their management doesn’t care about it.13
Have you heard these ones before?
Common excuses for sacrificing unit testing and internal quality
Of course, people give their own reasons for not investing in testing and quality, including the following:
Tests don’t ship/internal quality isn’t visible to users (i.e., cost, not investment)
“We don’t ship tests, and users don’t care about internal quality.” Meaning, testing seems like a cost, not an investment.
“Testing like a user would” is the most important kind of testing
As mentioned before, “Testing like a user would” is considered most important, so investing in smaller tests and internal quality seems unnecessary.
Straw man: 100% code coverage is bullshit
The straw man that “writing tests to get 100% code coverage is bullshit.” This speaks to a fundamental ignorance about how to write good tests or to use coverage the right way.
Straw man: Testing is a religion (implying: I’m better than those people)
For some reason, technical people, especially programmers, like to pound their chests as being against so-called testing “religion” and those who practice it. It’s a flimsy excuse for trying to score social points by virtue signaling in view of one’s perceived peers. Framing a potentially reasonable discussion of different testing ideas in such a way only serves to shut it down for a superficial, unprofessional ego boost.14
“I don’t have time to test.”
Finally, “I don’t have time to test.” This could be a brush off, or a genuine indication that they don’t know how and can’t spare the time to learn—and management doesn’t care.
Business as Usual
All these reasons are why Business as Usual persists, as well as the Complexity, Risk, Waste, and Suffering that everyone’s used to. This then allows the Normalization of Deviance to take hold.
Normalization of Deviance (paraphrased)
Coined by Diane Vaughan in The Challenger Launch Decision
Diane Vaughan introduced this term in her book about the Space Shuttle Challenger explosion in January 1986.15 My paraphrased version of the definition is: A gradual lowering of standards that becomes accepted, and even defended, as the cultural norm.
Space Shuttle Challenger Accident Report
history.nasa.gov/rogersrep/genindex.htm, Chapter VI, pp. 129-131
The O-rings didn’t just fail on the Challenger mission on 1986-01-28…
Many of us may know that the O-rings lost elasticity in the cold weather, allowing gases to escape, which led to the explosion.
…anomalies occurred in 17 of the 24 (70%) prior Space Shuttle missions…
However, you may not realize that NASA detected anomalies in O-ring performance in 17 of the previous 24 shuttle missions, a 70% failure rate.
…and in 14 of the previous 17 (82%) since 1984-02-03
Even scarier, anomalies were detected in 14 of the previous 17 missions, for an 82% failure rate.
Multiple layers of engineering, management, and safety programs failed
This wasn’t only one person’s fault—multiple layers of engineering, management, and safety programs failed.17 However, Normalization of Deviance isn’t the end of the problem.
NASA: NoD leads to Groupthink
Terry Wilcutt and Hal Bell of NASA delivered their presentation The Cost of Silence: Normalization of Deviance and Groupthink in November 2014.18 On the Normalization of Deviance, they noted that:
“There’s a natural human tendency to rationalize shortcuts under pressure, especially when nothing bad happens. The lack of bad outcomes can reinforce the ‘rightness’ of trusting past success instead of objectively assessing risk.”
—Terry Wilcutt and Hal Bell, The Cost of Silence: Normalization of Deviance and Groupthink
They go on to cite the definition of Groupthink from Irving Janis:
“[Groupthink is] a quick and easy way to refer to a mode of thinking that persons engage in when they are deeply involved in a cohesive in-group, when concurrence-seeking becomes so dominant that it tends to override critical thinking or realistic appraisal of alternative courses of action.”
NASA: Symptoms of Groupthink
They then describe the symptoms of Groupthink:19
- Illusion of invulnerability—because we’re the best!
- Belief in Inherent Morality of the Group—we can do no wrong!
- Collective Rationalization—it’s gonna be fine!
- Out-Group Stereotypes—don’t be one of those people!
- Self-Censorship—don’t rock the boat!
- Illusion of Unanimity—everyone goes along to get along!
- Direct Pressure on Dissenters—because everyone else disagrees!
- Self-Appointed Mindguards—decision makers exclude subject matter experts from the conversation.
Any of these sound familiar? Hopefully from past, not current, experiences.
A common result of Groupthink is the well known systems thinking phenomenon called “The Cobra Effect.”
The Cobra Effect
Pay bounty for dead cobras
This comes from the true story of when the British administration in India offered people a bounty to help reduce the cobra population.
Cobras disappear, but still paying
This worked, but the British noticed they kept paying bounties when they didn’t see any more cobras around.
People were harvesting cobras
They realized people were raising cobras just to collect the bounty…
Ended bounty program
…so they ended the bounty program.
More cobras in streets than before!
People then threw their now useless cobras into the streets, making the problem worse than before.
Fixes that Fail: Simplistic solution, unforeseen outcomes, worse problem
This is an example of the “Fixes That Fail” archetype. This entails applying an overly simplistic solution to a complex problem, resulting in unforeseen outcomes that eventually make the problem worse.
The Arms Race
Systems thinking should replace brute force in the long run.
In software, I call this “The Arms Race.” This may sound familiar:
Investment to create capacity for existing practices and processes…
We invest people, tools, and infrastructure into expanding the capacity of our existing practices and processes.
Exhaustion of capacity leads to more people, tools, and infrastructure
Things are better for a while, but as the company and its projects and their complexity grow, that capacity’s eventually exhausted. This leads to further investment of people, tools, and infrastructure.
Then that capacity’s exhausted…
Then eventually that capacity’s exhausted…and the cycle continues.
AI-generated tests may help people get started in some cases…
There’s a lot of buzz lately about using AI-generated tests to take the automated testing burden off of humans. There may be room for using AI-generated tests as a starting point.
…but beware of abdicating professional responsibility to unproven tools.
However, AI can’t read your mind to know all your expectations—all the requirements you’re trying to satisfy or all the assumptions you’re making.20 Even if it could, AI will never absolve you of your professional responsibility to ensure a high quality and trustworthy product.21
In the end, we can’t win the arms race against growth and complexity.
We need to realize we can’t win the Arms Race against growth and complexity.
Rogers Report Volume 2: Appendix F
Richard Feynman’s final statement on the Challenger disaster is a powerful reminder of our human limitations:22
“For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.”
—Richard Feynman, Personal Observations on Reliability of Shuttle
As professionals, we must resist the Normalization of Deviance, Groupthink, and the Arms Race. We can’t allow them to become, or to remain, Business as Usual.
Don’t shame people for doing Business as Usual. Help them recognize and change it.
But we can’t shame anyone for falling into these common traps, because it’s human nature to fall into them. We need to help everyone recognize them, avoid them, and remain vigilant against them, so that we can change Business as Usual together.
Challenging & Changing Business as Usual
Changing cultural norms requires understanding them first.
Challenging cultural norms supporting Business as Usual isn’t easy, and it’s honestly frightening. There’s actually a good reason for that:
Everything in a culture happens for a reason—challenge reasons thoughtfully!
The existing norms do exist for a reason. The question is whether that reason holds up today. So we need to try to understand those reasons first, then challenge them thoughtfully.23
People often feel invested in old ways and fear the cost of new ways.
This is because many people feel invested in their existing methods. They fear the cost and risk of changing those methods, even if they no longer provide the value they once did. This is common human nature…
We’re often asked for data to prove a different approach actually helps—before trying it…
…and is why some will ask for data or other proof that a change will be effective before trying it.
…while we throw time, money, people, and tools at existing processes.
At the same time, they’ll continue throwing resources into existing processes as they have for years.
The most productive way to approach such a challenge is to take the time to gather enough information and build some trust.
- Challenge: Haven’t we proven that the existing ways aren’t (totally) effective?
You can then carefully question whether current methods are effective, or effective enough on their own, given substantial historical evidence to the contrary. On this basis, you may persuade some to reexamine the problem and try a different approach.
Notice that this challenge to the status quo is in the form of a thoughtful question.
The power of asking good questions
“What could we do differently to improve our software quality?”
The ultimate question is “What could we do differently to improve our software quality?”24 However…
Questions develop shared understanding of the culture before changing it
…there are many more questions necessary to help everyone understand why things are the way they are, what needs to change—and how.
Asking encourages thinking about a problem, possible solutions
Asking good questions includes people in the process of discovery and finding solutions, which develops their own knowledge and problem solving skills.
Good questions enable new information, perspectives, and ideas to emerge
Good questions enable people to share information, perspectives, and ideas that wouldn’t otherwise arise if they were only told what to think or do.
Asking what to do is more engaging than telling what to do—produces buy in
Taking the time to ask people questions pulls them into the change process, which increases their motivation to buy into any proposed changes that emerge.
- “If you want to cut a man’s hair, it is better if he is in the room.”
There is one catch, though.
If people don’t know where to begin, are stuck in old ways, or are under stress…direct near term guidance may be necessary.
Your audience may get stuck. They may currently have no idea about what to change or how—because they lack knowledge, experience, or imagination to consider approaches beyond the status quo. Or, as we’ll discuss shortly, they may be under incredible stress and unable to think clearly or creatively.
In that case, you may need to provide more direct guidance, at least in the beginning.25 So here are some tools for providing that guidance.
The Test Pyramid and Vital Signs
Lasting improvements to software quality—and Business as Usual—begin with making quality work and its impact visible at a fundamental level. The Test Pyramid and Vital Signs are concepts accessible to any team committed to this principle. Applying these models is the first step towards ending the suffering caused by the Normalization of Deviance, Groupthink, and the Arms Race.
Before getting into the details, let’s understand the specific problems we need to solve.
Working back from the desired experience
Inspired by Steve Jobs Insult Response
In this famous Steve Jobs video, he explains the need to work backward from the customer experience, not forward from the technology. So let’s compare the experience we want ourselves and others to have with our software to the experience many of us may have today.
What we want
We want to experience Delight from using and working on high quality software,26 which largely results from the Efficiency high quality software enables. Efficiency comes from the Confidence that the software is in good shape, which arises from the Clarity the developers have about system behavior.
What we have
However, we often experience Suffering from using or working on low quality software, reflecting a Waste of excess time and energy spent dealing with it. This Waste is the result of unmanaged Risk leading to lots of bugs and unplanned work. Bugs, unplanned work, Risk, and fear take over when the system’s Complexity makes it difficult for developers to fully understand the effect of new changes.
Difficulty in understanding changes produces drag—i.e., technical debt.
Difficulty in understanding how new changes could affect the system is the telltale sign of low internal quality, which drags down overall quality and productivity. Recall that this difference between actual and potential productivity, correlated with internal quality, is what we identified earlier as technical debt.
This contributes to the common scenario of a crisis emerging…
Replace heroics with a Chain Reaction!
…that requires technical heroics and personal sacrifice to avert catastrophe. To get a handle on avoiding such situations, we need to create the conditions for a positive Chain Reaction. By creating and maintaining the right conditions over time, we can achieve our desired outcomes without stress and heroics.
The main obstacle to replacing heroics with a Chain Reaction isn’t technology…
The challenge is belief—not technology
A little awareness goes a long way
Many of these problems have been solved for decades
Despite the fact that many quality and testing problems have been solved for decades…29
Many just haven’t seen the solutions, or seen them done well…
…many still haven’t seen those solutions, or seen them done well.30
The right way can seem easy and obvious—after someone shows you!
The good news is that these solutions can seem easy and obvious—after they’ve been clearly explained and demonstrated.31
What does the right way look like?
So how do we get started showing people what the right way to improve software quality looks like?
The Test Pyramid
A balance of tests of different sizes for different purposes
We’ll start with the Test Pyramid model,32,33 which represents a balance of tests of different sizes for different purposes. Realizing that tests can come in more than one size is often a major revelation to people who haven’t yet been exposed to the concept.34 It’s not a perfect model—no model is—but it’s an effective tool for pulling people into a productive conversation about testing strategies for the first time.35
| Size | Scope | Run by | Visibility | Dependencies | Control/Reliability/Independence | Resource usage/Maint. cost | Speed/Feedback loop | Confidence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Large | | | Details not visible | All | Low | High | Slow | Entire system |
| Medium | Components, services | Developers, some QA | Some details visible | As few as possible | Medium | Medium | Faster | Contract |
| Small | Functions, classes | Developers | All details visible | Few to none | High | Low | Fastest | Low level details, individual changes |
The Test Pyramid helps us understand how different kinds of tests give us confidence in different levels and properties of the system.36 It can also help us break the habit of writing large, expensive, flaky tests by default.37
Small tests are unit tests that validate only a few functions or classes at a time, with very few dependencies, if any. They often use test doubles38 in place of production dependencies to control the environment, making the tests very fast, independent, reliable, and cheap to maintain. Their tight feedback loop39 enables developers to quickly detect and repair problems that would be more difficult and expensive to detect with larger tests. They can also run in local and virtualized environments and can be parallelized.
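To make this concrete, here's a minimal sketch of a small test. The `PaymentService` and its hand-rolled test double are invented for illustration; nothing here comes from the slides:

```python
import unittest

class FakeGateway:
    """Test double standing in for a slow, flaky network dependency."""
    def __init__(self):
        self.charges = []

    def charge(self, amount_cents):
        # Record the call instead of talking to a real payment network.
        self.charges.append(amount_cents)
        return True

class PaymentService:
    def __init__(self, gateway):
        self.gateway = gateway  # dependency injected through a seam

    def pay(self, amount_cents):
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        return self.gateway.charge(amount_cents)

class PaymentServiceTest(unittest.TestCase):
    def test_pay_charges_gateway(self):
        fake = FakeGateway()
        service = PaymentService(fake)
        self.assertTrue(service.pay(500))
        self.assertEqual(fake.charges, [500])

    def test_pay_rejects_nonpositive_amounts(self):
        with self.assertRaises(ValueError):
            PaymentService(FakeGateway()).pay(0)
```

Because nothing touches the network, a suite of thousands of tests like these runs in seconds and can execute anywhere, including a laptop with no connectivity.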
Medium tests are integration tests that validate contracts and interactions with external dependencies or larger internal components of the system. While not as fast or cheap as small tests, by focusing on only a few dependencies, developers or QA can still run them somewhat frequently. They detect specific integration problems and unexpected external changes that small tests can’t, and can do so more quickly and cheaply than large system tests. Paired with good internal design, these tests can ensure that test doubles used in small tests remain faithful to production behavior.40
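One common way to keep test doubles honest, sketched here with hypothetical counter-store classes, is to run a single shared contract suite against both the real implementation and its fake, so the fake can't silently drift:

```python
import unittest

class RealCounterStore:
    """Stands in for a 'real' dependency; imagine a database-backed store."""
    def __init__(self):
        self._counts = {}

    def increment(self, key):
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key]

class FakeCounterStore:
    """In-memory double used by small tests elsewhere."""
    def __init__(self):
        self._counts = {}

    def increment(self, key):
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key]

class CounterStoreContract:
    """Shared assertions; subclasses supply the implementation under test."""
    def make_store(self):
        raise NotImplementedError

    def test_increment_starts_at_one(self):
        self.assertEqual(self.make_store().increment("a"), 1)

    def test_increment_counts_per_key(self):
        store = self.make_store()
        store.increment("a")
        self.assertEqual(store.increment("a"), 2)
        self.assertEqual(store.increment("b"), 1)

# The same contract runs against both implementations.
class RealCounterStoreTest(CounterStoreContract, unittest.TestCase):
    def make_store(self):
        return RealCounterStore()

class FakeCounterStoreTest(CounterStoreContract, unittest.TestCase):
    def make_store(self):
        return FakeCounterStore()
```

If production behavior changes, the real implementation's copy of the contract fails first; if the fake drifts, its copy fails, flagging every small test that relies on it.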
Large tests are full, end to end system tests, often driven through user interface automation or a REST API. They’re the slowest and most expensive tests to write, run, and maintain, and can be notoriously unreliable. For these reasons, writing large tests by default for everything is especially problematic. However, when well designed and balanced with smaller tests, they cover important use cases and user experience factors that aren’t covered by the smaller tests.
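As a rough illustration of a large test's shape, this sketch boots a toy HTTP service in-process and drives it through its public interface the way a client would. A real system test would launch the actual deployed stack; the handler and endpoint here are invented:

```python
import json
import threading
import unittest
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Toy stand-in for a whole service; replies to any GET."""
    def do_GET(self):
        body = json.dumps({"status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep test output quiet
        pass

class EndToEndTest(unittest.TestCase):
    def setUp(self):
        # Bind to an ephemeral port and serve from a background thread.
        self.server = HTTPServer(("127.0.0.1", 0), HealthHandler)
        self.port = self.server.server_address[1]
        threading.Thread(target=self.server.serve_forever, daemon=True).start()

    def tearDown(self):
        self.server.shutdown()

    def test_health_endpoint_reports_ok(self):
        url = f"http://127.0.0.1:{self.port}/health"
        with urllib.request.urlopen(url) as resp:
            self.assertEqual(resp.status, 200)
            self.assertEqual(json.loads(resp.read()), {"status": "ok"})
```

Even this toy version shows why large tests are costly: they need process and network setup, teardown, and real I/O, so a project can afford only a modest number of them.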
Thoughtful, balanced strategy == Reliability, efficiency
Each test size validates different properties that would be difficult or impossible to validate using other kinds of tests. Adopting a balanced testing strategy that incorporates tests of all sizes41 enables more reliable and efficient development and testing—and higher software quality, inside and out.
Inverted Test Pyramid
Many larger tests, few smaller tests
Of course, many projects have a testing strategy that resembles an inverted Test Pyramid, with too many larger tests and not enough smaller tests. This leads to a number of common problems:
Tests tend to be larger, slower, less reliable
The tests are slower and less reliable than they would be under a strategy that relied more on smaller tests.
Broad scope makes failures difficult to diagnose
Because large tests execute so much code, it might not be easy to tell what caused a failure.
Greater context switching cost to diagnose/repair failure
That means developers have to interrupt their current work to spend significant time and effort diagnosing and fixing any failures.
Many new changes aren’t specifically tested because “time”
Since most of the tests are large and slow, developers are incentivized to skip writing or running them because they “don’t have time.”
People ignore entire signal due to flakiness…
Worst of all, since large tests are more prone to be flaky,42 people will begin to ignore test failures in general. They won’t believe their changes cause any failures, since the tests were failing before—they might even be flagged as “known failures.”43 And as we’ll recall…
…fostering the Normalization of Deviance
…the Space Shuttle Challenger O-rings suffered from “known failures” as well, cultivating the Normalization of Deviance.
Let’s go over some of the reasons behind this situation, touching on some of the same reasons we covered before.
Features prioritized over internal quality/tech debt
People are often pressured to continue working on new features that are “good enough” instead of reducing technical debt. This may be especially true for organizations that set aggressive deadlines and/or demand frequent live demonstrations.44
“Testing like a user would” is more important
Again, if “testing like a user would” is valued more than other kinds of testing, then most tests will be large and user interface-driven.
Reliance on more tools, QA, or infrastructure (Arms Race)
This also tends to instill the mindset that the testing strategy isn’t a problem, but that we always need more tools, infrastructure, or QA headcount. This is the Arms Race mindset we discussed earlier.
Landing more, larger changes at once because “time”
Because the existing development and testing process is slow and inefficient, individuals try to optimize their productivity by integrating large changes at once. These changes are unlikely to receive either sufficient testing or sufficient code review, increasing the risk of bugs slipping through. It also increases the chance of large test failures that aren’t understood. The team is inclined to tolerate these failures, because there isn’t “time” to go back and redo the change the right way.
Lack of exposure to good examples or effective advocates
As mentioned before, many people haven’t actually witnessed or experienced good testing practices before, and no one is advocating for them. This instills the belief that the current strategy and practices are the best we can come up with.
We tend to focus on what we directly control—and what management cares about! (Groupthink)
In such high stress situations, it’s human nature to focus on doing what seems directly within our control in order to cope. Alternatively, we tend to prioritize what our management cares about, since they have leverage over our livelihood and career development. It’s hard to break out of a bad situation when feeling cornered—and too easy to succumb to Groupthink without realizing it.
So how do we break out of this corner—or help others to do so?
Quality work can be hard to see. It’s hard to value what can’t be seen—or to do much of anything about it.
We have to overcome the fundamental challenge of helping people see what internal quality looks like. We have to help developers, QA, managers, and executives care about it and to resist the Normalization of Deviance and Groupthink. We need to better show our quality work to help one another improve internal quality and break free from the Arms Race mindset.
In other words, internal quality work and its impact is a lot like The Matrix…
“Unfortunately, no one can be told what the Matrix is. You have to see it for yourself.”
—Morpheus, The Matrix
One way to start showing people The Matrix is to get buy-in on a set of…
…“Vital Signs.” Vital Signs are…
A collection of signals designed by a team to reflect quality and productivity and to rapidly diagnose and resolve problems
Comprehensive and make sense to the team and all stakeholders.
They should be comprehensive and make sense at a high level to everyone involved in the project, regardless of role.
Not merely metrics, goals, or data
We’re not collecting them for the sake of saying we collect them, or to hit a goal one time and declare victory.45
Information for repeated evaluation
We’re collecting them because we need to evaluate and understand the state of our quality and productivity over time.
Inform decisions whether or not to act in response
These evaluations will inform decisions regarding how to maintain the health of the system at any moment.
Some common signals include:
Pass/fail rate of continuous integration system
The tests should almost always pass, but failures should be meaningful and fixed immediately.
Size, build and running time, and stability of small/medium/large test suites
The faster and more stable the tests, the fewer resources they consume, and the more valuable they are.
Size of changes submitted for code review and review completion times
Individual changes should be relatively small, and thus easier and faster to review.
Code coverage from small to medium-small test suites
Each small-ish test should cover only a few functions or classes, but the overall coverage of the suite should be as high as possible.46
Passing use cases covered by medium-large to large and manual test suites
For larger tests, we’re concerned about whether higher level contracts, use cases, or experience factors are clearly defined and satisfied before shipping.
Number of outstanding software defects and Mean Time to Resolve
Tracking outstanding bugs is a very common and important Vital Sign. If you want to take it to the next level, you can also begin to track the Mean Time to Resolve47 these bugs. The lower the time, the healthier the system.
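Mean Time to Resolve is simple arithmetic over the defect tracker's timestamps. Here's an illustrative sketch; the data is invented:

```python
from datetime import datetime, timedelta

def mean_time_to_resolve(defects):
    """defects: iterable of (opened, resolved) datetime pairs for closed bugs."""
    durations = [resolved - opened for opened, resolved in defects]
    if not durations:
        return timedelta(0)
    return sum(durations, timedelta(0)) / len(durations)

bugs = [
    (datetime(2023, 1, 2, 9), datetime(2023, 1, 2, 17)),  # resolved in 8 hours
    (datetime(2023, 1, 3, 9), datetime(2023, 1, 4, 9)),   # resolved in 24 hours
]
print(mean_time_to_resolve(bugs))  # 16:00:00
```

Watching this number trend down (or up) over weeks says more about system health than any single week's bug count.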
Other potentially meaningful signals
Some other potentially meaningful signals include…
Static analysis findings (e.g., complexity, nesting depth, function/class sizes)
Popular source control platforms, such as GitHub, can incorporate static analysis findings directly into code reviews as well. This encourages developers to address findings before they land in a static analysis platform report.48
Dependency counts, build times, and test times
Dependencies contribute to system and test complexity, which contributes to build and test times. Cutting unnecessary dependencies and better managing necessary ones can yield immediate, substantial savings.
Power, performance, latency
These user experience signals aren’t caught by traditional automated tests that evaluate logical correctness, but are important to monitor.
Anything else the team finds useful for its purposes
As long as it’s a clear signal that’s meaningful to the team, include it in the Vital Signs portfolio.
Use them much like production telemetry
Treat Vital Signs like you would any production telemetry that you might already have.
Keep them current and make sure the team pays attention to them.
Clearly define acceptable levels—then achieve and maintain them.
Identify and respond to anomalies before urgent issues arise.
Encourage continuous improvement—to increase productivity and resilience.
Use them to tell the story of the system’s health and team culture.
To this last point, we’ll return to the importance of storytelling later.
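As a sketch of "clearly define acceptable levels" and "identify and respond to anomalies," here is a minimal, hypothetical check. The signal names and thresholds are invented placeholders; a real team would define its own:

```python
# Acceptable levels, agreed on by the team, one predicate per Vital Sign.
ACCEPTABLE = {
    "ci_pass_rate": lambda v: v >= 0.99,        # tests should almost always pass
    "small_suite_seconds": lambda v: v <= 60,   # fast feedback loop
    "median_review_hours": lambda v: v <= 24,   # small changes, quick reviews
    "open_defects": lambda v: v <= 10,
}

def anomalies(snapshot):
    """Return the names of Vital Signs outside their acceptable levels."""
    return [name for name, ok in ACCEPTABLE.items()
            if name in snapshot and not ok(snapshot[name])]

this_week = {"ci_pass_rate": 0.97, "small_suite_seconds": 45,
             "median_review_hours": 36, "open_defects": 4}
print(anomalies(this_week))  # ['ci_pass_rate', 'median_review_hours']
```

The point isn't the code; it's that the team has written its definition of "healthy" down, so anomalies are detected by agreement rather than argued about after the fact.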
Example usage: issues, potential causes (not exhaustive!)
Here are a few hypothetical examples of how Vital Signs can help your team identify and respond to issues.
Builds 100% passing, high unit test coverage, but high software defects
If your builds and code coverage are in good shape, but you’re still finding bugs…
- Maybe gaps in medium-to-large test coverage, poorly written unit tests
…it could be that you need more larger tests. Or, it could be your unit tests aren’t as good as you think, executing code for coverage but not rigorously validating the results.
Low software defects, but schedule slipping anyway
If you don’t have many bugs, but productivity still seems to be dragging…
- Large changes, slow reviews, slow builds+tests, high dependency fan out
…maybe people are still sending huge changes to one another for review. Or maybe your build and test times are too slow, possibly due to excess dependencies.
Good, stable, fast tests, few software defects, but poor app performance
Maybe builds and tests are fine, and there are few if any bugs, but the app isn’t passing performance benchmarks.
- Discover and optimize bottlenecks—easier with great testing already in place!
In that case, your investment in quality practices has paid off! You can rigorously pursue optimizations, without the fear that you’ll unknowingly break behavior.
Getting started, one small step at a time
Here are a few guidelines for getting started collecting Vital Signs. First and foremost…
Don’t get hung up on having the perfect tool or automation first.
Do not get hung up on thinking you need special tools or automation at the beginning. You may need to put some kind of tool in place if you have no way to get a particular signal. But if you can, collect the information manually for now, instead of wasting time flying blind until someone else writes your dream tool.
Start small, collecting what you can with tools at hand, building up over time.
You also don’t need to collect everything right now. Start collecting what you can, and plan to collect more over time.
Focus on one goal at a time: lowest hanging fruit; biggest pain point; etc.
As for which Vital Signs to start with, that’s totally up to you and your team. You can start with the easiest signals, or the ones focused on your biggest pain points—it doesn’t matter. Decide on a priority and focus on that first.
Update a spreadsheet or table every week or so—manually, if necessary.
If you don’t have an automated collection and reporting system handy, then use a humble spreadsheet or wiki table. Spend a few minutes every week updating it.
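For instance, a tiny script (or a manual routine of the same shape) can append each week's numbers to a CSV file. Everything here, filename and field names included, is hypothetical:

```python
import csv
from datetime import date
from pathlib import Path

FIELDS = ["week", "ci_pass_rate", "test_runtime_s", "open_defects"]

def record_week(path, **signs):
    """Append one row of Vital Signs, writing a header on first use."""
    new_file = not Path(path).exists()
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({"week": date.today().isoformat(), **signs})

record_week("vital_signs.csv", ci_pass_rate=0.98,
            test_runtime_s=210, open_defects=7)
```

A few minutes a week spent on a log like this is enough to start conversations; fancier dashboards can come later, once the habit proves its value.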
Observe the early dynamics between the information and team practice.
Discuss these updates with your team, and see how it begins to shift the conversation—and the team’s behavior.
Then make a case for tool/infrastructure investment based on that evidence.
Once you’ve got evidence of the value of these signals, then you can justify and secure an investment in automation.49
Building a Software Quality Culture
By asking good questions, spreading awareness of the Test Pyramid, and making internal quality visible via Vital Signs, you’re shifting the culture of your team. The next step is to change the culture of your organization.
First, let’s define specifically what we mean by culture.50 One possible definition is that:
Culture is the shared lifestyle of a team or organization.
This lifestyle is what you see people doing together day to day, and the way they do it. For our purposes, we need to understand the essence of lifestyle, where it comes from and what shapes it. So here’s an expanded definition of “culture”:
Culture is the emergent result of a shared mindset manifest through concrete behaviors.
In order to influence lifestyle, which is the result, we have to influence concrete behaviors. In order to influence those, we need to influence people’s mindset. The most effective way to influence mindsets is to…
People don’t like being told to change their behaviors, because it’s like being told to change their minds. If you know anything about people, you know they hate changing their minds unless they are doing the changing, by their own choice.
This is why we’ve emphasized asking questions, raising awareness, and working together to make quality visible—instead of imposing process changes or technical solutions through force. People need to understand and buy into changes in order to embrace them fully. We can’t force them to make changes they don’t perceive as necessary or valuable if we want the change to be successful.
Of course, not everyone’s going to change their mindset at once—some may never come around at all. However, our ultimate goal should be to…
Make the right thing the easy thing!51
As we continue working to improve quality and make it visible, it will get easier and easier to do both. Practices and their results will become more accessible, encouraging wider and wider adoption. Eventually, we want to make it harder not to do the right thing, because the right thing will happen by default.
This will be challenging, and take time. It’s important to identify the right people to engage directly in the beginning, when we’re starting the process, and who to put off until later.
Focus on the Early Majority/Total Product
Geoffrey A. Moore, Crossing the Chasm, 3rd Edition
The “Crossing the Chasm” model from Geoffrey Moore’s book of the same name can help us make that identification.52 There are many nuances to it, but at a high level, it illustrates how different segments of a population respond to a particular innovation.
Innovators and Early Adopters are like-minded seekers, enthusiasts and visionaries who together bring an innovation to the market and lead people to adopt it. I like to lump them together and call them Instigators.
The Early Majority are pragmatists who are open to the new innovation, but require that it be accessible and ready to use before adopting it.
The Late Majority are followers waiting to see whether or not the innovation works for the Early Majority before adopting it.
Laggards are the resisters who feel threatened by the innovation in some way and complain about it the most. They may potentially raise valid concerns, but often they only bluster to rationalize sticking with the status quo.53
The Instigators face the challenge of bringing an innovation across The Chasm separating them from the Early Majority, developing what Moore calls The Total Product. Developing the Total Product requires that the Instigators identify and fulfill several needs the Early Majority have in order to facilitate adoption.
As Instigators leading people to improve software quality and make it visible, focus your energy on connecting with other Instigators and the early Early Majority. Don’t worry so much about the rest—focus on delivering the Total Product, and it will take care of the other groups.
The Rainbow of Death
This connection across the chasm isn’t part of the original Chasm model, but one I borrowed from a friend54 and called “The Rainbow of Death.” It helps illustrate those Early Majority needs the Instigators must satisfy. Doing so transforms the Early Majority from being dependent on the Instigators’ expertise into independent experts themselves.
My talk “The Rainbow of Death,” linked here, uses this model to tell the story of Google’s Testing Grouplet. It extracts order and clarity from five years of chaos, evolution, and eventually revolution.55 However, I’ve realized that while it’s a great storytelling device, it’s too complicated to apply at the beginning of the change process.56
Instigating Culture Change
Essential needs an internal community must support
Instead, once you’ve started forming an internal community of like-minded fellow Instigators, it’s best to focus on these essential needs and simplify your initial efforts:57
- Individual Skill Acquisition
- Team/Organizational Alignment
- Quality Work/Results Visibility
Supporting each of these needs also helps support the other two, creating a virtuous cycle. No matter where you decide to focus as a starting point, the cycle gains momentum with every contribution of effort.58 However, it is important to focus on getting one effort fully in motion before trying to launch the next one.59
I’m going to give an overview of each of these needs and some ideas of how to address them. You can use these suggestions to determine a starting point, and to start percolating ideas for future efforts.
Individual Skill Acquisition
Help individuals incorporate principles, practices, language
Quality begins with the choices each of us make as individuals throughout our day. Awareness of sound quality principles and practices improves the quality of these choices. Developing a common language makes these principles and practices visible, so we can show them to one another, helping raise everyone’s quality game.
- Training, documentation, internal media, mentorship, sharing examples
We can offer training, documentation, and other internal media to spread awareness. We can also offer direct mentorship or share examples from our own experience to help others learn.
Now here’s a quick, high level summary of a few key principles and practices to help developers write better code and tests.
Testable code/architecture is maintainable—tests add design pressure, enable continuous refactoring; use code coverage as a tool, not a goal
Designing code for testability, given proper guidance on principles and techniques, adds design pressure that yields higher quality code in general. Having good tests then enables constant improvements to code quality through continuous refactoring, instead of stopping the world for complex, risky overhauls or rewrites.60
Good tests also enable developers to use code coverage as a tool while refactoring, helping ensure new and improved code replaces the previous code.61
Stop copy/pasting code; send several small code reviews, not one big one
Two common habits that contribute to worse code quality are duplicating code62 and submitting large changes for review.63 These habits make code difficult to read, test, review, and understand, which hides bugs and makes them difficult to find and fix after they’ve shipped. Helping people write testable code also helps people break these costly bad habits.
Tests should be designed to fail: naming and organization can clarify intent and cause of failure; use Arrange-Act-Assert (a.k.a. Given-When-Then)
The goal of testing isn’t to make sure tests always pass no matter what. The goal is to write tests that let us know, reliably and accurately, when our expectations of the code’s behavior differ from reality. Therefore, we should apply as much care to the design, naming, and organization of our tests as we do to our production code.64
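Here's a brief sketch of Arrange-Act-Assert paired with a behavior-revealing test name, using an invented apply_discount function:

```python
import unittest

def apply_discount(price_cents, percent):
    """Return the price after applying a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents * (100 - percent) // 100

class ApplyDiscountTest(unittest.TestCase):
    def test_full_discount_yields_zero_price(self):
        # Arrange (Given): a price and a 100% discount
        price, percent = 1999, 100
        # Act (When): the discount is applied
        result = apply_discount(price, percent)
        # Assert (Then): the price drops to zero
        self.assertEqual(result, 0)

    def test_out_of_range_discount_is_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(1999, 101)
```

When a test like `test_full_discount_yields_zero_price` fails, the name alone tells the reader which expectation broke, before anyone opens a debugger.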
Interfaces/seams enable composition, dependency breaking w/ test doubles
Of course, many of us start out in legacy code bases with few tests, if any.65 So we also need to teach how to make safe changes to existing code that enable us to begin improving code quality and adding tests. Michael Feathers’s Working Effectively with Legacy Code is the seminal tome on this subject, showing how to gently break dependencies to introduce seams. “Seams” are points at which we introduce abstract interfaces that enable test doubles to stand in for our dependencies, making tests faster and more reliable.66
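Here is a minimal before-and-after sketch of introducing a seam, with invented names rather than an example from Feathers's book. The legacy function hard-wires its dependency; the refactored one accepts it as a parameter, so a test double can stand in:

```python
import unittest

def send_reminder_legacy(user):
    # Before: the dependency is hard-wired inside the function, so it's
    # untestable without actually sending email, e.g.:
    #   smtp = smtplib.SMTP("mail.example.com")
    #   smtp.sendmail(...)
    raise NotImplementedError("talks to a real mail server")

def send_reminder(user, mailer):
    """After: the mailer arrives through a seam, so tests can substitute
    a double for the real SMTP client."""
    mailer.send(user["email"], f"Reminder for {user['name']}")

class RecordingMailer:
    """Test double that records sends instead of performing them."""
    def __init__(self):
        self.sent = []

    def send(self, to, message):
        self.sent.append((to, message))

class SendReminderTest(unittest.TestCase):
    def test_reminder_goes_to_users_address(self):
        mailer = RecordingMailer()
        send_reminder({"email": "a@example.com", "name": "Ada"}, mailer)
        self.assertEqual(mailer.sent,
                         [("a@example.com", "Reminder for Ada")])
```

The change is mechanical and low-risk, which is exactly why it's a safe first move inside untested legacy code.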
“Make interfaces easy to use correctly and hard to use incorrectly.”
To propose a slight update to make it more concrete:
“Make interfaces easy to use correctly and hard to use incorrectly—like an electrical outlet.”
Of course, it’s not impossible to misuse an electrical outlet, but it’s a common, wildly successful example that people use correctly most of the time.67 Making software that easy to use or change correctly and as hard to do so incorrectly may not always be possible—but we can always try.
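As a small, hypothetical illustration of that principle: replacing a raw flag with a dedicated type makes incorrect calls fail immediately instead of silently doing the wrong thing.

```python
from enum import Enum

class Mode(Enum):
    DRY_RUN = "dry_run"
    EXECUTE = "execute"

def deploy(service: str, mode: Mode) -> str:
    # Rejecting raw strings/booleans makes a mistaken call fail right away,
    # instead of silently deploying when the caller meant a dry run.
    if not isinstance(mode, Mode):
        raise TypeError("mode must be a Mode, e.g. Mode.DRY_RUN")
    action = "would deploy" if mode is Mode.DRY_RUN else "deploying"
    return f"{action} {service}"

print(deploy("search-api", Mode.DRY_RUN))  # would deploy search-api
```

Compare a call site reading `deploy("search-api", True)`: no reader can tell what `True` means, and swapping it for `False` compiles and runs without complaint. `Mode.DRY_RUN` is both self-describing and hard to plug in backwards.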
Get everyone speaking the same language
Living up to that standard is a lot easier when the people you work with also consider it a priority.68 Once people gain the new insights, skills, and language we’ve discussed, team and organizational alignment creates the space necessary to apply this new knowledge successfully.
DONE TO HERE
The following notes are incomplete. I’ll expand them in the coming days and eventually announce the completed artifacts in a blog post.
Internal media, roadmap programs, presentations, advocates
Understanding between devs, QA, project managers/owners, executives
Focus and simplify! (Don’t swallow the elephant—leave some for later!)
Every earlier success lays a foundation and creates space for future effort
Absorb influences like a musician/band—then create your own voice/style
Calls to Action
TODO: Maybe a footnote about “Do Hard Things” and the typical view of building “mental toughness” having the exact opposite effect.
TODO: Connect Lewin’s model of change to “focus and simplify” and the “quality culture cycle” as well.
TODO: Maybe a footnote discussing “glue work,” how I understand the sentiment of the conclusion but feel we can do better by discussing, prioritizing leadership generally.
I appreciate all the folks who’ve contributed to this presentation!
And my fellow QCI Instigators at Apple for your past wisdom—
you know who you are!
(And you know that I know who you are!)
2023-01-17: Presented to Microsoft at the invitation of Ono Vaticone, a former Apple colleague and Quality Culture Initiative member.
Joel Schwartzberg’s Get to the Point! Sharpen Your Message and Make Your Words Matter inspired me to articulate this clear, concise point up front. ↩
David Marchese’s interview with Cal Newport for the New York Times on 2023-01-23, The Digital Workplace Is Designed to Bring You Down, bears mentioning here. Newport notes that with the rise of “knowledge work”, “we fell back to a proxy for productivity, which is visible activity.” Then:
“Visible activity as a proxy for productivity spiraled out of control and led to this culture of exhaustion, of I’m working all the time, I’m context shifting all over the place, most of my work feels performative, it’s not even that useful.”
He also noted Peter Drucker’s coining of the term “knowledge work” in 1959 and the consequences for management:
“So Drucker is saying that knowledge workers need to manage themselves. Managers just need to set them up to succeed. But then what do you manage? Visible activity as a proxy for productivity was the solution. We need something we can focus on day to day and feel that we’re having a role in pushing work: Let’s just manage visible activity. It’s this compromise that held the pieces together in an imperfect way, and then in the last 20 years, this centrifuge of digital-accelerated work blew it apart. The compromise is now failing.”
So there is a danger that trying to make work visible could dissolve into productivity theatre. At the same time, Newport unpacks his concept of “slow productivity,” the topic of his next book [emphasis mine]:
“So how do you actually work with your mind and create things of value? What I’ve identified is three principles: doing fewer things, working at a natural pace,9 but obsessing over quality. That trio of properties better hits the sweet spot of how we’re actually wired and produces valuable meaningful work, but it’s sustainable.”
9 Meaning one with more variability in intensity than the always-on pace to which we’ve become accustomed.
This presentation walks the line between making visible the aspects of our work that truly speak to software quality, and superficial displays of productivity. People often want to jump straight to solutions, and start generating performative “data” to prove their value. In doing so, they fail to grasp the underlying issues and end up continuing the negative cycle of increasing effort yielding decreasing quality.
We first need to help people get a handle on the issues and understand what we need to accomplish. This is why this talk makes the case for software quality and illustrates its obstacles before discussing solutions. It’s also why the solutions offered are rudimentary guidelines and techniques for inviting nuanced discussion and developing shared understanding that grows over time.
The punchline being, in the end, improving software quality is about leadership far more than it is about technology. Leadership requires helping people clearly see principles in action and getting results, so that they may learn from the example and achieve similar success. Hence, though making quality work visible may remain an imperfect practice involving trade-offs and compromises, it’s essential to improving software quality broadly across organizations. ↩
He mentions the fact that cruft and technical debt are basically the same in a sidebar, but it’s not on his graph. ↩
In my talk Automated Testing—Why Bother?, I go into several more reasons why automated testing helps developers understand the system, particularly when responding to failures. These include better managing the focusing illusion, the Zeigarnik effect, the orienting response, and the OODA Loop. (I learned about all of these except for the OODA Loop from Dr. Robert Cialdini’s Pre-Suasion: A Revolutionary Way to Influence and Persuade.) ↩
This seems somewhat ironic, since he invited me to publish Goto Fail, Heartbleed, and Unit Testing Culture on his website in 2014. It doesn’t focus only on the professionalism angle, but it emphasizes it heavily. He published the “Cost” article in 2019, reflecting an apparent evolution in his thinking.
I’m not criticizing Martin, or his argument—I’m rather grateful he came up with this brilliant angle, and explained it so thoughtfully and clearly. It’s incredibly helpful to move the conversation forward. I’m just not willing to abandon the “moralistic” appeal to professionalism, either. We need both.
In fact, I’d claim that a sense of professionalism necessarily precedes sound economic arguments in general. Raw economics doesn’t care about professionalism, but pragmatic professionals have to find a way to align the economics with their professional standards. That’s exactly what Martin did with this article.
Also, though he didn’t explicitly state this, it’s possible he meant “professionalism” in terms of “quality for its own sake” or “pride in one’s work.” Whereas the angle in my article, and in the slides to follow, is “professionalism” in terms of social responsibility, which also has an economic impact. I do believe in quality for its own sake and having pride in one’s work, but that’s not the appeal I tend to make, either.
Law 13: When asking for help, appeal to people’s self-interest, never to their mercy or gratitude
Though that title speaks specifically about gaining someone’s favor, the general principle of appealing to someone’s self-interest to motivate their behavior holds. That said, the “Reversal” section at the end of the chapter on Law 13 states:
You must distinguish the differences among powerful people and figure out what makes them tick. When they ooze greed, do not appeal to their charity. When they want to look charitable and noble, do not appeal to their greed.
My interpretation of this principle in this context is: Don’t go all in on either the economic argument or appeals to professionalism. Use both, and presented well, I think they serve to reinforce one another.
So while I understand why Martin has taken the position he has, I’m slightly saddened by it. Or, if he’s responding to certain aspects of professionalism without distinguishing from the others, I’m only sad that he was uncharacteristically unclear on that point. Morals aren’t the only concern, but neither should economics be—alignment between them, rather than abandonment of one for the other, yields the best outcomes. ↩
People mostly had no experience with testing outside of the slowness and brittleness of the status quo, and were under constant delivery pressure while feeling intimidated by many of their peers. Who could blame them for not testing when they couldn’t afford the time to learn?
Economists call this “temporal discounting”: if someone is presented with the option to ship a feature now without tests, or to prevent a possible future problem through an investment in testing, they’ll tend to ship and hope for the best. Add the fact that the ever-slowing tools made it impossible to reach a state of flow, and this combination of immediate pain and slow feedback in pursuit of a distant, unclear benefit made the “right” thing far harder than it needed to be.
Shortly after joining one team, I presented to my teammates my vision for improved testing adoption across the company and what it would take. One of my teammates said to me in this meeting, “…but unit testing is easy!” Caught off guard, my immediate impulse—which I didn’t catch in time—was to laugh out loud at this statement. I immediately apologized and explained that, yes, it isn’t that hard once you get used to it—but many haven’t yet learned good basic techniques. (I cover this a little later in this talk.)
Of course, my apology meant nothing—the damage was done. This teammate and I never ended up really seeing eye to eye. Per the Crossing the Chasm model covered later in “Building a Software Quality Culture,” I moved on rather than continuing to engage with this Laggard. ↩
I have to admit, this rant was inspired by coming across Tim Bray’s Testing in the Twenties. (I found it by way of Martin Fowler’s On the Diverse And Fantastical Shapes of Testing, which I cite in a later Test Pyramid footnote.) I strongly agree with the article for the most part (especially the “Coverage data” section), but it shits the bed with the “No religion” comments. I even agree with the main points contained in those comments. However, setting them up in opposition to “religion,” “ideology,” “pedantic arm-waving,” “TDD/BDD faith,” etc., brings an unnecessarily negative emotional charge to the argument. It would be much stronger, and more effective, without them.
Note that Bray’s article is strongly in favor of developers writing effective automated tests. That said, painting people who talk about test doubles and practice TDD as belonging to an irrational tribe (while implying one’s own superiority) is harmful. I’m sorely disappointed that this otherwise magnificent barrel full of wine contains this spoonful of sewage. (A saying I got from the “A Spoonful of Sewage” chapter of Beautiful Code.) ↩
I first learned about this concept from an Apple internal essay on the topic. ↩
The full title of chapter six is Chapter VI: An Accident Rooted in History. The data comes from [129-131] Figure 2. O-Ring Anomalies Compared with Joint Temperature and Leak Check Pressure. It lists 25 Space Shuttle launches, ending with STS 51-L. It indicates O-ring anomalies (erosion or blow-by) in 17 of the 24 launches (70%) prior to STS 51-L. In the 17 missions prior, starting with STS 41-B on 1984-02-03, there were 14 anomalies (82%). ↩
From Chapter VII: The Silent Safety Program, excerpts from “Trend Data” [155-156]:
As previously noted, the history of problems with the Solid Rocket Booster O-ring took an abrupt turn in January, 1984, when an ominous trend began. Until that date, only one field joint O-ring anomaly had been found during the first nine flights of the Shuttle. Beginning with the tenth mission, however, and concluding with the twenty-fifth, the Challenger flight, more than half of the missions experienced field joint O-ring blow-by or erosion of some kind….
This striking change in performance should have been observed and perhaps traced to a root cause. No such trend analysis was conducted. While flight anomalies involving the O-rings received considerable attention at Morton Thiokol and at Marshall, the significance of the developing trend went unnoticed. The safety, reliability and quality assurance program, of course, exists to ensure that such trends are recognized when they occur….
Not recognizing and reporting this trend can only be described, in NASA terms, as a “quality escape,” a failure of the program to preclude an avoidable problem. If the program had functioned properly, the Challenger accident might have been avoided.
The NASA Office of Safety & Mission Assurance site has other interesting artifacts, including:
This latter artifact is a powerfully concise distillation of lessons from the Rogers report. A couple of excerpts:
- Launch day temperatures as low as 22 °F at Kennedy Space Center.
- Thiokol engineers had concerns about launching due to the effect of low temperature on O-rings.
- NASA Program personnel pressured Thiokol to agree to the launch.
- We cannot become complacent.
- We cannot be silent when we see something we feel is unsafe.
- We must allow people to come forward with their concerns without fear of repercussion.
If you check out the Wilcutt and Bell presentation, and follow the “Symptoms of Groupthink” Geocities link, do not click on anything on that page. It’s long since been hacked. ↩
In Automated Testing—Why Bother?, I define automated testing as: “The practice of writing programs to verify that our code and systems conform to expectations—i.e. that they fulfill requirements and make no incorrect assumptions.” ↩
A further thought: Trusting tools like compilers to faithfully translate high-level code to machine code, and to optimize it, is one thing. Compilers are largely deterministic and relatively well understood. AI models are quite another, far more inscrutable, far less trustworthy instrument.
- Technical Competence: Is it safe?
- Organizational Clarity: Is it the right thing to do?
Maybe one day we’ll trust AI with the first question. I’m not so sure we’ll ever be able to trust it with the second. ↩
Feynman’s entire appendix is worth a read, but here’s another striking passage foreshadowing Wilcutt and Bell’s “lack of bad outcomes” assertion:
There are several references to flights that had gone before. The acceptance and success of these flights is taken as evidence of safety. But erosion and blow-by are not what the design expected. They are warnings that something is wrong. The equipment is not operating as expected, and therefore there is a danger that it can operate with even wider deviations in this unexpected and not thoroughly understood way. The fact that this danger did not lead to a catastrophe before is no guarantee that it will not the next time, unless it is completely understood. When playing Russian roulette the fact that the first shot got off safely is little comfort for the next.
I learned this principle from an Apple internal essay. ↩
My former Google colleague Alex Buccino made a good point during a conversation on 2023-02-01 about what “delight” means to a certain class of programmers. He noted that it often entails building the software equivalent of a Rube Goldberg machine for the lulz—or witnessing one built by another. I agreed with him that, maybe for that class of programmers, we should just focus on “clarity” and “efficiency”—which necessarily excludes development of such contraptions. ↩
If people believe they lack the knowledge and power to solve a problem, they won’t even think of trying to solve it.
Actually…I’ve yet to start the book at the time I’m writing this sentence. I learned about it from Harvard expert on the worst thing about New Year’s resolutions—and how to beat it: ‘A profound loss of energy’ (CNBC, 2022-12-31). That article quotes Lahey’s four-step process to “breaking our resistance to change”:
- Identify your actual improvement goal, and what you’d need to do differently to achieve it.
- Look at your current behaviors that work against your goal.
- Identify your hidden competing commitments.
- Identify big assumptions about how the world works that drive your resistance to change.
Assumptions, until identified, are essentially unspoken or unconscious beliefs. ↩
In the interview from an earlier footnote, The Digital Workplace Is Designed to Bring You Down, Cal Newport makes an observation relevant to this point:
“If we look through the history of the intersection of technology and commerce, we always see something similar, which is: When disruptive technology comes in, it takes a long time to figure out the best way to use it. There’s this case study5 from a Stanford economist about the introduction of the electric motor into the factory. He characterizes how long it takes before we figure out what was in hindsight the obvious way to use electric motors in factories, which was to put a small motor in every piece of equipment so I can control my equipment at exactly the level I want to use it. We didn’t do that for 20 or 30 years.”
5 Paul A. David’s “Computer and Dynamo: The Modern Productivity Paradox in a Not-Too-Distant Mirror,” published in 1989.
In other words, known solutions still take time to sink in and become so obvious and easy to use that they become common practice. So maybe we’re approaching the tipping point as I write this sentence on January 23, 2023, as unit testing is over 30 years old. ↩
In Software at Scale 53 - Testing Culture with Mike Bland, I discuss with Utsav Shah why effective testing practices haven’t yet caught on everywhere. It seems part of the human condition is that wisdom passed down through the ages still requires that individuals seek it out. Good examples and teachers can help, but those aren’t always accessible to everyone, at least not without some self-motivated effort to find them.
By way of analogy, I mentioned just having read the Bhagavad Gita, which on the surface is mortifying by today’s standards. The warrior Arjuna doesn’t want to go to war against his own family, and the supreme being, Krishna, convinces him that it’s his duty. However, read as only a metaphor for profound internal conflict and doubt, which was accessible to the audience of the day, the message is reassuring. But it takes a willfully open mind to derive such value.
On top of that, one of the main lessons is that one should feel attached to doing one’s work—but not to the outcomes. This is a pretty common theme, also running through The Daily Stoic, which I also recently finished. Other traditions, notably Buddhism and Taoism, also teach detachment from outcomes and other things beyond one’s control generally.
However, despite this message being developed in multiple ancient cultures and spreading throughout history, tradition, and literature, people still struggle with attachment to this day. The essence of such wisdom isn’t necessarily complicated, but it’s often obscured by other natural preoccupations of both individuals and cultures.
This doesn’t contradict Cal Newport’s observation above on the time it takes for organizations to assimilate new technologies. It perhaps helps explain, at least in part, why it takes so long. ↩
The pyramid was later popularized by Mike Cohn in The Forgotten Layer of the Test Automation Pyramid (2009) and Succeeding with Agile: Software Development Using Scrum (2009). I’m not sure whether Mike had seen the Noogler lecture slide or had conceived of the idea independently, but he definitely visited Google while I was there. ↩
The Testing Grouplet introduced the Small, Medium, Large nomenclature as an alternative to “unit,” “integration,” “system,” etc. This was because, at Google in 2005, a “unit” test was understood to be any test lasting less than five minutes. Anything longer was considered a “regression” test. By introducing new, more intuitive nomenclature, we inspired productive conversations by rigorously defining the criteria for each term, in terms of scope, dependencies, and resources.
The Bazel Test Encyclopedia and Bazel Common definitions use these terms to define maximum timeouts for tests labeled with each size. Neither document speaks to the specifics of scope or dependencies, but they do mention “assumed peak local resource usages.” ↩
Some have advocated for a different metaphor, like the “Testing Trophy” and so on, or for no metaphor at all. I understand the concern that the Test Pyramid may seem overly simplistic, or potentially misleading should people infer “one true test size proportion” from it. I also understand Martin Fowler’s concerns from On the Diverse And Fantastical Shapes of Testing, which essentially argues for using “Sociable vs. Solitary” tests. His preference rests upon the relative ambiguity of the terms “unit” and “integration” tests.
However, I feel this overcomplicates the issue while missing the point. Many people, even with years of experience in software, still think of testing as a monolithic practice. Many still consider it “common sense” that testing shouldn’t be done by the people writing the code. As mentioned earlier, many still think “testing like a user would” is “most important.” Such simplistic, unsophisticated perspectives tend to be resistant to nuance. People holding them need clearer guidance into a deeper understanding of the topic.
The Test Pyramid (with test sizes) is an accessible metaphor for such people, who just haven’t been exposed to a nonmonolithic perspective on testing. It honors the fact that we were all beginners once (and still are in areas to which we’ve not yet been exposed). Once people have grasped the essential principles of the Test Pyramid model, it becomes much easier to have a productive conversation about effective testing strategy: sociable vs. solitary testing, the right balance of test sizes for a specific project, and so on. ↩
Thanks to Oleksiy Shepetko for mentioning the maintenance cost aspect during my 2023-01-17 presentation to Ono Vaticone’s group at Microsoft. It wasn’t in the table at that time, and adding it afterward inspired this new, broad, comprehensive table layout. ↩
Test doubles are lightweight, controllable objects implementing the same interface as a production dependency. This enables the test author to isolate the code under test and control its environment very precisely via dependency injection. (“Dependency injection” is a fancy term for passing an object encapsulating a dependency into the code that uses it, as a constructor or function argument. This replaces the need for that code to instantiate or access the dependency directly.)
I also defined them on slide 49 of Automated Testing—Why Bother?:
Test doubles are substitutes for more complex objects in an automated test. They are easier to set up, easier to control, and often make tests much faster thanks to the fact that they do not have the same dependencies as real production objects.
The various kinds of test doubles are:
- Dummy: A placeholder value with no bearing on the test other than enabling the code to compile
- Stub: An object programmed to return a hardcoded or trivially computed value
- Spy: A stub that can remember how many times it was called and with which arguments
- Mock: An object that can be programmed to validate expected calls in a specific order, as well as return specific values
- Fake: An object that fully simulates a production dependency using a less complicated and faster implementation (e.g., in memory database or file system, local HTTP server)
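To make these categories concrete, here is a minimal Python sketch using a hypothetical Mailer dependency. (The class names and interface are purely illustrative, not from any real library.)

```python
class Mailer:
    """Production interface: sends a message, returns True on success."""
    def send(self, to, body):
        raise NotImplementedError  # the real version would talk to an SMTP server


class StubMailer(Mailer):
    """Stub: returns a hardcoded value."""
    def send(self, to, body):
        return True


class SpyMailer(StubMailer):
    """Spy: a stub that also remembers how it was called."""
    def __init__(self):
        self.calls = []

    def send(self, to, body):
        self.calls.append((to, body))
        return True


class FakeMailer(Mailer):
    """Fake: a working, lightweight implementation (an in-memory outbox)."""
    def __init__(self):
        self.outbox = []

    def send(self, to, body):
        self.outbox.append((to, body))
        return True


def notify(mailer, user):
    """Code under test: the Mailer dependency is injected as an argument."""
    return mailer.send(user, "Your build passed!")


spy = SpyMailer()
assert notify(spy, "dev@example.com") is True
assert spy.calls == [("dev@example.com", "Your build passed!")]
```

Because notify receives its dependency as an argument, any of these doubles (or the real Mailer) can be swapped in without changing the code under test.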
People often call all test doubles “mocks,” and packages making it easy to implement test doubles are often called “mocking libraries.” This is unfortunate, as mocks should be the last kind of test double one reaches for.
Mocks can validate expected side effects (i.e., behaviors not reflected in return values or easily observable environmental changes) that other test doubles can’t. However, this binds them to implementation details that can render tests brittle in the face of implementation changes. Tests that overuse mocks in this way are often cited as a reason why people find test doubles and unit testing painful.
My favorite concrete, physical example of using a test double is using a practice amplifier to practice electric guitar:
You can practice in relative quiet, using something even as small as a Marshall micro amp, a Blackstar Fly 3, or a Mustang micro. You do this before playing with others or getting on stage to make sure you’ve got your own parts down. If anything sounds bad, you know it’s all your fault.
This is analogous to writing small tests, with the practice amp as the test double. You’re figuring out in near real time if your own performance meets expectations or needs fixing—without bothering anyone else.
You can rehearse with your band with a larger, louder amplifier, like a Marshall DSL40CR or Fender Mustang GTX100. This enables you to work out issues with other players before getting on stage. If you’ve practiced your parts enough and something sounds bad at this point, you know something’s wrong with the band dynamic.
This is analogous to writing medium tests, with the slightly larger amp still acting as a test double. You’re figuring out with your bandmates specific issues arising from working through the material together. You can start, stop, and repeat as often as necessary without burdening the audience.
You and the band can then run through a soundcheck on stage, making sure everything sounds good together while plugging into your Marshall stacks. Everyone else will be using their production gear, the full sound system, and the lighting rig, in the actual performance space. If the band is well rehearsed but something sounds wrong at this level, you know it’s specific to the integration of the entire system.
This is analogous to writing large tests. You’re using the real production dependencies and running the entire production system. However, this is still happening before the actual performance, giving you a chance to detect and resolve showstopper issues before the performance.
Finally, you play in front of the audience. Things can still go wrong, and you’ll have to adapt in the moment and discuss afterwards how to prevent repeat issues. However, after all the practicing, rehearsals, and soundchecks, relatively few things can still go wrong, and those that do are likely unique to actual performance situations.
This is analogous to shipping to production. You can’t expect perfection, and you may discover new issues not uncovered by previous practicing, rehearsing, and testing. However, you can focus on those relatively few remaining issues, since so many were prevented or resolved before this point.
Of course, there are more options than this. There’s nothing saying you couldn’t use any of these amplifiers in any other situation—you could use, say, the Fender Mustang GTX100 for everything. It can even plug directly into the mixing deck and emulate a mic’d cabinet. But hopefully the point of the analogy remains clear: The common interface gives you the freedom to swap implementations as you see fit.
The only question is, what kind of “test double” is a practice amplifier? Based on the definitions above, my money’s on calling it a “fake.” It’s a lighter weight implementation of the full production dependency, with the exact same interface, but without an interface for preprogramming responses.
Shoutout to Francisco Candalija for bringing contract and collaboration tests to my attention. He influenced how I now think and talk about medium/integration tests and my own “internal API” concept. (Some of the below I also described in an email to my former Google colleague Alex Buccino on 2022-12-23.)
Contract tests essentially answer the question: “Did something change that’s beyond my control, or did I screw something up?”
I like thinking of contract tests in this way rather than how Pact defines them, even though the Pact definition is very popular. Writing a contract test quickly using a special tool and calling it a day can provide a false sense of confidence. Such tests are prone to become brittle and flaky if one doesn’t consider how they support the overall architecture and testing strategy.
An “internal API” is a wrapper that’s kind of a superset of Proxy and Adapter from Design Patterns. It’s an interface you design within your project that translates an external (or complicated internal) dependency’s language and semantics into your own custom version. Using your own interface insulates the rest of your code from directly depending on the dependency’s interface.
One very common example is creating your own Database object that exposes your own “ideal” Database API to the rest of your app. This object encapsulates all SQL queries, external database API calls, logging, error handling, retry mechanisms, etc., in a single location. This obviates the need to pepper these details throughout your own code.
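As a sketch of what such a wrapper might look like, here is a hypothetical Database class in Python, backed by sqlite3 purely for illustration. The method names and schema are invented for this example:

```python
import sqlite3


class Database:
    """Hypothetical internal API: the rest of the app calls these methods and
    never touches SQL, connections, or DBAPI error handling directly."""

    def __init__(self, conn):
        # The connection is injected, so tests can substitute a test double.
        self._conn = conn

    def add_user(self, email):
        self._conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
        self._conn.commit()

    def user_exists(self, email):
        row = self._conn.execute(
            "SELECT 1 FROM users WHERE email = ?", (email,)).fetchone()
        return row is not None
```

Application code now depends only on Database. Because the connection is injected, small tests can pass in a test double, and swapping the underlying driver means changing this one class.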
What this means is:
- The internal API introduces a seam enabling you to write many more fast, stable, small tests for your application via dependency injection and test doubles. (Michael Feathers introduced the term “seam” in Working Effectively with Legacy Code.) This makes the code and the tests easier to write and to maintain, since all the tests no longer become integration tests by default.
- You do still need to test your API implementation against the real dependency—but now you have only one object to test using a medium/integration test. This would be your contract test.
- Any integration problems with a particular dependency are detected by one test, rather than triggering failures across the entire suite. This improves the signal-to-noise ratio while tightening the feedback loop, making it faster and easier to diagnose and repair the issue.
- The contract test makes sure any test doubles based on the same interface as the internal API wrapper are faithful to production. If a contract test fails in a way that invalidates your internal API, you’ll know to update your API and test doubles based on it.
- If you want to upgrade or even replace a dependency, you have one implementation to update, not multiple places throughout the code. This protects your system against revision or vendor lock-in.
- In fact, you can add an entirely new class implementing the same interface and configure which implementation to use at runtime. This makes it easy and safe to try the old and new implementations without major surgery or risk.
For all these reasons, combining internal APIs with contract tests makes your test suite faster, more reliable, and easier to maintain.
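One way to sketch this combination in Python: a shared set of contract assertions runs against both the production implementation (backed here by an in-memory sqlite3 database, purely for illustration) and the fake used by small tests. All names are hypothetical:

```python
import sqlite3
import unittest


class Database:
    """Internal API wrapping a real DBAPI connection."""
    def __init__(self, conn):
        self._conn = conn

    def add_user(self, email):
        self._conn.execute("INSERT INTO users (email) VALUES (?)", (email,))

    def user_exists(self, email):
        return self._conn.execute(
            "SELECT 1 FROM users WHERE email = ?", (email,)).fetchone() is not None


class FakeDatabase:
    """Test double used by small tests; must honor the same contract."""
    def __init__(self):
        self._emails = set()

    def add_user(self, email):
        self._emails.add(email)

    def user_exists(self, email):
        return email in self._emails


class DatabaseContract:
    """Shared assertions, mixed into a TestCase for each implementation."""
    def test_added_user_exists(self):
        self.db.add_user("a@example.com")
        self.assertTrue(self.db.user_exists("a@example.com"))
        self.assertFalse(self.db.user_exists("b@example.com"))


class RealDatabaseTest(DatabaseContract, unittest.TestCase):
    def setUp(self):
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE users (email TEXT)")
        self.db = Database(conn)


class FakeDatabaseTest(DatabaseContract, unittest.TestCase):
    def setUp(self):
        self.db = FakeDatabase()
```

If the real dependency’s behavior changes, only RealDatabaseTest fails, and the shared contract tells you exactly what to update in the fake to keep it faithful.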
I did this not long ago for some Python code that threw a DBAPI error in production every few days, locking up our server fleet:
I reproduced the bug, in which the system didn’t abort a failed transaction due to a dropped connection, blocking further operations.
I wouldn’t call the test “small” or “medium,” but “small-ish.” It was as small a contract test as you could get, and while it wasn’t super fast, it was quite quick.
I fixed the bug—and the test—by introducing a Database abstraction that implemented a rollback/reconnect/retry mechanism. The relatively small size, low complexity, and quick speed of the test enabled me to iterate quickly on the solution.
(I also set a one hour timeout on database connections. This alone might’ve resolved the problem, but it was worth adding the new abstraction that provably resolved the problem.)
I shipped the fix—and bye bye production error! I kept monitoring the logs and never saw it happen after that.
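A rough sketch of the kind of mechanism described above, with hypothetical names. (The real code caught specific DBAPI exception types rather than bare Exception, and dealt with Postgres-specific details.)

```python
class RetryingDatabase:
    """Sketch of a rollback/reconnect/retry wrapper: each operation is
    guarded so a dropped connection can't leave a failed transaction
    blocking all subsequent operations."""

    def __init__(self, connect, retries=2):
        self._connect = connect   # factory returning a fresh DBAPI connection
        self._retries = retries
        self._conn = connect()

    def execute(self, sql, params=()):
        for attempt in range(self._retries + 1):
            try:
                cursor = self._conn.execute(sql, params)
                self._conn.commit()
                return cursor
            except Exception:
                try:
                    self._conn.rollback()         # abort the failed transaction...
                except Exception:
                    self._conn = self._connect()  # ...or reconnect if even that fails
                if attempt == self._retries:
                    raise                         # out of retries; surface the error
```

Passing a connection factory (rather than a connection) is what makes reconnecting possible, and also what makes the class easy to exercise in a small-ish contract test.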
This contract test enabled me to define an internal Database API based on the Python DBAPI. The DBAPI ensures that the Database API can be reused—and tested—with different databases that conform to its specifications. The rest of our code, now using the new Database object, could be tested more quickly using test doubles. So long as the contract test passes, the test doubles should remain faithful substitutes. And if we wanted to switch from Postgres to another production database, likely none of our code would’ve had to change.
The contract test did require some subtle setup and comments explaining it. Still, dealing with one such test and object under test beats the hell out of dealing with one or more large system tests. And it definitely beats pushing a “fix” and having no idea whether it stands a chance of holding up in production! ↩
I deliberately avoid saying which specific proportion of test sizes is appropriate. The shape of the Test Pyramid implies that one should generally try to write more small tests, fewer medium tests, and relatively few large tests. Even so, it’s up to the team to decide, through their own experience, what the proportions should be to achieve the optimal balance for the project. The team should also reevaluate that proportion continuously as the system evolves, to maintain the right balance.
I also have scar tissue regarding this issue thanks to Test Certified. Intending to be helpful, we suggested a rough balance of 70% small, 20% medium, and 10% large as a general target. It was meant to be a rule of thumb, and a starting point for conversation and goal setting—not “The One True Test Size Proportion.” But OMG, the debates over whether those were valid targets, and how they were to be measured, were interminable. (Are we measuring individual test functions? Test binaries/BUILD language targets like cc_test? Googlers, at least back then, were obsessed with defining precise, uniform measurements for their own sake.)
On the one hand, lively, respectful, constructive debate is a sign of a healthy, engaged, dynamic community. However, this particular debate—as well as the one over the name “Test Certified”—seemed to miss the point, amounting to a waste of time. We just wanted teams to think about the balance of tests they already had and needed to achieve, and to articulate how they measured it. It didn’t matter so much that everyone measured in the exact same way, and it certainly didn’t matter that they achieve the same test ratios. It only mattered that the balance was visible within each individual project—and to the community, to provide inspiration and learning examples.
Consequently, while designing Quality Quest at Apple, we refrained from suggesting any specific proportion of test sizes, even as a starting point. The language of that program instead emphasized the need for each team to decide upon, achieve, and maintain a visible balance. We were confident that creating the space for the conversation, while offering education on different test sizes (especially smaller tests), would lead to productive outcomes. ↩
“Flaky” means that a test will seem to pass or fail randomly without a change in its inputs or its environment. A test becomes flaky when it’s validating behavior too specific for its scope, when it isn’t adequately controlling all of its inputs or environment, or both. Common sources of flakiness include system clocks, external databases, and external services accessed via REST APIs.
A flaky test is worse than no test at all. It conditions developers to spend the time and resources to run a test only to ignore its results. Actually, it’s even worse—one flaky test can condition developers to ignore the entire test suite. That creates the conditions for more flakiness to creep in, and for more bugs to get through, despite all the time and resources consumed.
In other words, one flaky test that’s accepted as part of Business as Usual marks the first step towards the Normalization of Deviance.
There are three useful options for dealing with a flaky test:
- If it’s a larger test trying to validate behavior too specific for its scope, relax its validation, replace it with a smaller test, or both.
- If what it’s validating is correct for its scope, identify the input or environmental factor causing the failure and exert control over it. This is one of the reasons test doubles exist.
- If you can’t figure out what’s wrong or fix it in a reasonable amount of time, disable or delete the test.
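The second option is where test doubles and injected inputs earn their keep. For example, a test that reads the system clock directly can pass or fail depending on when it runs; injecting the time brings that input under the test’s control. A minimal Python sketch, with an invented greeting function:

```python
import datetime


def greeting(now=None):
    """Accepting the time as a parameter is the seam: tests control the
    input instead of depending on when they happen to run."""
    now = now or datetime.datetime.now()
    return "Good morning" if now.hour < 12 else "Good afternoon"


# Flaky: the result depends on the wall clock at test time.
#   assert greeting() == "Good morning"

# Stable: the environmental input is under the test's control.
assert greeting(datetime.datetime(2023, 1, 23, 9, 0)) == "Good morning"
assert greeting(datetime.datetime(2023, 1, 23, 15, 0)) == "Good afternoon"
```

The same pattern applies to the other common sources of flakiness: wrap the database or REST client behind an injected interface, then substitute a controllable double.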
Retrying flaky tests is NOT a viable remedy. It’s a microcosm of the Arms Race as a whole. Think about it:
- Every time a flaky test fails, it’s consuming time and resources that could’ve been spent on more reliable tests.
- Even if a flaky test fails on every retry, people will still assume the test is unreliable, not their code, and will merge anyway.
- Increasing retries only consumes more resources while enabling people to continue ignoring the problem when they should either fix, disable, or delete the test.
- Bugs will still slip through, introduce risk, and create rework even after all the resources spent on retries.
The last thing you want to do with a flaky or otherwise consistently failing test is mark it as a “known failure.” This will only consume time and resources to run the test and complicate any reporting on overall test results.
Remember what tests are supposed to be there for: To let you know automatically that the system isn’t behaving as expected. Ignoring or masking failures undermines this function and increases the risk of bugs—and possibly even catastrophic system failure.
Assume you know that a flaky or failing test needs to be fixed, not discarded. If you can’t afford to fix it now, and you can still afford to continue development regardless, then disable the test. This will save resources and preserve the integrity of the unambiguous pass/fail signal of the entire test suite. Fix it when you have time later, or when you have to make the time before shipping.
Note I said “if you can still afford to continue development,” not “if you must continue development.” If you continue development without addressing problems you can’t afford to set aside, it will look like willful professional negligence should negative consequences manifest. It will reflect poorly on you, on your team, and on your company.
Also note I’m not saying all failures are necessarily worthy of stopping and fixing before continuing work. The danger I’m calling out is assuming most failures that aren’t quickly fixable are worth setting aside for the sake of new development by default. Such failures require a team discussion to determine the proper course of action—and the team must commit to a clear decision. The failure to have that conversation or to commit to that clear decision invites the Normalization of Deviance and potentially devastating risks. ↩
Frequent demos can be a very good thing—but not when making good demos is appreciated more than high internal software quality and sustainable development. ↩
I’ve called this concept of collecting signals to inform decision making “Vital Signs” because I believe “data-driven decision making” has lost its meaning. As often happens with initially useful innovations, the term “data-driven decision making” has become a buzzword. It’s a sad consequence of a misquote of W. Edwards Deming, an early pioneer of data-driven decision making, who actually said:
“It is wrong to suppose that if you can’t measure it, you can’t manage it—a costly myth.”
—The New Economics, Chapter 2, “The Heavy Losses”
Over time, this got perverted to “If you can’t measure it, you can’t manage it.” (The perversion is likely because people know him as a data advocate, and are ignorant of the subtlety of his views.)
Many who vocally embrace data-driven decision making today tend to put on a performance rather than apply the principle in good faith. They tend to want to let the data do the deciding for them, absolving themselves of the professional responsibility to thoughtfully evaluate opportunities and risks. It’s a ubiquitously accepted Cover Your Ass rationale, a shield offering protection from the expectation of ever having to take any meaningful action at all. It’s also a hammer used to beat down those who would take such action—especially new, experimental action lacking up-front evidence of its value. Even so, often “the data shows” that we should do nothing, or do something stupid or unethical. This holds even when other salient, if less quantifiable, signals urge action, or a different course of action.
As such, allegiance to “data-driven decision making” tends to encourage Groupthink and to produce obstacles to meaningful change. On the contrary, “Vital Signs” evokes a sense of care for a living system, and a sense of commitment to ensuring its continued health. It implies we can’t check a box to say we’ve collected the data and can take system quality and health for granted. We have to keep an eye on our system’s Vital Signs, and maintain responsibility for responding to them as required.
My visceral reaction arises from all the experiences I’ve had (using Crossing the Chasm terminology) with Late Majority members lacking courage and Laggards resisting change. I’ll grant that the Late Majority may err on the side of caution, and once they’re won over, they can become a force for good. But Laggards feel threatened by new ideas and try to use data, or the lack thereof, as a weapon. Then when you do produce data and other evidence, they want to move the goalposts.
The Early Majority is a different story altogether. I’ve had great experiences with Early Majority members who were willing to try a new approach to testing and quality, expecting to see results later. Once we made those results visible, it justified further investment. This is why it’s important to find and connect with the Early Majority first, and worry about the Late Majority later—and the Laggards never, really. ↩
I’m often asked if teams should always achieve 100% code coverage. My response is that one should strive for the highest code coverage possible. This could be 100%, but I wouldn’t worry about going to extreme lengths to get it. It’s better to achieve and maintain 80% or 90% coverage than to spend disproportionate effort to cover the last 10% or 20%.
That said, it’s important to stop looking at code coverage as merely a goal—use it as a signal that conveys important information. Code coverage doesn’t show how well tested the code is, but how much of the code isn’t exercised by small(-ish) tests at all.
So it’s important to understand clearly what makes that last 10% or 20% difficult or impractical to cover—and to decide what to do about it. Is it dead code? Or is it a symptom of poor design—and is refactoring called for? Is there a significant risk to leaving that code uncovered? If not, why keep it?
Another benefit to maintaining high coverage is that it enables continuous refactoring. The Individual skill acquisition section expands on this. ↩
As the linked page explains, the “R” in “MTTR” can also stand for “Repair,” “Recovery,” or “Respond.” However, I like to suggest “Resolve,” because it includes response, repair, recovery, and a full follow through to understand the issue and prevent its recurrence. ↩
SonarQube is a popular static analysis platform, but I’m partial to Teamscale, as I happen to know several of the CQSE developers who own it. They’re really great at what they do, and are all around great people. They provide hands-on coaching and support to ensure customers are successful with the system, which they’re constantly improving based on feedback. I’ve seen them in action, and they deeply understand that it’s the tool’s job to provide insight that facilitates ongoing conversations.
(No, they’re not paying me to advertise. I just really like the product and the people behind it.)
I also like to half-jokingly say Teamscale is like an automated version of me doing your code review—except it scales way better. The more the tool automatically points out code smells and suggests where to refactor, the more efficient and more effective code reviews become. ↩
I can’t remember where I got the idea, but it’s arguably better to develop a process manually before automating it. In this way, you carefully identify the value in the process, and which parts of it would most benefit from automation. If you start with automation, you’re not starting from experience, and people may resent having to use tools that don’t fit their actual needs. This applies whether you’re building or buying automation tools and infrastructure.
Of course, if you have past experience and existing, available tools, you can hit the ground running more quickly. The point is that it’s wasteful to wait for automation to appear when you could benefit from a process improvement now, even if it’s manual. ↩
These next two statements defining “culture” are my paraphrase of a concept I discovered from an Apple internal essay. ↩
The Crossing the Chasm model can be traced back to Everett Rogers’s Diffusion of innovations model from 1962. That model differentiated the five populations, but lacked a “chasm.” The chasm was added by Lee James and Warren Schirtzinger of Regis McKenna Inc., where Moore also worked.
Articles that dig into the Chasm’s history include:
The first article above presents a number of criticisms of the Crossing the Chasm model. Like criticisms of the Test Pyramid model, I think they split hairs and miss the point. Not because their points aren’t valid, but because they’re better presented as further refinements for consideration after grasping the concept, not criticisms of the model.
No model is perfect, but a good one is at least effective at bringing new people into the conversation. Once they’re in, and comfortable with the concepts and the language, we can point out nuances not captured by the model. But without the model, people may not gain access to the conversation to begin with. ↩
I’ve had people suggest that Laggards are actually the dominant population, comprising the actual majority. I remind them that it only seems that way—they’re the most vocal because they feel they have something to lose. Once both Majorities adopt an innovation, their voices lose power. ↩
Albert Wong, former Googler and member of the U.S. Digital Service. I saw his original model in his presentation on his early work as a member of the USDS, working with Citizenship and Immigration Services. In my mind, I instantly saw it snapping into the Chasm—and helping me make sense of the Google Testing Grouplet’s story.
I asked Albert if I could borrow the model, and he agreed. I also asked if he minded me giving it a funny name, and he didn’t.
The multicolored span of the model reminds me of a rainbow, and my weird sense of humor inspired me to pair it with an incongruous concept. Hence, “The Rainbow of Death.”
Two years after I started using the model, I realized how the concept of “Death” actually fits. The model helps explain how the problem you want to solve may not be the problem you have to solve first. To achieve that insight, old ideas about the problem and what’s required to solve it have to die to make room for new ideas.
For example, the Testing Grouplet wanted to improve automated testing and software quality—but we had to figure out how to sell others on it. We eventually realized we needed to do more than train new hires once, host tech talks, and give out books. We kept doing those things, but we couldn’t only continue putting information out there in the hopes that people would use it. We realized we needed to get people more directly engaged—leading to Testing on the Toilet, Test Certified, the Test Mercenaries, and a series of Fixits. Our work also influenced build and testing infrastructure development, culminating in the launch of the Test Automation Platform.
More to come in a following footnote… ↩
The “Revolution” was the third Google-wide testing Fixit I organized, helping set up the TAP (Test Automation Platform) Fixit two years later. This event introduced Google’s now-famous cloud-based build and testing infrastructure to projects across the company. I named it after my favorite Beatles tune (tied with “I Am the Walrus”), leading to spectacularly Beatles-themed announcements, advertisements, prizes, etc.
One of the neatest things was that, for weeks afterwards, I would hear people talking about “Revolutionizing” their builds. Even though not every project participated fully on the Fixit day, within a year, every project had migrated to the new infrastructure. I compare the before and after effects in Coding and Testing at Google, 2006 vs. 2011.
I never got around to blogging about either the Revolution or the TAP Fixit before I ran out of steam writing about Google in 2012. Time has passed, and many memories have faded, but I may yet try to share what I’m able one day. ↩
After developing the Rainbow of Death, I kept trying to use it as an answer key. I’d show it to people and expect them to “get it,” shortcut the exploration phase, get straight to implementation, and shave years off the process.
After hitting the wall for about the third time, at Apple, I eventually realized it wasn’t an answer key, but a blueprint. Yes, it can help trained experts understand what the finished structure looks like. However, it has to come together over time, with many adjustments along the way. You have to find and purchase a site, prepare it, put in the framing, then the electrical and plumbing infrastructure, and so on. You can’t have the bulldozers, construction workers, roofers, siders, painters, and interior decorators all start at the same time. And that assumes they’re all already knowledgeable about what they need to do, and already bought into doing it.
Spreading adoption of good automated testing practices has its own order of dependencies—and you have to provide education and secure buy-in as you go. Sharing the Rainbow of Death is fun and useful for existing Instigators, especially after completing the mission, showing how years of chaos converged into achievement. But it’s not the most effective tool for recruiting new Instigators and influencing the Early Majority. There really aren’t any shortcuts; it’s always going to take time.
In other words, my own idea about how to approach the problem using the Rainbow of Death needed to die, so new ideas could emerge. Specifically, I needed to set aside the complexity of the Rainbow of Death, and embrace the “focus and simplify” principle as a starting point instead. ↩
An Apple internal article used the example of Amundsen and Scott’s expeditions to the South Pole to illustrate the need to “focus and simplify.” Amundsen focused on getting there with the best sled dogs, succeeded on 1911-12-14, and survived. Scott tried a diversified approach, and did reach the South Pole on 1912-01-17, but he and his crew all died during the return trip.
Other articles outside Apple highlight other differences in the mindset and leadership styles between the two. Amundsen was adaptable to conditions beyond his control, learned from the wisdom of others, assembled the most skilled team possible, and paid attention to details. Scott didn’t heed the weather, was casual about team composition and details, and plowed ahead through sheer assertion of confidence. Like Feynman later warned, nature will not be fooled by public relations.
The Rainbow of Death presentation uses the model to describe how the Testing Grouplet built up its efforts over time, one step at a time. We did try many things, some in parallel, but we tended to establish one major program at a time before focusing on establishing another. Ironically, I then later tried to use the model in several organizations to launch a bunch of efforts in parallel from the very start.
Thankfully I finally learned my lesson at Apple, and got the Quality Culture Initiative to focus and simplify. First, we got our training program fully completed, launched, scheduled, and staffed. Then the internal podcast team got serious about publishing episodes more regularly. While I was focused on those things, another core QCI member got Quality Quest on a strong footing in his organization. We then merged it back into the QCI mainstream, allowing it to spread to other organizations.
After that, we began experimenting again with other projects, some sticking, some not so much. Whenever a project seemed to stall, we’d invoke our “focus and simplify” mantra and pour that focus into more productive areas. ↩
Of course Martin Fowler is famous for popularizing the term “refactoring” thanks to his book Refactoring: Improving the Design of Existing Code. He defines the term specifically thus (on the refactoring.com page):
“Refactoring is a disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior.
“Its heart is a series of small behavior preserving transformations. Each transformation (called a ‘refactoring’) does little, but a sequence of these transformations can produce a significant restructuring. Since each refactoring is small, it’s less likely to go wrong. The system is kept fully working after each refactoring, reducing the chances that a system can get seriously broken during the restructuring.”
The spirit of this is encapsulated by a famous tweet from Kent Beck:
“for each desired change, make the change easy (warning: this may be hard), then make the easy change”
Also note on the refactoring.com page that Martin specifically asserts that “Refactoring is a part of day-to-day programming” and goes on to describe how. In the book, Martin gives this advice to those who still feel they need to ask for permission before refactoring anything:
“Of course, many managers and customers don’t have the technical awareness to know how code base health impacts productivity. In these cases, I give my most controversial advice: Don’t tell!
“Subversive? I don’t think so. Software developers are professionals. Our job is to build effective software as rapidly as we can. My experience is that refactoring is a big aid to building software quickly.”
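To make the “series of small behavior preserving transformations” concrete, here’s a minimal sketch of one such transformation—Fowler’s “Extract Function”—applied to hypothetical pricing code (the names and discount rule are invented for illustration, not taken from the book):

```python
def total_price_before(items):
    # Before: the discount rule is buried inside the loop.
    total = 0.0
    for price, qty in items:
        subtotal = price * qty
        if subtotal > 100:  # bulk discount rule, entangled with iteration
            subtotal *= 0.9
        total += subtotal
    return total


def discounted_subtotal(price, qty):
    # After: the rule extracted into a small, named, independently testable function.
    subtotal = price * qty
    return subtotal * 0.9 if subtotal > 100 else subtotal


def total_price_after(items):
    # External behavior is unchanged; only the internal structure differs.
    return sum(discounted_subtotal(price, qty) for price, qty in items)
```

The transformation is tiny and easy to verify—existing tests pin the behavior, so both versions must produce identical results—which is exactly why a long sequence of such steps can safely produce a significant restructuring.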
Remember from earlier that Apple’s goto fail bug was hidden by six copies of the same algorithm in the same file. To see how unit testing discipline could’ve caught or prevented this by discouraging duplication, see my article “Finding More Than One Worm in the Apple.” This example also illustrates that the Don’t Repeat Yourself (DRY) principle isn’t a purely academic concern.
There’s a school of thought that suggests duplication is OK before landing on the correct abstraction. I consider this dangerous advice, because it’s so easily misunderstood and used to justify low standards. Programmers are notorious for taking shortcuts in code quality in order to move on to the next new thing to work on. They’re also notorious for using any available rationale to justify this behavior, often disparaging more thoughtful approaches as “religion.” (Not that some can’t get carried away in the opposite direction—but it’s more common to find programmers attacking “religion” than programmers who are certifiable zealots.)
I understand the utility of duplicating bits of code in one’s private workspace while experimenting with a new change. However, I think the fear of the potential costs of premature abstraction is overblown. The far, far greater danger is that of “experimental” duplication getting shipped, leading to hesitation to change shipping code. Instead of the “hasty abstraction” getting baked in, dangerous duplication gets baked in instead.
After all, a premature abstraction should prove straightforward to reverse. Working with it should quickly reveal its shortcomings, which suggest refactoring it or breaking it apart in favor of duplicating its code for some reason. If it wasn’t premature, then making changes to the only copy is less time consuming and error prone than having to update multiple copies.
Replacing duplication with a suitable abstraction after the fact should be easy, but it gives cover to potentially unnoticed bugs in the meanwhile. Again, goto fail illustrates how easy it is to miss bugs in duplicate code. Once you’ve seen the first copy, the rest tend to look the same, even if they’re not. Our brains are so eager to detect and match patterns that they trick us into skipping over critical details when we’re not careful. (I believe this is because we process duplicate code with “System 1” thinking instead of more expensive “System 2” thinking, per Thinking, Fast and Slow.) ↩
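A hypothetical sketch of the pattern: the same validation check copy-pasted into two handlers means a fix applied to one copy silently misses the other—the goto fail failure mode. Sharing a single function removes that risk (all names here are invented for illustration):

```python
MAX_LEN = 1024


def validate_length(payload: bytes) -> bool:
    # Single shared copy of the check: one fix, one test, covers every caller.
    # Were this pasted into each handler, a bug fixed in one copy could
    # live on unnoticed in the others.
    return 0 < len(payload) <= MAX_LEN


def handle_upload(payload: bytes) -> bool:
    return validate_length(payload)


def handle_sync(payload: bytes) -> bool:
    return validate_length(payload)
```

One small test of `validate_length` now exercises the logic behind every call site, instead of requiring a separate (and easily forgotten) test per duplicate.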
We all know a 50 line code change is generally much faster and easier to review than a 500 line change. (500 lines of new or changed behavior, that is—500 lines of search and replace or deleted code is different.) Encouraging smaller reviews encourages decomposing larger changes into a series of smaller ones that can be independently tested, reviewed, and merged. This enables more thorough reviews, faster and more stable tests, and higher long term code quality and maintainability.
Even so, some hold onto the dated belief that one should submit entire feature changes at once to avoid “dead code.” The thinking, I suppose, is that one risks introducing unused code if a larger change is introduced one piece at a time. The value judgment seems to be that unused code is a greater risk to quality than, say, poorly tested code.
This, however, increases the risk of checking in “deadly code” that contains a bug that could harm users in some way. This is because larger changes are generally more difficult to test and review thoroughly. Mandating ill-advised all-at-once changes to compensate for poor design sense, poor communication, poor code quality, and poor process can’t overcome those issues. In fact, it all but guarantees their perpetuation. ↩
Of course, you’ll hear people make some variation of the excuse “It’s just test code” for writing sloppy tests. However, if the tests are there to ensure the quality and readiness of the production code, then the tests are part of our production toolchain. If a test fails, it should halt production releases until we’ve aligned the reality of the system’s behavior with our expectations (like Toyota’s andon cord). If a failure doesn’t warrant a halt in production, the test is a waste of resources (including precious developer attention) and should be removed. As such, our tests deserve as much respect and care as any other part of our value-creating product or infrastructure. ↩
“To me, legacy code is simply code without tests.”
—Preface, p. xvi
His rationale, from the same page:
“Code without tests is bad code. It doesn’t matter how well written it is; it doesn’t matter how pretty or object-oriented or how well-encapsulated it is. With tests, we can change the behavior of our code quickly and verifiably. Without them, we really don’t know if our code is getting better or worse.”
Feathers further explains that “seams” are where we can change the behavior of code without changing the code itself. There are three kinds of seams:
- Preprocessor seams use #define macros to rewrite the code in languages that use the C preprocessor.
- Link seams use a static or dynamic linker, or the runtime loader, to change how a program binary is built or run. Examples include manipulating CLASSPATH environment variables (or their equivalents in other languages’ build and runtime environments).
- Polymorphic seams rely upon dependency injection to build an object graph at runtime. This allows the program itself to choose which implementations to include—such as test programs using test doubles to emulate production dependencies.
Polymorphic seams are the most common and most flexible kind, as well as the first one we reach for to write testable code. The term is essentially synonymous with “dependency injection.” Preprocessor and link seams aren’t as flexible, scalable, or easy to use, but can work if you have no reasonable opportunity to introduce polymorphic seams.
Note that using any seam successfully depends on the quality of the interface that defines it. The upcoming Scott Meyers quote speaks to that.
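As a minimal sketch of a polymorphic seam (the names—Clock, Greeter, FakeClock—are hypothetical, not from Feathers’s book): the constructor parameter is the seam, letting a test change the object’s behavior without changing its code.

```python
import datetime
from typing import Protocol


class Clock(Protocol):
    # The interface that defines the seam.
    def now(self) -> datetime.datetime: ...


class SystemClock:
    # Production implementation.
    def now(self) -> datetime.datetime:
        return datetime.datetime.now()


class FakeClock:
    # Test double injected through the seam; production code is untouched.
    def __init__(self, fixed: datetime.datetime):
        self._fixed = fixed

    def now(self) -> datetime.datetime:
        return self._fixed


class Greeter:
    # The constructor parameter is the polymorphic seam.
    def __init__(self, clock: Clock):
        self._clock = clock

    def greeting(self) -> str:
        return "Good morning" if self._clock.now().hour < 12 else "Good afternoon"
```

A test can now pin time-dependent behavior deterministically—e.g., `Greeter(FakeClock(datetime.datetime(2023, 1, 1, 9, 0)))` always says “Good morning”—where the production program injects `SystemClock` instead.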
I first started using electrical outlets as an example in Automated Testing—Why Bother?:
First, we need to understand the fundamental building block of testable code: Abstractions, as defined by interfaces. We create abstractions every time we write a class interface, or a module interface, or an application programming interface. And these abstractions perform two powerful functions:
- They define a contract such that certain inputs will produce certain outputs and side-effects.
- They provide seams between system components that allow for isolation between components.
My favorite example of a powerful interface boundary is an electrical outlet. The shape of the outlet defines a contract between the power supplier and the power consumer, which remain thoroughly isolated from one another beyond the scope of that physical boundary.21 It’s easier to reason about both sides of the interface than if the consumer was wired directly into the source.
In software, problems arise when we fail to consider one or the other of these functions, when either the contract isn’t rigorously defined and understood, or when the interfaces don’t permit sufficient isolation between components. This often happens when we fail to design our interfaces intentionally.
In contrast, the more intentional our interfaces, the more natural our abstractions and seams. Automated testing obviously serves to validate the constraints of an interface contract; but the process of writing thorough, readable, reliable tests also encourages intentional interface design. “Testable” interfaces that enable us to exercise sufficient control in our tests tend to be “good” interfaces, and vice versa. In this way, testability forces a host of other tangible benefits, such as readability, composability, and extensibility.
Basically, “testable” code is often just easy to work with!
21 For a list of electrical plug and outlet specs used across the world, see: https://www.worldstandards.eu/electricity/plugs-and-sockets/
Put more concretely: the power source could be anything from coal to wind, hydro, solar, or a hamster wheel. The consumer could be a lamp, a computer, or a wall of Marshall stacks. The shape of the outlet should ensure the voltage and amperage matches such that neither side cares what’s on the other—it all just works! A fault, failure, or other problem on one side won’t usually damage the other, either. This is especially true given common safety infrastructure such as surge protectors, fuses, and circuit breakers. Plus, you can use an electrical outlet tester as a test double to detect potential wiring issues.
It also greatly simplifies debugging (also sampled from my 2022-12-23 email with Alex Buccino):
If a plugged in device stops working, but the lights are still on in your house/building, you can check a few things yourself. You can see if it’s unplugged, if a switch was flipped, if a fuse/breaker blew, or if the device itself is faulty. You can pinpoint and fix most of these issues quickly, with no need to worry about the electrical grid.
However, if all the lights went off in your house at the same time, the problem’s beyond your control. Unless you work for the electric company, you should be able to trust that the company will send a crew to resolve the issue shortly.
Were the device wired into the electrical system directly, however, your debugging and resolution would be more costly and risky. Also, the delineation of responsibility between yourself and the electric company might not be as clear.
The common electrical outlet is a remarkably robust interface that unleashes enormous productivity every day—imagine if software in general was even remotely as reliable! ↩
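The outlet analogy maps directly onto code. In this hypothetical sketch (PowerSource, Lamp, and outlet_tester are invented names), the contract is the interface: any supplier satisfying it powers any consumer, neither side knows about the other, and an “outlet tester” checks the contract without a real appliance:

```python
from typing import Protocol


class PowerSource(Protocol):
    # The "shape of the outlet": the contract both sides agree to.
    def volts(self) -> float: ...


class Grid:
    def volts(self) -> float:
        return 120.0


class SolarPanel:
    def volts(self) -> float:
        return 118.0


class Lamp:
    # The consumer depends only on the contract, never on the supplier.
    def __init__(self, source: PowerSource):
        self._source = source

    def is_lit(self) -> bool:
        return 110.0 <= self._source.volts() <= 130.0


def outlet_tester(source: PowerSource) -> bool:
    # Like a plug-in outlet tester: verifies the contract in isolation.
    return 110.0 <= source.volts() <= 130.0
```

Swapping `Grid` for `SolarPanel` (or a hamster wheel) changes nothing on the consumer side—the isolation that makes both halves easy to reason about, test, and debug independently.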
This is a paraphrase of a similar statement by my former colleague Max Goldstein. ↩