Mike Bland

The Inverted Test Pyramid

Many projects have too many large, slow, flaky tests and few smaller ones. Retrying failed tests and marking known failures introduce risk and waste. Examining the root causes is essential to breaking the cycle.


Tags: Making Software Quality Visible, Test Pyramid, testing

This sixteenth post in the Making Software Quality Visible series describes the common Inverted Test Pyramid testing strategy, its consequences, and its causes.

I’ll update the full Making Software Quality Visible presentation as this series progresses. Feel free to send me feedback, thoughts, or questions via email or by posting them on the LinkedIn announcement corresponding to this post.

Continuing after Why the Chain Reaction hasn’t happened everywhere already and the main narrative thread from The Test Pyramid and the Chain Reaction.

Many larger tests, few smaller tests

At the end of The Test Pyramid from The Test Pyramid and the Chain Reaction, I mentioned that:

Each test size validates different properties that would be difficult or impossible to validate using other kinds of tests. Adopting a balanced testing strategy that incorporates tests of all sizes enables more reliable and efficient development and testing—and higher software quality, inside and out.

Of course, many projects have a testing strategy that resembles an inverted Test Pyramid, with too many larger tests and not enough smaller tests.

(Figure: The Inverted Test Pyramid, representing too many larger tests, not enough smaller tests, and the complexity, risk, waste, and suffering it produces.)

This leads to a number of common problems:

  • Tests tend to be larger, slower, less reliable
    Relying mostly on large tests makes the whole suite slower and less reliable than it would be with a greater proportion of smaller tests.

  • Broad scope makes failures difficult to diagnose
    Because large tests execute so much code, it can be difficult to tell which part of the system caused a failure.

  • Greater context switching cost to diagnose/repair failure
    That means developers have to interrupt their current work to spend significant time and effort diagnosing and fixing any failures.

  • Many new changes aren’t specifically tested because “time”
    Since most of the tests are large and slow, developers are incentivized to skip writing or running them because they “don’t have time.”

  • People ignore entire signal due to flakiness…
    Worst of all, since large tests are more prone to flakiness, people begin to ignore test failures in general. They won’t believe their changes caused any failures, since the tests were failing before; the failures might even be flagged as “known failures.” And as I mention elsewhere…

  • …fostering the Normalization of Deviance
    …the Space Shuttle Challenger’s O-rings suffered from “known failures” as well, cultivating the “Normalization of Deviance” that led to disaster.

Flaky tests and how to handle them

(This section appears as a footnote in the original.)

“Flaky” means that a test will seem to pass or fail randomly without a change in its inputs or its environment. A test becomes flaky when it’s either validating behavior too specific for its scope, or isn’t adequately controlling all of its inputs or environment—or both. Common sources of flakiness include system clocks, external databases, and external services accessed via REST APIs.
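To make that concrete, here’s a minimal, hypothetical sketch in Python. generate_daily_report and the URL are illustrative stand-ins, not code from any real project; both tests are flaky because they depend on inputs they don’t control:

    import datetime
    import urllib.request


    def generate_daily_report():
        # Hypothetical code under test: reads the system clock directly,
        # which is what makes the test below flaky.
        return f"Daily report for {datetime.date.today()}"


    def test_report_includes_todays_date():
        # Flaky: both the code and the test read the real system clock.
        # Run it just before midnight and the two reads may straddle the
        # date change, so the assertion fails with no change to anything.
        assert str(datetime.date.today()) in generate_daily_report()


    def test_fetches_latest_exchange_rate():
        # Flaky: depends on a live external service (placeholder URL here)
        # being up, reachable, and responding before any timeout.
        response = urllib.request.urlopen("https://api.example.com/rates/USD")
        assert response.status == 200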

A flaky test is worse than no test at all. It conditions developers to spend the time and resources to run a test only to ignore its results. Actually, it’s even worse—one flaky test can condition developers to ignore the entire test suite. That creates the conditions for more flakiness to creep in, and for more bugs to get through, despite all the time and resources consumed.

In other words, one flaky test that’s accepted as part of Business as Usual marks the first step towards the Normalization of Deviance.

There are three useful options for dealing with a flaky test:

  1. If it’s a larger test trying to validate behavior too specific for its scope, relax its validation, replace it with a smaller test, or both.
  2. If what it’s validating is correct for its scope, identify the input or environmental factor causing the failure and exert control over it. This is one of the reasons test doubles exist (see the sketch after this list).
  3. If you can’t figure out what’s wrong or fix it in a reasonable amount of time, disable or delete the test.
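To make option 2 concrete, here is a minimal sketch in Python of the clock-dependent test above, reworked to control its input with a test double. FixedClock and generate_daily_report are hypothetical stand-ins, not code from any particular project:

    import datetime


    class FixedClock:
        """Test double that always reports the same date."""
        def __init__(self, today):
            self._today = today

        def today(self):
            return self._today


    def generate_daily_report(clock):
        # Hypothetical code under test, now accepting its clock as a
        # parameter instead of reading the system clock directly.
        return f"Daily report for {clock.today()}"


    def test_report_includes_todays_date():
        # The test controls its input, so it no longer depends on the real
        # system clock and can never fail because of it.
        clock = FixedClock(datetime.date(2023, 5, 1))
        assert "2023-05-01" in generate_daily_report(clock)

The same approach applies to the external-service example: replace the live REST call with a fake or stub that returns a canned response, so the test validates your code’s behavior rather than the network’s.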

Retries are ineffective and wasteful

(This section appears as part of the previous footnote in the original.)

Retrying flaky tests is NOT a viable remedy. It’s a microcosm of what I call in the full presentation the “Arms Race” mindset. Think about it:

  • Every time a flaky test fails, it’s consuming time and resources that could’ve been spent on more reliable tests.
  • Even if a flaky test fails on every retry, people will still assume the test is unreliable, not their code, and will merge anyway.
  • Increasing retries only consumes more resources while enabling people to continue ignoring the problem when they should either fix, disable, or delete the test.
  • Bugs will still slip through, introduce risk, and create rework even after all the resources spent on retries.

Known failures are known risk and waste

(This section appears as a footnote in the original.)

The last thing you want to do with a flaky or otherwise consistently failing test is mark it as a “known failure.” Doing so still consumes time and resources to run a test whose result everyone ignores, and it complicates any reporting on overall test results.

Remember what tests are supposed to be there for: To let you know automatically that the system isn’t behaving as expected. Ignoring or masking failures undermines this function and increases the risk of bugs—and possibly even catastrophic system failure.

Assume you know that a flaky or failing test needs to be fixed, not discarded. If you can’t afford to fix it now, and you can still afford to continue development regardless, then disable the test. This will save resources and preserve the integrity of the unambiguous pass/fail signal of the entire test suite. Fix it when you have time later, or when you have to make the time before shipping.
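If the project happens to use pytest, for example, disabling a test while recording why might look like this sketch (the test name and reason text are illustrative):

    import pytest


    @pytest.mark.skip(
        reason="Flaky: depends on a live exchange-rate service. Disabled "
               "until it can be rewritten against a test double."
    )
    def test_fetches_latest_exchange_rate():
        ...

The disabled test then shows up explicitly as skipped in every run, rather than eroding trust in the rest of the suite’s pass/fail signal.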

Note I said “if you can still afford to continue development,” not “if you must continue development.” If you continue development without addressing problems you can’t afford to set aside, it will look like willful professional negligence should negative consequences manifest. It will reflect poorly on you, on your team, and on your company.

Also note I’m not saying all failures are necessarily worthy of stopping and fixing before continuing work. The danger I’m calling out is assuming most failures that aren’t quickly fixable are worth setting aside for the sake of new development by default. Such failures require a team discussion to determine the proper course of action—and the team must commit to a clear decision. The failure to have that conversation or to commit to that clear decision invites the Normalization of Deviance and potentially devastating risks.

Causes

Let’s go over some of the reasons behind this situation.

  • Features prioritized over internal quality/tech debt
    People are often pressured to continue working on new features that are “good enough” instead of reducing technical debt. This may be especially true for organizations that set aggressive deadlines and/or demand frequent live demonstrations.

    Frequent demos can be a very good thing—but not when making good demos is appreciated more than high internal software quality and sustainable development.

  • “Testing like a user would” is more important
    Again, if “testing like a user would” is valued more than other kinds of testing, then most tests will be large and user interface-driven.

  • Reliance on more tools, QA, or infrastructure (Arms Race)
    This also tends to instill the mindset that the testing strategy isn’t a problem, but that we always need more tools, infrastructure, or QA headcount. I call this the “Arms Race” mindset.

  • Landing more, larger changes at once because “time”
    Because the existing development and testing process is slow and inefficient, individuals try to optimize their productivity by integrating large changes at once. These changes are unlikely to receive either sufficient testing or sufficient code review, increasing the risk of bugs slipping through. It also increases the chance of large test failures that aren’t understood. The team is inclined to tolerate these failures, because there isn’t “time” to go back and redo the change the right way.

  • Lack of exposure to good examples or effective advocates
    As mentioned before, many people haven’t actually witnessed or experienced good testing practices before, and no one is advocating for them. This instills the belief that the current strategy and practices are the best we can come up with.

  • We tend to focus on what we directly control—and what management cares about! (Groupthink)
    In such high stress situations, it’s human nature to focus on doing what seems directly within our control in order to cope. Alternatively, we tend to prioritize what our management cares about, since they have leverage over our livelihood and career development. It’s hard to break out of a bad situation when feeling cornered—and too easy to succumb to Groupthink without realizing it.

So how do we break out of this corner—or help others to do so?

Coming up next

To answer this question, we’ll reconsider the challenge of making quality work and its results visible, then introduce “Vital Signs” as a comprehensive visibility approach.