Mike Bland

The Test Pyramid in Action

Copyright 2024, licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

Abstract

This introductory-level training makes core automated testing concepts accessible to programmers who have had little to no practical experience with writing (good) tests. It contains a complete working example system demonstrating the Test Pyramid strategy, incorporating Roy Osherove’s String Calculator kata as a hands-on beginners’ exercise.

Slides and example repository

Status

I’m currently fleshing out this script, after which I’ll update the documentation and comments in the example project. Aside from that, the slide deck and example project are pretty much complete.

Table of contents

Introduction

I’m Mike Bland, and I’m going to talk about writing different “sizes” of automated tests to validate different properties of a software system. This strategy is commonly described using the Test Pyramid model, and we’ll see what this looks like using a complete web application as an example.

About me

mike-bland.com, github.com/mbland

Here’s a little bit of my background to illustrate how I came to care about this so much.

  • Taught myself unit testing at Northrop Grumman
    I “accidentally” discovered unit testing while working on shipboard navigation software at Northrop Grumman Mission Systems.

  • Led the Testing Grouplet at Google
    I became one of the leaders of the Testing Grouplet, a volunteer group that made automated testing an indispensable part of Google’s software development culture.

  • Led the Quality Culture Initiative at Apple
    For four years, I was the original leader of the Quality Culture Initiative, a similar volunteer group at Apple that continues to go strong today.

  • Wrote "Making Software Quality Visible"
    Recently I produced an extensive presentation called Making Software Quality Visible, a distillation of my career experience in coding, testing, and driving adoption.

Agenda

Today’s talk is a further distillation to introduce you to some of the core concepts, practices, and infrastructure supporting good automated testing practices.

  • The Test Pyramid and Key Testing Concepts
    I’ll give a high-level overview of the Test Pyramid model and some of the key concepts supporting it that we’ll see in the example.

  • Overview of Example Architecture and Technology
    We’ll walk through the structure of the example system briefly and I’ll list the tools involved and call out a couple of key features.

  • Test-Driven Development with the String Calculator Kata
    We’ll cover the very basics of Test-Driven Development with a brief demo and a brief hands-on exercise to give you a feel for the process.

  • Really Tying the Room Together
    Then we’ll end by discussing some of the goals and benefits of TDD and the Test Pyramid, and some extra guidance on effective automated testing.

If you have any questions or comments, feel free to raise them at any time. If I don’t have time to respond immediately, we can save the topic for discussion at the end or afterwards.

The Test Pyramid and Key Testing Concepts

Let’s start reviewing some of the fundamental concepts you’re going to see in action shortly. These are essential to understanding the example project and the testing demo and exercise.

What is automated testing?

Would anyone like to define what automated testing is?

[ WAIT FOR RESPONSES ]

This is how I like to define it. It’s:

The practice of writing programs to verify that our code and systems conform to expectations—i.e., that they fulfill requirements and make no incorrect assumptions.

Note that I didn’t say “conform to requirements.”

Image of an iceberg showing "Requirements" as the tip and
"Assumptions" as the bottom, representing "Expectations" as
the sum of both
Base image: Iceberg (from a site that no longer seems to exist)

Most of us usually validate that our code meets requirements, but often fall short of challenging the assumptions also embodied within our code.

Automated tests are reusable

The upfront cost to writing them pays off quickly and over time

The killer feature of well-written automated tests is that they’re automated, meaning they can run over and over, all the time. You don’t have to make time to test things manually, or wait for someone else to do so. If you need to run the tests, just run them, or set them up to run automatically.

Of course, you do need to make the time to learn to write good tests, and to actually write them. As with any new practice, there is a learning curve and an investment of time and effort to get started. However, the payoff usually comes fairly quickly, and good tests keep paying off over time.

Before I reveal a few properties of automated testing that enable this payoff, does anyone want to guess what they might be?

[ WAIT FOR RESPONSES ]

  • They’re an essential ingredient of continuous integration pipelines
    Without automated tests, code submitted to your CI pipeline may build, and even run, but you may miss tons of preventable bugs creeping in. Then you’re stuck figuring out which change introduced a problem, and when, and why, instead of catching it immediately.

  • They take the burden off of human reviewers, QA, users to catch everything
    The whole point of continuous integration and automated testing is to keep humans from getting burned out or becoming scared to change anything.

  • When written well, they’re fast, reliable, and pinpoint specific problems
    Good tests not only enable humans to avoid toil and fear when they consistently pass, but also when they fail for good reason. They’re like production alerts pointing us at a specific problem from a specific code change.

  • They help increase velocity, reduce risk, and enable frequent merges by ensuring correctness automatically
    Because good tests run fast, reliably, and pinpoint specific problems, that means we humans can do more interesting work at a faster pace. We don’t fear change, and can make more changes faster when we trust our tests.

  • You spend more time delivering value, less time firefighting, debugging, fixing
    The ultimate result is that we feel more productive and more satisfied in our work, because we aren’t chasing down preventable problems all the time. This also makes us more valuable to the company, which should be good for our careers.

Questions or comments?

[ WAIT FOR RESPONSES ]

All this raises the question, though, of what makes for a “good” automated test.

The Test Pyramid

The Test Pyramid, representing a balance of tests of different sizes for different purposes.

The Test Pyramid model can help us do a better job of validating our expectations by helping us write more effective automated tests.

Has anyone heard of the Test Pyramid before? If so, how would you describe it?

[ WAIT FOR RESPONSES ]

I like to describe it as:

A balance of tests of different sizes for different purposes.

It’s not a perfect model—no model is—but it’s an effective tool for starting a productive conversation about testing strategy.

Detailed breakdown of the Test Pyramid, showing how the properties of each test size render them all essential to an efficient, effective testing strategy.

The same information as above, but in a scrollable HTML table:

Size | Scope | Ownership | Code visibility | Dependencies | Control/Reliability/Independence | Resource usage/Maint. cost | Speed/Feedback loop | Confidence
Large (System, E2E) | Entire system | QA, some developers | Details not visible | All | Low | High | Slow | Entire system
Medium (Integration) | Components, services | Developers, some QA | Some details visible | As few as possible | Medium | Medium | Faster | Contract between components
Small (Unit) | Functions, classes | Developers | All details visible | Few to none | High | Low | Fastest | Low-level details, individual changes

It helps us understand how different kinds of tests give us confidence in different levels and properties of the system. It can also help us break the habit of writing large, expensive, flaky tests by default.

  • Small tests are unit tests that validate only a few functions or classes at a time, with very few dependencies, if any. They often use test doubles in place of production dependencies to control the environment, making the tests very fast, independent, reliable, and cheap to maintain. Their tight feedback loop enables developers to quickly detect and repair problems that would be more difficult and expensive to catch with larger tests. They can also be run in local and virtualized environments and can be parallelized.

  • Medium tests are integration tests that validate contracts and interactions with external dependencies or larger internal components of the system. While not as fast or cheap as small tests, by focusing on only a few dependencies, developers or QA can still run them somewhat frequently. They detect specific integration problems and unexpected external changes that small tests can’t, and can do so more quickly and cheaply than large system tests. Paired with good internal design, these tests can ensure that test doubles used in small tests remain faithful to production behavior.

  • Large tests are full, end-to-end system tests, often driven through user interface automation or a REST API. They’re the slowest and most expensive tests to write, run, and maintain, and can be notoriously unreliable. For these reasons, writing large tests by default for everything is especially problematic. However, when well designed and balanced with smaller tests, they cover important use cases and user experience factors that aren’t covered by the smaller tests.

Thoughtful, balanced strategy == Reliability, efficiency

Each test size validates different properties that would be difficult or impossible to validate using other kinds of tests. Adopting a balanced testing strategy that incorporates tests of all sizes enables more reliable and efficient development and testing—and higher software quality, inside and out.

Any questions before we move on?

[ WAIT FOR RESPONSES ]

Inverted Test Pyramid

Thinking about test sizes helps avoid writing too many larger tests and not enough smaller tests, a common antipattern known as the Inverted Test Pyramid.

The Inverted Test Pyramid, representing too many larger tests, not enough smaller tests, and the complexity, risk, waste, and suffering it produces.

This leads to a number of common problems:

  • Tests tend to be larger, slower, less reliable
    The tests are slower and less reliable than they would be if the suite relied more on smaller tests.

  • Broad scope makes failures difficult to diagnose
    Because large tests execute so much code, it might not be easy to tell what caused a failure.

  • Greater context switching cost to diagnose/repair failure
    That means developers have to interrupt their current work to spend significant time and effort diagnosing and fixing any failures.

  • Many new changes aren’t specifically tested because “time”
    Since most of the tests are large and slow, developers are incentivized to skip writing or running them because they “don’t have time.”

  • People ignore entire signal due to flakiness…
    Worst of all, since large tests are more prone to be flaky,[1] people will begin to ignore test failures in general. They won’t believe their changes cause any failures, since the tests were failing before—they might even be flagged as “known failures.”[2]

  • …fostering the Normalization of Deviance
    This cultivates a phenomenon called the “Normalization of Deviance.”

Normalization of Deviance

Coined by Diane Vaughan in The Challenger Launch Decision

Diane Vaughan introduced this term in her book about the Space Shuttle Challenger explosion in January 1986.[3] My paraphrased version of the definition is:

A gradual lowering of standards that becomes accepted, and even defended, as the cultural norm.

Explosion of the Space Shuttle Challenger shortly after takeoff on January 28,
1986.
The explosion of the Space Shuttle Challenger shortly after takeoff on January 28, 1986 exposed the potentially deadly consequences of common organizational failures.
Image from https://commons.wikimedia.org/wiki/File:Challenger_explosion.jpg. In the public domain from NASA.

She derived this concept from the fact that nearly everyone ignored performance deviations in the rocket booster O-rings from 14 of the previous 17 missions. Those who did express concern were pressured and overruled, since delaying a launch is very expensive and nothing bad had yet happened.

You may not work on Space Shuttle software, but the same principle applies. A culture of ignoring problems often leads to much bigger problems.

Inverted Test Pyramid

Causes

Let’s go over some of the reasons why the Inverted Test Pyramid and Normalization of Deviance often take shape in an organization.

  • Features prioritized over internal quality/tech debt
    People are often pressured to continue working on new features that are “good enough” instead of reducing technical debt. This may be especially true for organizations that set aggressive deadlines and/or demand frequent live demonstrations.[4]

  • “Testing like a user would” is more important
    Again, if “testing like a user would” is valued more than other kinds of testing, then most tests will be large and user interface-driven.

  • Reliance on more tools, QA, or infrastructure (Arms Race)
    This also tends to instill the mindset that the testing strategy isn’t a problem, but that we always need more tools, infrastructure, or QA headcount. I call this the “Arms Race” mindset.

  • Landing more, larger changes at once because “time”
    Because the existing development and testing process is slow and inefficient, individuals try to optimize their productivity by integrating large changes at once. These changes are unlikely to receive either sufficient testing or sufficient code review, increasing the risk of bugs slipping through. It also increases the chance of large test failures that aren’t understood. The team is inclined to tolerate these failures, because there isn’t “time” to go back and redo the change the right way.

  • Lack of exposure to good examples or effective advocates
    As mentioned before, many people haven’t actually witnessed or experienced good testing practices before, and no one is advocating for them. This instills the belief that the current strategy and practices are the best we can come up with.

  • We tend to focus on what we directly control—and what management cares about! (Groupthink)
    In such high stress situations, it’s human nature to focus on doing what seems directly within our control in order to cope. Alternatively, we tend to prioritize what our management cares about, since they have leverage over our livelihood and career development. It’s hard to break out of a bad situation when feeling cornered—and too easy to succumb to Groupthink without realizing it.

The point is that merely automating our tests isn’t enough; we have to be thoughtful about automating them the right way. The sooner we adopt good coding and testing habits, the greater our chances of avoiding these negative outcomes.

Any questions?

[ WAIT FOR RESPONSES ]

Key Concepts

Principles, practices, and shared language

Even though we now know that no one test size fits all, there are a few key concepts informing effective tests of all sizes.

  • Tests should be designed to fail: naming and organization can clarify intent and cause of failure
    The goal of testing isn’t to make sure tests always pass no matter what. The goal is to write tests that let us know, reliably and accurately, when our expectations of the code’s behavior differ from reality.

  • Use the Arrange-Act-Assert pattern (a.k.a. Given-When-Then) to keep tests focused and readable
    Use the Arrange-Act-Assert (or Given-When-Then) pattern to separate the setup, execution, and validation phases of each test case. This helps ensure each test case remains understandable by keeping each one focused on a specific behavior. There’s a brief sketch of the pattern after this list.

  • Extract helpers to reduce complicated setup, execution, or assertions
    Extract helper methods, classes, or even subsystems to ensure test cases remain brief and read clearly. This is especially important for operations shared across several tests, but can sometimes be valuable even when writing a single test case.

  • Interfaces/seams enable composition, dependency breaking w/ test doubles
    We often have to work with legacy code bases with few tests, if any. So we also need to make safe changes to existing code that enable us to begin improving code quality and adding tests. Michael Feathers’s Working Effectively with Legacy Code is the seminal tome on this subject, showing how to gently break dependencies to introduce seams. “Seams” are points at which we introduce abstract interfaces that enable test doubles to stand in for our dependencies, making tests faster and more reliable.
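
To make the Arrange-Act-Assert pattern concrete, here’s a minimal sketch of a JUnit 5 test case. TemperatureConverter and its celsiusToFahrenheit() method are hypothetical, invented only for this illustration; they’re not part of the example project.

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class TemperatureConverterTest {
  @Test
  void convertsFreezingPointToFahrenheit() {
    // Arrange (Given): set up the object and inputs under test.
    TemperatureConverter converter = new TemperatureConverter();

    // Act (When): perform the single behavior this case validates.
    double fahrenheit = converter.celsiusToFahrenheit(0.0);

    // Assert (Then): check the result against our expectation.
    assertEquals(32.0, fahrenheit, 0.001);
  }
}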

Speaking of interfaces, Scott Meyers, of Effective C++ fame, gave perhaps the best design advice of all for writing testable, maintainable, understandable code in general:

“Make interfaces easy to use correctly and hard to use incorrectly.”

Scott Meyers, The Most Important Design Guideline?

To propose a slight update to make it more concrete:

“Make interfaces easy to use correctly and hard to use incorrectly—like an electrical outlet.”

—With apologies to Scott Meyers, The Most Important Design Guideline?

Image of a 120V 15A electrical outlet representing what Scott Meyers calls
"The Most Important Design Guideline": Make interfaces easy to use correctly and
hard to use incorrectly.
I took this picture of an electrical outlet myself, because I couldn’t find a good free one.

Of course, it’s not impossible to misuse an electrical outlet, but it’s a common, wildly successful example that people use correctly most of the time. Making software that easy to use or change correctly and as hard to do so incorrectly may not always be possible—but we can always try.

Dependency Injection

Interfaces, or seams, are what enable us to apply the “dependency injection” technique to keep our code isolated from its production dependencies. This makes the code easier to test, of course, but it also makes the design of the system more understandable and maintainable in general.

All “dependency injection” means is:

Receiving references to collaborator interfaces as function parameters instead of creating or accessing collaborator implementations directly

So here’s a small example of code that doesn’t use dependency injection:

Before:

public Consumer() {
  this.foo = new ProdFoo();
  this.bar = ProdBar.instance();
}

Consumer creates one collaborator directly and accesses the other as a singleton instance. This means that to test Consumer, we also have to manage the setup and execution of ProdFoo and ProdBar. These dependencies can be complex, slow, and difficult to control, making testing Consumer painful and unreliable.

This is how the code looks after updating it to inject these dependencies:

After:

public Consumer(FooProducer foo, BarProducer bar) {
  this.foo = foo;
  this.bar = bar;
}

We no longer create ProdFoo or access ProdBar directly, but we work with their abstract interfaces instead. Now our tests can use lightweight test doubles to exercise Consumer in isolation, yet easily configure Consumer with its production dependencies at runtime.
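
As a sketch of the payoff, a small test for Consumer might then look something like this. StubFoo, StubBar, produce(), and combine() are hypothetical names invented for this illustration, standing in for whatever the real interfaces declare:

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class ConsumerTest {
  // Hypothetical hand-written test doubles implementing the collaborator interfaces.
  static class StubFoo implements FooProducer {
    public String produce() { return "foo"; }
  }

  static class StubBar implements BarProducer {
    public String produce() { return "bar"; }
  }

  @Test
  void combinesResultsFromItsCollaborators() {
    Consumer consumer = new Consumer(new StubFoo(), new StubBar());

    assertEquals("foo + bar", consumer.combine());
  }
}

The doubles are trivial to construct and completely under the test’s control, which is exactly what makes small tests fast and reliable.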

Test Doubles

Does anyone already know what I mean by the term “test doubles?”

[ WAIT FOR RESPONSES ]

Test doubles are…

Lightweight, controllable objects replacing production dependencies

I always love using my favorite concrete analogy here: practicing alone through a practice amp before plugging into a wall of Marshalls onstage. You still have to rehearse and do a soundcheck with the band, but you need to get your own chops into shape first.

Image of a Marshall MS-2 Micro Amp compared to a wall of Marshall stacks, illustrating the principle of Test Doubles implementing a production dependency interface.
Like test doubles, practice amps make it faster, easier, and less nerve-racking to check your work before going into production.
This image was derived from:
Marshall MS-2 Micro Amp (direct link); Reverb: A History Of Marshall Amps: The Early Years (direct link).

Because test doubles are so lightweight and controllable, they enable us to write faster, more reliable, more thorough tests for small components in isolation. Another major benefit is that they enable separate individuals or teams to work on separate components in parallel, provided stable component interface contracts. We’ll see a sample of how this works when we walk through the example project.

Dependency Injection (continued)

Extract a pure abstract interface from concrete dependencies, then have the dependency—and test doubles—implement the interface.

This is what enabling dependency injection looks like from the point of view of a production dependency that you control.

Before:

class ProdFoo {
  public ProdFoo() {
    // initialize prod dependencies
  }
}

We extract a pure abstract interface based on the existing ProdFoo interface and update ProdFoo to implement that interface explicitly.

After:

class ProdFoo implements Foo {
  public ProdFoo() {
    // initialize prod dependencies
  }
}

That also sets us up for our next step, implementing a test double to stand in for the production dependency:

class TestFoo implements Foo {
  public TestFoo() {
    // initialize test dependencies
  }
}

Usually this is pretty easy, but there are circumstances where doing this in legacy code may not prove so straightforward. The good news is that Michael Feathers’s book provides techniques for getting there one step at a time, but that’s for another presentation.

Dependency Injection with Internal API

For external dependencies, use an Internal API wrapper/adapter/proxy for testability and to limit exposure to changes beyond your control.

Before:

class ExternalFoo {
  public ExternalFoo() {
    // initialize prod dependencies
  }
}

Dependency injection also enables us to reduce our system’s exposure to external dependencies by wrapping them with our own “internal API.”

After:

class FooWrapper implements Foo {
  private ExternalFoo foo;
  public FooWrapper(ExternalFoo f) {
    this.foo = f;
  }
}

This makes it easier to write isolated integration tests against the dependency, called “contract” or “collaboration tests.” At the same time, it becomes easier to test the rest of our system, while using an API that better matches our application domain.
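
As a rough sketch, a contract test for the wrapper might look something like this, assuming the Foo interface declares a hypothetical produce() method; running the same expectation against both the wrapper and the test double helps keep the double faithful to production behavior:

import static org.junit.jupiter.api.Assertions.assertNotNull;

import org.junit.jupiter.api.Test;

class FooContractTest {
  // Medium test: exercises the real external dependency through our internal API.
  @Test
  void wrapperProducesAValueFromTheRealDependency() {
    Foo foo = new FooWrapper(new ExternalFoo());

    assertNotNull(foo.produce());  // produce() is a hypothetical Foo method
  }

  // The same expectation against the double keeps it honest.
  @Test
  void testDoubleHonorsTheSameContract() {
    Foo foo = new TestFoo();

    assertNotNull(foo.produce());
  }
}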

It also helps avoid version or vendor lock-in. If something in the dependency changes, or a major upgrade comes out, you generally only have to update a single class, not your whole system. Trying or switching to a different implementation, such as moving from MySQL to Postgres, becomes much easier.

Dependency Injection with Weld

You shouldn’t need a framework, but when in Rome…

For some reason, the Java community thought it was a good idea to develop dependency injection frameworks instead of wiring up dependencies directly from main(). In particular, the Servlet API insists on instantiating Servlets with a zero-argument constructor. So instead of constructor injection, you have two options: create an injectable parent class, and derive a subclass with production dependencies; or use a framework.

Normally I’d avoid DI frameworks, but given their prevalence, I’ve used Weld, the standard Java reference implementation, for this example. In a lot of cases, you can still use constructor injection by adding the @Inject annotation to the constructor.

Preferred:

class Servlet extends HttpServlet {
  @Inject
  public Servlet(Calc calc) {
    this.calculator = calc;
  }
}

Weld will then find the right implementation at runtime, based on annotations added to your production classes and other configuration magic.

However, when it comes to servlets, which are instantiated using zero-argument constructors, this isn’t an option, so you have to use private field injection.

Last resort (ick! but if necessary):

class Servlet extends HttpServlet {
  @Inject private Calc calculator;
}

At runtime, the container—Tomcat in our case—will instantiate the Servlet, and then allow Weld to initialize the private field. I find this personally weird and distasteful, but that’s the enterprise Java ecosystem for you.

However, you can still use both constructor and field injection in the same class. We’ll see in the example code where we use constructor injection in our smaller tests, and field injection in our larger tests.

Dependency Injection (continued again)

Here’s the ultimate point of dependency injection:

Prefer interface-based composition to implementation inheritance: it’s more explicit, understandable, and flexible, less prone to errors from invisible inherited default behaviors—and it builds faster.

Interfaces/Seams + Composition >>> Implementation Inheritance

Let’s break down why this is the case. I’m going to break the golden rule of not reading a slide verbatim, because I think it’s worth being totally explicit here, however boring.

  • The biggest obstacle to good testing is large, complicated, slow, unreliable dependencies that you can’t control—especially inherited ones.
  • The next biggest obstacle is using large tests by default for everything and making them too specific to remain reliable as the system evolves.
  • Interfaces/seams enable composable design via dependency injection.
  • DI provides control over dependency behavior in tests, enabling thorough testing of each component in isolation, reducing reliance on larger tests.

Any questions or comments?

[ WAIT FOR RESPONSES ]

Overview of Example Architecture and Technology

Now let’s see these concepts in action in the example project.

Goals of the Example Project

As we skim through the code, please keep in mind the specific goals I was aiming for while creating it. Again, I’m going to be boring and just read the points here.

I wanted to…

  • Go beyond a basic, isolated exercise to demonstrate tests at every level of the Test Pyramid—showing how unit testing is essential to the big picture
  • Resemble a production application, with minimal functionality and complexity to allow the design and testing strategy to stand out
  • Show how to develop and test every level of the system independently, while integrating into a cohesive whole—via interfaces, dependency injection, test doubles
  • Provide ample examples and documentation to support further study

Full Test Suite and App Demo

github.com/mbland/tomcat-servlet-testing-example

Here we go…

[ FOLLOW LINK AND SKIM AROUND USING THE FOLLOWING SLIDES TO ANCHOR THE OVERVIEW ]

Example Architecture and Technology

Example Testing Technology

Reusable Test Helpers and Other Packages

Utilities that have proven useful beyond a single project

  • test-page-opener (npm): enables validating initial page state without Selenium
  • rollup-plugin-handlebars-precompiler (npm): began as a test helper, led to core architectural pattern
  • TestTomcat (Java): runs Tomcat in the same process for medium JUnit and large Selenium tests

Tomcat Servlet Testing Example technology matrix

Here’s a matrix mapping most of these technologies to their role in the frontend or backend relative to each test size.

Tomcat Servlet Testing Example technology matrix: technologies mapped by test size and frontend/backend.

The same information as above, but in a scrollable HTML table:

Size | Frontend / JavaScript / pnpm | Backend / Java / Gradle + pnpm
Large (System, E2E) | N/A | Selenium WebDriver; TestTomcat, Weld, headless Chrome; (No code coverage for large tests)
Medium (Integration) | Vitest + Vite; JSDom, chrome, test-page-opener; Istanbul, Node.js | JUnit 5, Hamcrest matchers; TestTomcat, StringCalculator doubles; JaCoCo (for smaller medium tests)
Small (Unit) | Vitest + Vite; JSDom, headless Chrome; Istanbul, Node.js | JUnit 5; No test helpers in this example; JaCoCo

Key:

  • Testing framework
  • Test helpers
  • Coverage library

Other info:

  • CI: GitHub Actions
  • Coverage report: coveralls.io

Gradle configuration

strcalc/build.gradle.kts, et al.

  • Also: settings.gradle.kts, gradle.properties, gradle/libs.versions.toml
  • Builds and tests both the frontend and backend project components
  • Centered around the Gradle War Plugin—frontend builds into strcalc/build/webapp, backend Tomcat config in strcalc/src/webapp
  • Custom schema to separate Small/Medium/Large Java tests
  • Several other little tricks, documented in the comments

Handlebars component pattern

Wraps the Handlebars API to precompile + bundle via Vite, Rollup

  • Handlebars templates reside in *.hbs files
  • Components using them reside in corresponding *.js files
  • Vite/Vitest uses rollup-plugin-handlebars-precompiler to transform *.hbs files into JavaScript modules containing precompiled Handlebars templates
  • The *.hbs files are imported directly, Template() returns DocumentFragment
  • Each *.hbs + *.js component tested by corresponding *.test.js

Page Object Pattern

selenium.dev/documentation/test_practices/encouraged/page_object_models

  • Useful for large Selenium tests, some small frontend component tests
  • Encapsulates details of a specific page’s structure
  • Exposes meaningful methods used to write tests
  • Don’t contain assertions, but make test assertions easier to write and read
  • Can be composed of Page Objects representing individual components (though this example doesn’t go that far)
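
Here’s a minimal sketch of what a Page Object for the calculator page might look like in Java with Selenium WebDriver. The element IDs and method names are hypothetical; the real Page Objects live in the example repository.

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;

class CalculatorPage {
  private final WebDriver driver;

  CalculatorPage(WebDriver driver) {
    this.driver = driver;
  }

  // Expose meaningful operations, not raw element lookups.
  void enterNumbers(String numbers) {
    driver.findElement(By.id("numbers")).sendKeys(numbers);
  }

  void submit() {
    driver.findElement(By.id("submit")).click();
  }

  // No assertions here; tests assert on the values this method returns.
  String result() {
    return driver.findElement(By.id("result")).getText();
  }
}

A large test then reads as a sequence of meaningful operations, with its assertions written against the values the Page Object returns.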

Test-Driven Development with the String Calculator Kata

Now that you’ve got the lay of the land, let’s add the core business logic to our application using Test-Driven Development.

The Red-Green-Refactor cycle

The core process of TDD consists of three steps known as the “Red-Green-Refactor” cycle:

Illustration of the Red-Green-Refactor Cycle from Test-Driven Development: write failing test, make test pass, improve code quality.
  • Write a failing test before writing the production code.
  • Make the test pass by writing only enough production code to do so.
  • Improve code quality by refactoring the existing code before writing the next failing test.

Any questions so far?

[ WAIT FOR RESPONSES ]

Refactoring

As defined by Martin Fowler in Refactoring

For those who are familiar with the term, how would you define “refactoring”?

[ WAIT FOR RESPONSES ]

This is how Martin Fowler, the author of the Refactoring book, defines the term:

Refactoring is a disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior.

Its heart is a series of small behavior preserving transformations. Each transformation (called a “refactoring”) does little, but a sequence of these transformations can produce a significant restructuring. Since each refactoring is small, it’s less likely to go wrong. The system is kept fully working after each refactoring, reducing the chances that a system can get seriously broken during the restructuring.

I placed the emphasis on “small behavior preserving transformations” and “the system is kept fully working.” Refactoring is not rewriting the whole thing, or breaking it for days or weeks on end. Having good tests is essential to being able to refactor a little bit at a time, all the time, to keep code healthy. This way you aren’t ever tempted to attempt big, risky rewrites.

Any thoughts or questions about refactoring?

[ WAIT FOR RESPONSES ]

String Calculator Kata, Step 1

osherove.com/tdd-kata-1

Let’s dive into our TDD example, the String Calculator by Roy Osherove.

  • Create a StringCalculator with the method: int Add(String numbers)
  • Take up to two numbers, separated by commas, and return their sum
  • For example: “” (return 0) or “1” or “1,2” as inputs
  • Start with the empty string, then move to one number, then two numbers
  • Solve things simply; force yourself to write tests you might not think about
  • Refactor after each passing test

Setup Process

This is how we’re going to get started:

  • Set up and test basic infrastructure (this part’s already done)
  • Write a test that fails to compile—compilation failures are failures!
  • Write just enough code for the test to compile and pass
  • It may take a while to set up new infrastructure, but it pays off immediately

Setup Process Demo

[ PERFORM THE FIRST PART OF THE DEMO, RUNNING ALL TESTS AS THE FIRST STEP ]

Setup Process Steps

Breaking it down microscopically

Here you can see this is exactly what I just did.

  • Ensure project builds
  • Instantiate class
  • Write class
  • Add test for “”
  • Write method, return 0
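
In code, the end state of these setup steps might look something like the following sketch, assuming JUnit 5 and Java naming conventions (add() rather than Add()); the actual demo code lives in the example repository.

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class StringCalculatorTest {
  // This first test fails to compile until StringCalculator and add() exist.
  @Test
  void addEmptyStringReturnsZero() {
    StringCalculator calc = new StringCalculator();

    assertEquals(0, calc.add(""));
  }
}

// Just enough production code to compile and pass:
class StringCalculator {
  public int add(String numbers) {
    return 0;
  }
}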

I made sure the project could build before I did anything else, then followed the steps to introduce a failing test before making a change. Any questions?

[ WAIT FOR RESPONSES ]

Red-Green-Refactor Process

Now I’ll follow the process to add one piece of behavior at a time:

  • Decide on a behavior to implement
  • Write a failing test for the new piece of behavior
  • Write just enough code to get the new test to pass
  • Improve the existing code if desired, ensuring all tests continue to pass
  • Repeat for next behavior

Red-Green-Refactor Process Demo

[ PERFORM THE SECOND PART OF THE DEMO, RUNNING ALL TESTS AS THE LAST STEP ]

Red-Green-Refactor Process Steps

Again, you can see this is exactly what I just did.

  • Add test for “1”
  • Implement behavior
  • Add test for “1,2”
  • Implement behavior
  • Refactor
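
Continuing the earlier sketch, the new test cases and one possible implementation after the refactor step might look like this (again, the real solution lives in the example repository):

// Added to StringCalculatorTest from the earlier sketch:
@Test
void addSingleNumberReturnsThatNumber() {
  assertEquals(1, new StringCalculator().add("1"));
}

@Test
void addTwoNumbersReturnsTheirSum() {
  assertEquals(3, new StringCalculator().add("1,2"));
}

// One way StringCalculator.add() might look after refactoring:
public int add(String numbers) {
  if (numbers.isEmpty()) {
    return 0;
  }
  int sum = 0;
  for (String number : numbers.split(",")) {
    sum += Integer.parseInt(number);
  }
  return sum;
}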

Add a failing test, add enough code to make it pass while ensuring all other tests pass, then refactor if desired. We proceed methodically, leaving things broken for only a few minutes or seconds. That way we remain in control of the changes we’re making at every step.

Questions?

[ WAIT FOR RESPONSES ]

String Calculator Kata Exercise Instructions

osherove.com/tdd-kata-1

Now it’s your turn to have some fun!

[ BRIEFLY REVIEW THE GUIDANCE BEFORE BEGINNING THE EXERCISE ]

  • git clone https://github.com/mbland/tomcat-servlet-testing-example
  • Consider finding a partner to pair program (encouraged, not required)
  • Try to finish step 2: Allow add() to handle an unknown amount of numbers
  • Go further if there’s time, but don’t feel pressured to do the whole thing
  • Call out if you have any questions or need any assistance
  • See you in 15 minutes!

[ ASK RETURNING PARTICIPANTS HOW THE EXERCISE WENT ]

Really Tying the Room Together

Now that you’re aware of the Test Pyramid, and have seen it in action, I’d like to share some more helpful details and guiding principles.

StringCalculator could be…

…any self-contained component of business logic

For our purposes, StringCalculator represents any arbitrary dependency or piece of business logic within an entire system. The point is that with the right interface, we can work more easily with any implementation, such as:

  • A Database or HTTP API access layer
  • A service/servlet from another process, local or remote
  • A backend…which we’re already substituting in the frontend!
  • A test double…which we’re already using in ServletContractTest!

Image of a Marshall MS-2 Micro Amp compared to a wall of Marshall stacks, illustrating the principle of Test Doubles implementing a production dependency interface.

Any questions or comments?

[ WAIT FOR RESPONSES ]

Testing, “Mocking,” and DI Frameworks

Here are a few things to keep in mind about the various frameworks that are available to help you write tests.

  • Testing frameworks help organize common setup, teardown, and other details into “test fixtures” to keep individual test cases small and focused
    Testing frameworks exist to help organize your individual test cases while sharing common resources and setup and teardown methods. These clusters of shared resources and methods are known as “test fixtures”.

  • JUnit, Vitest, and many others descend from the xUnit concept
    Most common testing frameworks descend from the popular xUnit pattern in some way, the most obvious and well known being JUnit. However, even more recent frameworks like pytest, mocha, Vitest, etc., still implement many core xUnit concepts in their own way.

  • “Mocking” frameworks are helpful for creating test doubles, but not essential
    So-called “mocking” frameworks make it easier to write and maintain test doubles. I say so-called because a “mock” is only one kind of test double, and is actually the kind you should use as a last resort. Stubs, spies, and fakes are preferable in most cases. Either way, “mocking” framework is the name that has stuck, and while these frameworks are often quite convenient, they’re not essential. You can still write your own test doubles without them if need be.

  • You don’t need a dependency injection framework—main() will do…
    In general, a dependency injection framework is never really necessary. You can always wire up your object graph in main() or via your testing framework’s test fixture mechanism, as sketched after this list.

  • …unless a Java library practically forces you to use one (e.g., Tomcat + Weld)
    However, you may encounter DI frameworks in the Java world, especially when implementing code to plug into an enterprise framework like Tomcat. Technically, you can still apply DI without a framework even in this case, but it’s arguably better to just reach for the nearest framework instead.
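
As a sketch of the “main() will do” point, wiring up the earlier Consumer example by hand might look like this; run() is a hypothetical entry method, and ProdFoo and ProdBar stand in for whatever production implementations the constructor’s interfaces require.

class Main {
  public static void main(String[] args) {
    // Construct the production implementations once, at the entry point,
    // and inject them; Consumer never creates its own dependencies.
    Consumer consumer = new Consumer(new ProdFoo(), ProdBar.instance());
    consumer.run();
  }
}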

Any questions?

[ WAIT FOR RESPONSES ]

Test Fixtures

Shared state, helpers, and lifecycle methods

Let’s learn some details about how JUnit and Vitest, in particular, help us organize our test cases into test fixtures.

  • describe in Vitest, normal class with @Test methods in JUnit
    Test fixtures are encapsulated within describe blocks in Vitest. Old versions of JUnit used to require inheriting from a base class, and some other common frameworks still do. JUnit 5, however, considers any class containing a method with an @Test annotation to be a test fixture; there’s a brief sketch of one after this list.

  • Add data members, objects, and helpers as necessary
    Fixtures should contain any data members, objects, and helper functions or objects your test cases require. The testing framework will use the fixture object to set up and tear down these shared resources automatically.

  • beforeAll runs before all test methods, beforeEach runs before each
    In both frameworks, you’ll see functions named or annotated with the name beforeAll or beforeEach. The framework calls beforeAll to set up any necessary state before running any of the test case methods, and beforeEach before every one of them.

  • Converse for afterEach, afterAll
    The converse is true of the afterEach and afterAll methods, which are used to clean up any state generated by test cases.

  • setUp and tearDown variations correspond to beforeEach, afterEach
    These historically used to be named setUp and tearDown, but the before and after names have become more popular recently.

  • before/after methods ensure consistent environment for every test case
    The point of the before and after methods is to ensure every test case runs under a consistent and correct environment.
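
Here’s a minimal JUnit 5 sketch putting these lifecycle methods together. WidgetRepository and TestDatabase are hypothetical, invented only to show the shape of a fixture.

import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

class WidgetRepositoryTest {
  static TestDatabase db;       // shared, expensive resource (hypothetical)
  WidgetRepository repository;  // fresh object for every test case (hypothetical)

  @BeforeAll
  static void startDatabase() {  // runs once, before any test case
    db = TestDatabase.start();
  }

  @BeforeEach
  void createRepository() {      // runs before each test case
    repository = new WidgetRepository(db.connection());
  }

  @AfterEach
  void clearData() {             // runs after each test case
    db.clear();
  }

  @AfterAll
  static void stopDatabase() {   // runs once, after all test cases
    db.stop();
  }

  @Test
  void findsNothingWhenEmpty() {
    assertTrue(repository.findAll().isEmpty());
  }
}

Vitest follows the same shape, with describe() wrapping beforeAll(), beforeEach(), afterEach(), afterAll(), and the individual test cases.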

In summary, test fixtures keep the logic for each test case small, focused, and separate from other test cases. This makes tests easier to write, understand, and maintain. Questions?

[ WAIT FOR RESPONSES ]

Test Naming

No, a rose by any other name will not smell as sweet.

TODO(mbland): Continue from here

  • Include the class and/or function or method name in the fixture name

  • Have the name tell you something concrete when you read it and when it fails

  • A bad test case name requires deeper investigation: testFoo()

  • Better test case names in FooTest fixture indicate nature of tests/failures: returnsBarOnSuccess(), throwsFooExceptionIfBarUnavailable()
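
A tiny sketch of the difference, using the hypothetical FooTest fixture named above; when these names show up in a failure report, they already tell you which behavior broke before you open the test code.

import org.junit.jupiter.api.Test;

class FooTest {
  @Test
  void returnsBarOnSuccess() { /* ... */ }

  @Test
  void throwsFooExceptionIfBarUnavailable() { /* ... */ }
}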

Test Organization

Make tests readable by making information explicit, yet concise.

  • Long test cases with short names are doing too much to understand easily…
  • …but don’t go overboard with overly microscopic cases and long names

  • Test cases can look very similar, so long as differences jump out

  • Avoid conditionals—test cases should read and behave like individual requirements

  • Write setup helpers, custom assertions for common operations (possibly containing conditionals)

Avoid table-driven tests

Wrong trade-off between duplication and maintainability

  • Tables can be OK in some cases, but are often the wrong choice

  • They reduce duplication, but break when input+expected data structures change, often contain conditionals, are harder to understand when they fail

  • Prefer to write a custom validation function (or functions) instead

  • Give these custom validators meaningful names

  • Then collapse very closely related inputs into related cases if appropriate
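
As a sketch of this approach, here’s what a named validation helper might look like in place of a table, reusing the String Calculator from the kata:

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class StringCalculatorAddTest {
  private final StringCalculator calc = new StringCalculator();

  // A custom validator with a meaningful name; any conditionals or loops
  // live here, not in the test cases themselves.
  private void assertAddReturns(int expected, String numbers) {
    assertEquals(expected, calc.add(numbers), "add(\"" + numbers + "\")");
  }

  @Test
  void addReturnsZeroForEmptyString() {
    assertAddReturns(0, "");
  }

  @Test
  void addReturnsSumForCommaSeparatedNumbers() {
    // Closely related inputs collapsed into one case; the differences jump out.
    assertAddReturns(1, "1");
    assertAddReturns(3, "1,2");
    assertAddReturns(6, "1,2,3");
  }
}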

Test-Driven Development

Essential principles

  • Essentially about reducing unnecessary complexity, risk, waste, and suffering

  • Is an effective methodology to aid understanding—not a shortcut or a religion

  • More about developing a “Quality Mindset” than a specific technique or tech

  • Is most accessible to newcomers via writing small unit tests…

  • …but applies to system design and testing at all levels of the Test Pyramid

Test-Driven Development goals

What we want to achieve on a technical level

  • Focus on and simplify one behavior/problem at a time
  • Make it easy to keep tests passing by not doing too much at once
  • Keep tests isolated from each other, keep them fast
  • Control your dependencies so they don’t control you—flakiness is the enemy
  • Encourage the Single Responsibility Principle, composition, polymorphism
  • See: SOLID principles, FIRST principles

Benefits of TDD/Testing/Testable Code

What we want to achieve on a business/quality of life level

  • Testable code/architecture is maintainable—tests add design pressure, enable continuous refactoring
  • Increased ability to work independently, especially for new hires
  • Streamlined code reviews by keeping focus on new/changed code
  • Instill greater confidence and enable faster feature velocity
  • Increased understanding—reduced complexity, risk, waste, and suffering

Test First vs. Test With

Don’t get hung up thinking that if you don’t test first, you don’t test at all.

  • Test-First is great if you already know what you want, or if it helps you think through what you want
  • Test-With requires more careful validation of results, but may fit your brain better than Test-First most of the time
  • It’s more essential that there are good, fast, thorough, understandable, reliable tests with your changes than exactly when you wrote them.

Exploration, Settlement, and Vision

Writing the right tests when the time’s right

Phase | Focus | To test or not to test?
Exploration | Learning the shape of the problem and solution spaces, rapid prototyping | Not: Still learning basic things, don’t commit too early
Settlement | Parts of the problem coming into focus, beginning to depend on core components | Test: Core components must keep working while exploring further
Vision | All expectations and the end goal are clear | Test: Confirm expectations hold up to reality, ensure others aligned with vision

Continuous integration/short lived branches

Break early, break often, fix immediately

  • git is designed to enable frequent merging of many branches
  • Small, short-lived branches ensure changes stay in sync with one another
  • Automated testing is essential to validating this merging and synchronization
  • Continuous integration provides the system of record and source of truth
  • Short-lived branches + automated testing == continuous integration
  • Attention remains focused on code review, not fear or debugging

Code Coverage

Use it as a tool to provide visibility, not merely as a goal

  • 100% is ideal, not strictly necessary—but exceptions should be noted
  • It can tell you when you’re definitely not done, not when you are done
  • How you get it is more important than how much you get
  • Collect from small-ish tests, each covering little, collectively covering much
  • Large-ish tests should define contract/use case coverage, not code coverage
  • Can help with aggressive refactoring when done well (extracting objects, etc.)

The Quality Mindset

Transcending test sizes, technologies, or application domains

  • Don’t submit copy/pasted code! Duplication hides bugs. (e.g., goto fail)
  • Send several small, understandable changes for review, not one big one
  • Reproduce bugs with the smallest test possible, then fix them
  • Strive to minimize unnecessary complexity, risk, waste, and suffering!
  • Strive to maximize clarity, confidence, efficiency, and delight!

mike-bland.com/test-pyramid-in-action
mike-bland.com/making-software-quality-visible

Footnotes

  1. “Flaky” means that a test will seem to pass or fail randomly without a change in its inputs or its environment. A test becomes flaky when it’s either validating behavior too specific for its scope, or isn’t adequately controlling all of its inputs or environment—or both. Common sources of flakiness include system clocks, external databases, or external services accessed via REST APIs.

    A flaky test is worse than no test at all. It conditions developers to spend the time and resources to run a test only to ignore its results. Actually, it’s even worse—one flaky test can condition developers to ignore the entire test suite. That creates the conditions for more flakiness to creep in, and for more bugs to get through, despite all the time and resources consumed.

    In other words, one flaky test that’s accepted as part of Business as Usual marks the first step towards the Normalization of Deviance.

    There are three useful options for dealing with a flaky test:

    1. If it’s a larger test trying to validate behavior too specific for its scope, relax its validation, replace it with a smaller test, or both.
    2. If what it’s validating is correct for its scope, identify the input or environmental factor causing the failure and exert control over it. This is one of the reasons test doubles exist.
    3. If you can’t figure out what’s wrong or fix it in a reasonable amount of time, disable or delete the test.

    Retrying flaky tests is NOT a viable remedy. It’s a microcosm of what I call in this presentation the “Arms Race” mindset. Think about it:

    • Every time a flaky test fails, it’s consuming time and resources that could’ve been spent on more reliable tests.
    • Even if a flaky test fails on every retry, people will still assume the test is unreliable, not their code, and will merge anyway.
    • Increasing retries only consumes more resources while enabling people to continue ignoring the problem when they should either fix, disable, or delete the test.
    • Bugs will still slip through, introduce risk, and create rework even after all the resources spent on retries.

  2. The last thing you want to do with a flaky or otherwise consistently failing test is mark it as a “known failure.” This will only consume time and resources to run the test and complicate any reporting on overall test results.

    Remember what tests are supposed to be there for: To let you know automatically that the system isn’t behaving as expected. Ignoring or masking failures undermines this function and increases the risk of bugs—and possibly even catastrophic system failure.

    Suppose you know that a flaky or failing test needs to be fixed, not discarded. If you can’t afford to fix it now, and you can still afford to continue development regardless, then disable the test. This will save resources and preserve the integrity of the unambiguous pass/fail signal of the entire test suite. Fix it when you have time later, or when you have to make the time before shipping.

    Note I said “if you can still afford to continue development,” not “if you must continue development.” If you continue development without addressing problems you can’t afford to set aside, it will look like willful professional negligence should negative consequences manifest. It will reflect poorly on you, on your team, and on your company.

    Also note I’m not saying all failures are necessarily worthy of stopping and fixing before continuing work. The danger I’m calling out is assuming most failures that aren’t quickly fixable are worth setting aside for the sake of new development by default. Such failures require a team discussion to determine the proper course of action—and the team must commit to a clear decision. The failure to have that conversation or to commit to that clear decision invites the Normalization of Deviance and potentially devastating risks. 

  3. I first learned about this concept from an Apple internal essay on the topic. 

  4. Frequent demos can be a very good thing—but not when making good demos is appreciated more than high internal software quality and sustainable development.