Small, Medium, Large

The Testing Grouplet's terminology for getting Google engineers to think about the different scopes of automated tests

01 Nov 2011 - New York
Tags: Google, SML, Test Certified, Testing Grouplet, grouplets, technical

“Small”, “medium”, and “large” (plus, later, “enormous”) were the replacement terms, introduced by the Testing Grouplet in 2006, for “unit” and “regression” tests at Google. They became part of the core vocabulary for Test Certified (TC) and Testing on the Toilet (TotT), and as such provided a new conceptual framework that influenced the way Google engineers thought about and wrote both automated tests and production code. Though the small/medium/large concepts were eventually mapped onto development and test tools, they solved a problem that technology alone could not address: How to get engineers to reason more effectively about how to design their code, interfaces and systems for testability.

What’s past is prologue
A rose by any other name
Words, words, words
Sound and fury
To hold as ’twere the mirror up to nature
Must like a whore unpack my heart with words

This may seem like a lot of explanation behind a pretty straightforward mapping from “unit” => “small”, “integration” => “medium”, and “system” => “large”, but as with many things in life, and language in particular, there’s a historical context from which this tactic emerged. Google was a very different place in 2005, the year the Testing Grouplet started, in terms of its development environment and common practices.

What’s past is prologue

In the beginning, the Google build system supported “unit” tests and “regression” tests. At that time, Google engineering wasn’t very precise about the existing terms: a “unit” test usually didn’t launch multiple processes or talk to servers running beyond the environment of the test, and ran no longer than a few minutes; a “regression” test could potentially interact with production services and take hours to run. In practice, a fuzzy boundary based on execution time was the deciding factor between the two, and it was more of a rule of thumb than an agreed-upon convention.

However, the problems ran way deeper than that, and precision in testing terminology was not even in the ballpark of concerns for most engineers. This was well before the insanely great new breed of internal build tools that eventually rolled out starting in January 2008. I’ll describe the full scenario as best I can in a later post, but for now, just understand that the tools then were far less strict and far less powerful than they are today, while more engineers and more code started pouring in at an alarming rate, compounding our already formidable dependency and code health challenges.

Several factors piled up: the lazy terminology; the lack of internal testing and design knowledge; the early-version BUILD language, which didn’t document or enforce interface boundaries and dependency declarations very well; and the growth of the company and the code base far beyond the scale at which the existing tools could effectively enable rapid iteration. As a result, testing was often seen as a luxury at best, and a waste of time at worst. We knew we needed better tools, and they were on the way, but we didn’t expect any silver bullet, or cartridge of silver bullets, to fix what we perceived as a root cause at least as important as the tooling issues.

Again paraphrasing Saul Alinsky from Rules For Radicals: If people don’t believe they have the power to change their situation for the better, they won’t even think about trying. But if they have the power to do the right thing, they’ll do it. The tools would supply a lot of power in a couple of years, but we sensed that tools alone weren’t enough: We needed to tell people what the right thing was.

A rose by any other name

The Testing Grouplet eventually realized that the disintegration of the existing terms made it extremely difficult to have a meaningful, productive discussion about ideal testing practice. The terms no longer carried any concept of scope, or of the type of work each test should be doing. This lack of conceptual clarity compounded the problems stemming from the inability of the tools of the time to scale, but couldn’t be cured with better tools alone. We decided to propose new terms that we could precisely define to engage those not yet versed in the ways of developer testing, to more easily introduce them to necessary concepts without the baggage of the old terms.

We identified early on that there were three specific levels we wanted to discuss: “unit”; “integration”; and “system”. Of course, “regression” doesn’t fit here, as it’s an orthogonal concept: Any type of test can be a regression test, because regression tests are intended to reproduce a specific bug at any of these three levels to ensure that a specific change fixes the bug; and the test is retained to ensure the fix isn’t undone by subsequent changes, i.e. that a “regression” doesn’t occur. The regression_test build rule might’ve had that intention in mind, but in practice it was used primarily to identify long-running tests.

However, we were afraid that the names of these levels would not be sufficiently intuitive to prevent them from losing their meaning, and re-introducing the term “unit” seemed a liability, even if we supplied a specific definition. We needed a break from the past, but one that had at least the promise of intuitive staying power. We debated this for at least a few weeks, if not months, considering and rejecting several proposed combinations of terms.

Footage from the Testing Grouplet discussions of the time

Though my memory doesn’t serve, Matthew Springer reassures me that he was the one who originally suggested the small/medium/large (SML) labels, also now widely referred to as “test sizes”. Everyone involved in the discussions, including representatives from Testing Technology who would work with Build Tools to extend the BUILD language to support systems based on the new terminology, agreed that this combination made intuitive sense, while being different enough to avoid carrying the baggage of the previous terms into future discussions. We defined the labels roughly thus:

Small (Unit): Very fine-grained; exercises low-level logic at the scope of a function or a class; no external resources (except possibly a small data file or two, but preferably no file system dependencies whatsoever); very fast execution on the order of seconds
Medium (Integration): Exercises interaction between discrete components; may have file system dependencies or run multiple processes; runs on the order of minutes
Large (System): Exercises the entire system, end-to-end; used to identify catastrophic errors and performance bottlenecks at scale; may launch or interact with services in a datacenter, preferably with a staging environment to avoid affecting production (especially user traffic, and most especially advertising traffic!); can run on the order of minutes or hours
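The three definitions above can be written down as a small lookup table. This is only a sketch: the names `Size`, `SizeProfile`, and `PROFILES` are made up for illustration, and the boolean flags are one reading of the prose (for instance, that large tests may also use the file system and multiple processes), not settings taken from any Google tool.

```python
from dataclasses import dataclass
from enum import Enum


class Size(Enum):
    SMALL = "small"    # "unit"
    MEDIUM = "medium"  # "integration"
    LARGE = "large"    # "system"


@dataclass(frozen=True)
class SizeProfile:
    scope: str
    file_system: bool          # may depend on the file system
    multiple_processes: bool   # may launch more than one process
    datacenter_services: bool  # may talk to services in a datacenter
    typical_runtime: str


PROFILES = {
    # Small: preferably no file system use at all, though a small data
    # file or two is tolerated; modeled here as False.
    Size.SMALL: SizeProfile(
        "a function or a class", False, False, False, "seconds"),
    Size.MEDIUM: SizeProfile(
        "interaction between discrete components", True, True, False, "minutes"),
    # Large: file system and multi-process use are assumed, not stated.
    Size.LARGE: SizeProfile(
        "the entire system, end-to-end", True, True, True, "minutes or hours"),
}

for size, profile in PROFILES.items():
    print(f"{size.value}: {profile.scope}; runs in {profile.typical_runtime}")
```

Writing the constraints as data makes the point of the labels visible: each size is defined by what a test is allowed to touch, with runtime as a consequence.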

Some time later, after the new build tools were rolled out, the “enormous” size was added to indicate a test that was somehow “too big” to execute automatically in the shared datacenter execution pool, such that the Chris/Jay Continuous Build system and, later, the Test Automation Platform (TAP) could run “large” tests, but not “enormous” ones. The concept of test “tags” was also introduced, so that tests that weren’t exactly “enormous” but had some dependencies on NFS or other corp infrastructure inaccessible from the datacenters could be marked “local” to be run on the developer’s workstation. And, later, the “flaky” attribute was added to give teams clarity about which tests would pass and fail (seemingly) randomly without a change in controlled inputs (implying an uncontrolled input somewhere), helping meet Test Certified Level 1 requirements.
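The rule described here, that automated systems skip “enormous” tests and tests tagged “local”, can be sketched in a few lines. The `Target` class and `runs_in_shared_pool` function are hypothetical names for illustration; the real systems surely considered more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    name: str
    size: str                       # "small", "medium", "large", or "enormous"
    tags: frozenset = frozenset()   # e.g. frozenset({"local"})
    flaky: bool = False             # fails (seemingly) at random


def runs_in_shared_pool(target):
    """Enormous tests and tests tagged "local" are left to manual runs."""
    return target.size != "enormous" and "local" not in target.tags


targets = [
    Target("parser_test", "small"),
    Target("end_to_end_test", "large", flaky=True),
    Target("nfs_import_test", "medium", tags=frozenset({"local"})),
    Target("full_index_rebuild_test", "enormous"),
]

for t in targets:
    where = "shared pool" if runs_in_shared_pool(t) else "workstation only"
    note = " (flaky)" if t.flaky else ""
    print(f"{t.name}: {where}{note}")
```

Note that `flaky` does not change where a test runs in this sketch; it only records that failures may not reflect a real change.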

Words, words, words

So now that we’d clarified the concepts of SML, we needed to figure out what to tell people to do with them. We concocted an idea of the right “balance” of each test size: roughly 70% small, 20% medium, and 10% large for the common case. Yes, these numbers essentially were pulled out of a hat. But they were a very useful starting point for discussion and goal-setting, and make intuitive sense:

You want a lot of fast, focused small tests that you can run frequently to ensure that a low-level change is free of negative side-effects, without breaking the “flow” state while developing.
You want a decent-sized layer of medium tests to ensure that contracts are honored at interface and component boundaries.
You want a few large tests to provide confidence that the end-to-end system is hanging together without any pieces falling off.
For a production system or framework, you want a balance of all three, not to explicitly rely on only one size alone; a team developing only pure in-process library code may get away without large tests, but most other projects should most definitely have them.
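To make the 70/20/10 starting point concrete, here is a minimal sketch that compares a suite's actual mix of sizes against that target. The function names and the sample suite are invented for illustration; the target figures are the ones given above.

```python
from collections import Counter

# The rough starting-point balance described above.
TARGET_BALANCE = {"small": 0.70, "medium": 0.20, "large": 0.10}


def size_balance(sizes):
    """Return the fraction of tests at each size."""
    counts = Counter(sizes)
    total = sum(counts[s] for s in TARGET_BALANCE)
    if total == 0:
        return {s: 0.0 for s in TARGET_BALANCE}
    return {s: counts[s] / total for s in TARGET_BALANCE}


def deviation_from_target(sizes):
    """Positive means more tests of that size than the target suggests."""
    actual = size_balance(sizes)
    return {s: actual[s] - TARGET_BALANCE[s] for s in TARGET_BALANCE}


# Hypothetical suite: 50 small, 30 medium, 20 large tests.
suite = ["small"] * 50 + ["medium"] * 30 + ["large"] * 20
print({s: round(d, 2) for s, d in deviation_from_target(suite).items()})
# {'small': -0.2, 'medium': 0.1, 'large': 0.1}
```

A result like this one is a prompt for discussion (are there too few small tests, or is the project unusual?) rather than a verdict.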

A few of us met specifically to discuss how to get support from the Testing Technology and Build Tools teams, and all involved became personally invested in accurately defining and actively preserving the semantics of the new terms as the basis for the tools that would one day implement their policies and take advantage of the information implied in each. Soon the necessary BUILD language support was implemented by the Build Tools team; the *_unittest and regression_test BUILD rules were deprecated in favor of the new *_test rules containing the size="..." attribute.
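For readers who have never seen such a rule, here is roughly what a `*_test` rule with a `size` attribute looks like in Bazel, the open-source build tool that grew out of Google's internal build system. This is a BUILD file fragment, not a standalone program. The target and file names are hypothetical, and the internal rules of the time may have differed in detail.

```python
# BUILD file fragment (Starlark, Bazel's Python-like configuration language).
py_test(
    name = "parser_test",  # hypothetical target
    size = "small",
    srcs = ["parser_test.py"],
    deps = [":parser"],
)

py_test(
    name = "frontend_backend_test",  # hypothetical target
    size = "medium",
    srcs = ["frontend_backend_test.py"],
    deps = [
        ":backend",
        ":frontend",
    ],
    flaky = True,      # known to fail intermittently
    tags = ["local"],  # must run on the developer's machine
)
```

The size sits right next to the dependency list, so choosing it is part of declaring the test rather than something inferred later.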

We announced the new scheme to Google engineering, then worked to spread the message more broadly over time. In the Noogler introduction to unit testing class, we introduced a slide that showed the balance represented as a pyramid, with small tests at the bottom, medium in the middle, and large at the top. To one side of the pyramid, there was an arrow pointing down towards the small test layer, indicating that “confidence in a specific change” increases given a large battery of small tests with good code coverage. On the other side, an arrow pointing up towards the large test layer indicated that “confidence in the entire system” increases given a small but inclusive battery of end-to-end tests. Codelabs (hands-on training materials that accompanied the lab associated with the class) were also updated to introduce the terminology. As mentioned in previous posts, eventually Testing on the Toilet relied heavily on the SML concept as a motivation for its advice, and Test Certified relied heavily on it as a means of setting concrete, appropriate testing goals.

Sound and fury

Some engineers responded that our tools and test execution frameworks ought to be able to do the classification automatically for us; our response was that even if we had tools powerful enough to make such classifications on an ongoing basis, it wouldn’t absolve us of the responsibility for thinking through what we should be testing, and why. Dan Christian wrote the ratetests tool to aid with automatic test classification, requiring one-off manual executions directly on a developer workstation, but it could only tell you how “big” your test actually was (using a rough set of heuristics at that); it couldn’t tell you how big your test should be. That’s your job as a developer.

Other engineers (including one particular top-tier engineer, as I recall) argued that the assignment of test size should be a matter of taste, with execution time as the most likely determinant, as many had relied on it with the previous *_unittest and regression_test rules. If a test executes in under five seconds, why shouldn’t it be labeled “small”, even if it launches a handful of processes and depends on a dozen data files? This is a pragmatic view favored by many, and indeed, there’s no strict enforcement by the build tools themselves that “small” tests don’t spawn processes or touch the file system. (They do now, however, enforce proper and explicit declaration of all dependencies.) But our real goal was to get engineers to think about what their test should do first, and how long it takes to execute second. Only when they’ve got a good handle on the first issue should they be particularly concerned about the second. No matter if a test runs in a millisecond, if it interacts with multiple other processes or the file system, it’s still of a fundamentally different nature than a test that exercises pure algorithms or business logic.
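The difference between the two positions can be shown with a toy classifier. Everything here is an assumption made for illustration: the `Observed` fields, the runtime cutoffs (one minute and fifteen minutes), and the rule that runtime can only push a label up. It is not how ratetests or the build tools worked.

```python
from dataclasses import dataclass

ORDER = ["small", "medium", "large"]


@dataclass(frozen=True)
class Observed:
    runtime_seconds: float
    spawns_processes: bool = False
    touches_file_system: bool = False
    uses_datacenter_services: bool = False


def size_by_runtime(obs):
    """The "matter of taste" view: execution time alone decides."""
    if obs.runtime_seconds <= 60:        # hypothetical cutoff
        return "small"
    if obs.runtime_seconds <= 15 * 60:   # hypothetical cutoff
        return "medium"
    return "large"


def size_by_behavior(obs):
    """The view argued above: what the test touches decides."""
    if obs.uses_datacenter_services:
        return "large"
    if obs.spawns_processes or obs.touches_file_system:
        return "medium"
    return "small"


def size_label(obs):
    """Behavior sets the floor; runtime can only push the label up."""
    return max(size_by_behavior(obs), size_by_runtime(obs), key=ORDER.index)


# A millisecond-long test that reads files is still not "small".
quick_but_touchy = Observed(runtime_seconds=0.001, touches_file_system=True)
print(size_by_runtime(quick_but_touchy))  # small
print(size_label(quick_but_touchy))       # medium
```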

To hold as ’twere the mirror up to nature

All of this wordplay was designed to get people to observe what their tests were actually doing today (if they had them at all) and consider the possibility of what they could or should be doing. Explicitly added to the message was the idea that designing for testability was an essential ingredient in the development process. It was what would allow you to cut unnecessary dependencies, or avoid adding them in the first place. It would help you make existing tests faster, or write small, focused, fast, stable (i.e. non-flaky) tests to begin with. And wouldn’t you know it, the code would often just look better and make more sense in the long run, since by designing for testability, you’d have to increase cohesion and reduce coupling, making the whole ball of yarn easier to understand.

Of course, anybody can write bad FORTRAN in any language. Some of us (including me, I admit) could at times get so focused on testability above all else that it wound up adding more complexity rather than eliminating it. But that’s the beauty of an open development ecosystem based on compulsory, documented code reviews; folks could benefit from the feedback of their peers, immediately or in the future, who could point out the error of their ways. (Some folks get the same kick from pair programming; I just happened never to do much of it.)

Back to the point: In order for a team to achieve Test Certified Level 1, they needed to: label their tests according to SML; tag their flaky tests; set up a (Chris/Jay) continuous build system; register for a coverage “bundle” (managed by the Unit Test Framework, a precursor to TAP) which could identify the amount of coverage per test size; and set up a relatively fast-running, high-coverage suite of “smoke” tests. The idea behind the “smoke” tests in particular was to have folks think about which tests gave their team the most value for the time they took to run—the expectation being that this would mostly consist of “small” and possibly “medium” tests—and to have everyone run those more frequently, both before submitting changes and as the first phase of the team’s continuous build (the Chris/Jay system would run a set of “stable” tests before the remaining “golden” tests). That way, very slow or expensive tests wouldn’t drag out the feedback loop out of proportion to their value, increasing the perceived utility of the tests in the “smoke” set while forcing the team to come to terms with which tests fell into each category. This was the mirror, the measurement whereby engineers could reflect on what they hath wrought.
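One way to think about “most value for the time they took to run” is as a budgeted selection problem. The sketch below is my own illustration, not the procedure Test Certified prescribed: it greedily picks the test that adds the most not-yet-covered lines per second until a time budget runs out. The function name, the test names, and the coverage sets are all hypothetical.

```python
def pick_smoke_tests(tests, budget_seconds):
    """Greedily choose tests by new coverage gained per second of runtime.

    `tests` maps a test name to (runtime_seconds, set_of_covered_lines);
    runtimes must be greater than zero.
    """
    candidates = dict(tests)
    chosen, covered, remaining = [], set(), budget_seconds
    while candidates:
        best_name, best_rate = None, 0.0
        for name, (runtime, lines) in candidates.items():
            if runtime > remaining:
                continue  # does not fit in what is left of the budget
            rate = len(lines - covered) / runtime
            if rate > best_rate:
                best_name, best_rate = name, rate
        if best_name is None:
            break  # nothing left that fits and adds coverage
        runtime, lines = candidates.pop(best_name)
        chosen.append(best_name)
        covered |= lines
        remaining -= runtime
    return chosen


tests = {
    "parser_small": (2.0, {1, 2, 3, 4}),
    "lexer_small": (1.0, {1, 2}),
    "io_medium": (30.0, {5, 6, 7}),
    "e2e_large": (600.0, {1, 2, 3, 4, 5, 6, 7, 8}),
}
print(pick_smoke_tests(tests, budget_seconds=60))
# ['parser_small', 'io_medium']
```

In this made-up suite the large test covers the most lines but never fits the budget, and the second small test adds nothing new, which matches the expectation that a smoke set is mostly small tests plus a few medium ones.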

To achieve Test Certified Level 2, teams had to achieve a target balance of test sizes, with some initial code coverage requirements pertaining to test sizes as well. This is where the change in design sense, the change in thinking was forced to take hold. As mentioned before, even if a tool could rate your tests according to size, it couldn’t help you achieve the appropriate balance of test sizes and coverage for your project. Ultimately human intelligence was required to decide whether there was enough small test coverage, whether the large test hit all the bases it needed to, what reasonable coverage levels were for each level of test given the application structure, how to cut dependencies and refactor code to improve test speed or stability, etc.

Must like a whore unpack my heart with words

Truth be told, despite SML’s success in providing a new foundation for meaningful discussion and goal-setting in combination with Testing on the Toilet and Test Certified, it was still a long, slow, uphill climb to convince people to improve their design and testing practices. The reason was the perfectly natural path of least resistance: Until the tools enforced the constraints necessary to support the concepts more reliably, and with acceptable speed, it was easier for most to sympathize with the new goals than to actually attempt to reach them. People claimed that they “didn’t have time to test”. The Test Mercenaries program was launched to provide engineers to teams that needed help climbing the TC ladder, in the hopes of expediting the TC adoption process. It helped us get more in-the-trenches experience and feedback to improve the messaging, design concepts, and the tools, but ultimately didn’t scale in and of itself.

However, by starting the conversation early and not letting up, when the tools finally did make the quantum leap towards the build system Google knows and loves today, people were ready. They knew what the right thing was, and suddenly they had the power to do it, and they were far more eager to get it done. The feedback loop shrank, and the perceived value increased. The Test Certified ladder exploded. And in the end, TAP was born—prioritizing small tests ahead of medium, medium ahead of large, and taking advantage of the hermetic distributed execution environment and tighter feedback loops to report breakages in all affected projects with alarming speed.

Without SML as a core part of the vocabulary, forcing discussion of testing concepts using fresh new terms and sidestepping both confusion and debate over the old, none of the other Testing Grouplet programs likely would have had the impact they had, at least not as quickly, even after the introduction of the super new tools. Just as Testing on the Toilet broke down geographical barriers and Test Certified broke down psychological barriers, SML broke down language and conceptual barriers.