This is the third of five1 posts in my “whaling” series about the high-level conceptual and cultural challenges the Testing Grouplet and its allies faced, and the knowledge and tools that eventually spread throughout Google engineering that removed the infamous “I don’t have time to test” excuse. This post covers the basics of how automated tests should—and should not—be written.
The first post in this series focused on high-level cultural challenges to the adoption of automated developer testing that emerged as a result of day-to-day development reality. The second post in this series focused on the fundamental object-oriented programming issues which formed the core of most of Google’s testability challenges—and solutions. The next post will describe the collection of processes employed by Google for ensuring software quality—including, but not limited to, automated testing. The final post will discuss the specific tools the Testing Grouplet, Testing Tech, Build Tools and others developed to improve testing efficiency and effectiveness.
Bits of this post and the next two will incorporate comments I made on a Google+ post by Isaac Truett soliciting advice on improving code quality. Thanks to Isaac and Rob Konigsberg for engaging me in that conversation.
Fact-checks and embellishments from Googlers present and past welcome via email or Google+ comments, as always.
The concept of “quality” is core to Zen and the Art of Motorcycle Maintenance. Wrestling with this concept literally drives the protagonist insane. (Sorry for the spoiler for those intending to read it, but haven’t as of yet.) “Quality” is the kind of, um, quality that people inherently value and strive to produce, yet struggle to definitively measure or reliably produce using a specific methodology.
With that qualification, here’s my working definition of “code quality”: The ability to easily read, understand, and modify code with speed, focus, and confidence. It has nothing to do with the quality of the application that the code represents; history is full of awesome apps built from codebases that are an ungodly horror to lay eyes upon, and I’m sure there are just as many beautifully-written failures. But a higher degree of code quality, over time, tends to precipitate higher-quality applications and services, especially as a code base, team, and/or company grows.
Automated testing is a practice that strongly correlates with code quality. Beyond a certain scale of team size and software complexity, automated testing becomes necessary to ensure a high level of code quality in the long term, though it is certainly not sufficient. At the lowest levels of software development, automated testing helps shake out silly bugs; at higher levels it helps us reason through and verify interfaces and contracts, encouraging high logical cohesion of related concepts (i.e. classes, subsystems, and services with clear, though limited responsibilities) and loose coupling between discrete software components (i.e. as few dependencies between classes, subsystems, and services as possible); and when a healthy battery of tests is in place with adequate tool support to produce and run them quickly and easily, it becomes infinitely easier to add, remove, and fix features with focus and confidence rather than abject terror.
Hence, the conviction of the Testing Grouplet—plus Testing Technology, Build Tools, Test Engineering, and Engineering Productivity—to add automated testing to every Google engineer’s toolbox, in addition to the already-established and invaluable practice of code reviews, and in addition to whatever (increasingly scarce) manual testing resources were available.
How to Write Tests
On top of designing for testability, discussed in the previous post, there’s also the matter of designing one’s tests. A lot of the same concepts apply, in terms of separating responsibilities into small, coherent components, to enhance readability and ease future maintenance and the development of new features. Automated tests are as diverse as the features they validate in terms of their requirements, dependencies, and structure, but again, there are a few general guidelines that tend to lead to better tests regardless of the type of code under test.
Those of us in the Testing Grouplet had to teach people how to write tests, in addition to teaching them how to make their code testable in the first place. The testing guide for the Google Web Server team has lots of good advice; not surprising, since Bharat Mediratta, the Testing Grouplet founder, was-and-is GWS’s tech lead. Several bits from that guide stuck out and guided a lot of my own test development, and are among the key testing guidelines listed below. Bear in mind that this is not an exhaustive set of guidelines, but certainly some of the most important.
Use abstract interfaces and dependency injection to replace heavyweight or uncontrollable production dependencies with lightweight and controllable fake/mock/stub/dummy implementations in tests: No surprises here, as I explained in the previous post. Smaller, more focused tests with fewer dependencies and faster execution times == Win! For a more precise differentiation between the terms “fake”, “mock”, “stub”, and “dummy”, see the Testing on the Toilet article on stubs. In my final “whaling” post, I’ll talk more about tools and frameworks to make dependency injection easier to apply in both production code and in tests.
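For a tiny concrete illustration, here’s a sketch in Python with hypothetical names: a `DocumentProcessor` that receives its storage dependency through its constructor, so a test can swap in a lightweight in-memory fake instead of a heavyweight real client:

```python
class DocumentStore:
    """Abstract storage interface (hypothetical)."""
    def write(self, key, value):
        raise NotImplementedError

class FakeDocumentStore(DocumentStore):
    """Lightweight, controllable in-memory fake for small tests."""
    def __init__(self):
        self.writes = {}
    def write(self, key, value):
        self.writes[key] = value

class DocumentProcessor:
    """Production class: the store is injected, never instantiated inside."""
    def __init__(self, store):
        self._store = store
    def process(self, doc_id, text):
        # Trivial stand-in for real processing logic.
        self._store.write(doc_id, text.strip().lower())

# A test injects the fake and inspects its state directly:
store = FakeDocumentStore()
DocumentProcessor(store).process("doc1", "  Hello  ")
print(store.writes["doc1"])  # → hello
```

Production code would inject a real Bigtable-backed implementation instead; the test never touches the heavyweight dependency.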
Write appropriately-sized tests: My earlier post on the Small/Medium/Large test size schema explains the different sizes of tests, i.e. the different scopes and functions of tests, and their proper application. The Small/Medium/Large model helps not only when thinking about what tests to write, but also when thinking about the structure of the production code with regard to how the different levels of abstraction are built up and designed.
Use small, focused, clearly named test cases—avoid “and” cases/names: Good test functions will focus on a particular behavior or result based on a particular context, rather than running through several results or situations all at once. Often, though not always, if the name of a test function contains the word “and”, either the test or the code under test is doing too much. For example, consider a test case named ParseDocumentAndWriteToBigtable. What can you infer from a test result that says “ParseDocumentAndWriteToBigtable: FAILED”? Did the parser fail for some reason? The write to the Bigtable? Both? And if only the parser failed, did the write to Bigtable still happen?
These questions may not be difficult to investigate and answer, but imagine having the following set of functions instead:
ParseDocument
ParseDocumentErrorBadEncoding
ParseDocumentErrorBadUrl
ParseDocumentErrorTitleMissing
WriteParsedDocumentToBigtable
DocumentWithParseErrorNotWrittenToBigtable

Now imagine ParseDocumentErrorBadUrl fails, and the rest pass. It’s clear that the URL-parsing code is broken somehow, but the rest of the parsing and error detection works—and unsuccessfully-parsed documents are still not written to the Bigtable. So without even looking at any test code, you already know that what you just changed had an impact on URL parsing, and nothing more. For a little more typing up-front, that’s tons of Time and Money and flow saved over the long run.
Use fixture and/or helper classes to provide common data and setup, with clearly-named helper functions: All modern testing frameworks allow you to define a test “fixture” class, which contains common initialization, setup, and utility behavior, and which is associated with a particular set of test functions/methods to make writing those test functions easier. A new test fixture is created for each test function, and is destroyed when the test function completes, ensuring proper setup for each test and that each test function is isolated from all others. Such classes can help make tests far more readable by reducing redundant details to just one or a very few lines of code, such as (in C++, assuming Google Test style):
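Here’s a sketch of the idea, rendered in Python’s unittest rather than C++ and using hypothetical names; `setUp()` plays the role of the fixture’s constructor, and a clearly-named helper keeps each test to a few readable lines:

```python
import unittest

def parse_document(text):
    """Hypothetical function under test: returns parsed fields."""
    title = text.partition("\n")[0]
    return {"title": title, "ok": bool(title)}

class DocumentParserTest(unittest.TestCase):
    # The fixture is rebuilt for every test method, keeping tests isolated.
    def setUp(self):
        self.default_text = "A Title\nbody text"

    # Clearly-named helper: one readable line of setup per test.
    def parse_default_document(self, text=None):
        return parse_document(self.default_text if text is None else text)

    def test_parse_document(self):
        document = self.parse_default_document()
        self.assertTrue(document["ok"])
        self.assertEqual("A Title", document["title"])

    def test_parse_document_error_title_missing(self):
        document = self.parse_default_document("\nbody only")
        self.assertFalse(document["ok"])

suite = unittest.defaultTestLoader.loadTestsFromTestCase(DocumentParserTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # → True
```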
Of course, there’s nothing that requires that helper code exist only in the fixture class for a specific test. You’re free to write a completely independent helper class, which can then be reused between separate, unrelated tests. For complex systems, these classes can be invaluable in helping to write readable, thorough tests with ease.
Tests should be totally isolated from one another and produce the same results regardless of the order in which they are executed: Tests that produce different results based on the order in which they are run imply a dependency between test cases, making the context of the test cases a bit ambiguous when the context needs to be perfectly clear to be useful. As mentioned above, modern automated testing frameworks use test fixture classes to ensure that each test case is set up the same way before execution, and to ensure that the results of one test case are not influencing another.
Repetition is OK: When writing production code, the Don’t Repeat Yourself (DRY) principle dictates that there should always be exactly one place in a system where a particular piece of data or a logical concept is represented. However, when it comes to writing test code, a lot of carefully-crafted repetition can actually enhance readability. Test functions benefit from providing as much context as possible to understand what the test is doing from looking only at the code for the test function itself. If that means that almost every test has a call to SetupDocument() with the exact same document text, and has many of the same values for many of the same assertions, so be it. When the test breaks for somebody else, or for yourself six months in the future, your colleagues will thank you, or you’ll thank yourself, for making the test a bit more self-contained so that understanding the failure and fixing the problem doesn’t also involve grasping the full complexity of the test fixture as well, or double-checking to make sure there were no other side-effects.
There’s no hard and fast rule as to what setup should be explicitly done in each test function, as opposed to implicitly handled by the constructor for the test fixture created for each test. But there’s no hard and fast rule that says all setup should be done one way or the other, either. It’s up to the author of the test to strike the right balance for the sake of readability, as well as easy comprehension and diagnosis of test failures.
As an example, imagine this test function (using the same DocumentParserTest fixture as above) fails three years after it was written:
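A Python stand-in for such a self-contained test might look like this (the parser, the canonicalization contract, and all names are hypothetical):

```python
import unittest

def canonicalize_url(url):
    """Hypothetical stand-in for the teammate-owned canonicalization code.
    The original contract: on error, copy the input URL through unchanged."""
    if not url.startswith("http://"):
        return url  # error case: echo the original URL
    return url.rstrip("/") + "/"

def parse_document(text, url):
    """Hypothetical parser that stores the canonicalized URL on the result."""
    return {"title": text.partition("\n")[0],
            "canonicalized_url": canonicalize_url(url)}

class DocumentParserTest(unittest.TestCase):
    def test_parse_document_error_bad_url(self):
        # All of the relevant setup lives right here in the test function...
        document = parse_document("A Title\nbody", "Ceci n'est pas une URL")
        self.assertEqual("A Title", document["title"])
        # ...so when this assertion fires, the changed contract is obvious.
        self.assertEqual("Ceci n'est pas une URL",
                         document["canonicalized_url"])

suite = unittest.defaultTestLoader.loadTestsFromTestCase(DocumentParserTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # → True
```

If the canonicalization team later decides to return an empty string on error, only the last assertion fails, and the diagnosis requires no digging through fixture code.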
The only assertion that fails is EXPECT_EQ("Ceci n'est pas une URL", document_.canonicalized_url()). It seems document_.canonicalized_url() is empty. Why? Ah, it’s because someone from the team that maintains the URL canonicalization code made a change to leave the result variable empty on error, rather than copying the original URL value regardless of the outcome.2 You can now either choose to discuss the issue with the other developer—or development team as the case may be—or choose to accept their change and update the test accordingly. Either way, the necessary communication and/or decisions take place almost immediately. You don’t have to waste time making sure you understood how the parse was set up and executed before diagnosing the problem, because the relevant setup and execution is completely contained in the test function itself. You don’t have to waste time checking for other side-effects, because the remaining assertions didn’t fire after the code or the test was updated.3
Avoid data-driven tests: A controversial topic, this refers to tests that use a single test function to run through a table of data representing different test cases, as opposed to having a separate test function for each test case. Data-driven tests can be problematic in that if the format of the test data changes, all of the test data needs to be updated, not just the data relevant to an individual test case—making the test data structures difficult to understand and prone to compile failures, creating an excess maintenance burden. The expected results are no longer explicitly specified in each assertion, adding to the mental overhead of diagnosing failures—time and energy that could be spent on fixing the problem instead. Also, while each element of test data can include a description of the test case it exercises, testing frameworks are better suited to reporting failures in a standard format based on the name of each test function/test case.
My suspicion is that writing data-driven tests is motivated by a desire to adhere to the Don’t Repeat Yourself principle, where otherwise identical (or nearly-identical) test code is condensed into a single function. As mentioned above, repetition can actually help comprehension of a test failure by providing sufficient context both from the name of the test and by the code in the test itself. Grasping the relevant context in a data-driven test is not as straightforward.
Imagine the tests above were written as:
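In Python, with hypothetical structure and field names, the data-driven version might look something like this:

```python
import unittest
from collections import namedtuple

def parse_document(text):
    """Hypothetical parser: returns (ok, num_errors)."""
    ok = text.startswith("<doc>")
    return ok, 0 if ok else 1

# One row per test case; if the format changes, every row must change too.
TestData = namedtuple("TestData", ["test_case", "text", "ok", "num_errors"])

data = [
    TestData("parse document", "<doc>body", True, 0),
    TestData("error: bad encoding", "\xffgarbage", False, 1),
    TestData("error: title missing", "<notdoc>", False, 1),
]

class DocumentParserTest(unittest.TestCase):
    def test_parse_document(self):
        for i in range(len(data)):
            ok, num_errors = parse_document(data[i].text)
            # The author must remember to thread test_case into every message.
            self.assertEqual(data[i].ok, ok, data[i].test_case)
            self.assertEqual(data[i].num_errors, num_errors, data[i].test_case)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(DocumentParserTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # → True
```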
It may not look too bad now, and seems to take less code, which is considered a Good Thing—but imagine deciding later to make the title and URL part of the TestData structure, needed by tests exercising the relevant code. You would have to change every member of data, even those that represent test cases that don’t focus on the title and URL at all. In the earlier example, you just add the new setup values to SetupDocument() and override what you need after calling it in each test. The differences in setup between test cases are easier to see that way.
In this data-driven example, you have to remember what false really means, in addition to -1, and what they mean for each member of data. The values used in each assertion are not made explicit in the assertions themselves, and the author must take care to provide data[i].test_case in the error output. Sure, the function name and expected values are reported in the output, but remembering them becomes cognitive overhead that the programmer diagnosing the failure must maintain, and every little bit of overhead adds just that much friction to thinking through the problem.4 In the earlier example, each of these expected result values appears in context, in the same assertion that exercises the relevant code or checks the relevant result. The test framework handles reporting the name of the test case that failed based on the name of the test function in a structured, standard fashion.
And above all, remember that this is a much-simplified example: There are a lot of aspects to document parsing, and a lot of failure modes, that are not contained in this example. How would the TestData structure look in that case? How would the data table look? How many EXPECT_EQ statements would be lumped together? How easy would it be to diagnose a failure, or add a test for a new behavior?
Avoid golden file tests: As explained in the previous post, so-called “golden files” are logs of output from critical points in a program’s logic that can be compared between test runs. They are usually noisy, brittle, and unisolated, and shouldn’t be relied upon as a primary means of automated testing. That said, some teams may find value in maintaining one or two golden file tests, such as comparing large volumes of processed output in a large test, or using a catch-all example file to identify when individual features interfere with one another, prompting the creation of a more focused test to recreate the bug and act as a regression test.
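The mechanism itself is simple; a minimal Python sketch (hypothetical pipeline and file names) that diffs current output against a checked-in golden log:

```python
import difflib, os, tempfile

def run_pipeline(lines):
    """Hypothetical program under test: emits a log of processed output."""
    return [line.strip().lower() for line in lines]

# In a real repository the golden file would be checked in beside the test.
golden_path = os.path.join(tempfile.mkdtemp(), "pipeline_golden.txt")
with open(golden_path, "w") as f:
    f.write("alpha\nbeta\n")

actual = "\n".join(run_pipeline(["  Alpha ", " BETA "])) + "\n"
with open(golden_path) as f:
    expected = f.read()

# Any behavior change anywhere in the pipeline shows up in this diff,
# which is both the strength and the brittleness of golden file tests.
diff = list(difflib.unified_diff(expected.splitlines(), actual.splitlines(),
                                 "golden", "actual", lineterm=""))
print("PASS" if not diff else "\n".join(diff))  # → PASS
```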
Here are just a few small, personal examples of upholding and intentionally violating the above guidelines to which I can point handily; I don’t actually include any examples of code using mock objects. I’m sure there are far superior examples out there, and I welcome any pointers to such.
Avoiding data-driven tests: When Tony Aiuto began using and contributing to Pyfakefs, he wrote some tests that exercised every fake system call and compared it to the real one. Each test function exercised one system call by running through a table of files that would be created by both the fake file system and the real file system. In other words, each test for each system call was data-driven.
In the code review, I challenged him to reverse the structure: Each test case should represent a type of file system object and directory structure, and there should be a function that runs all of the system calls and compares their results for a given file. The question is no longer “What kinds of files (input data) break a specific system call?”, but “What system calls (behaviors) are broken for a specific type of file?” The test ended up being much more readable and useful, and you can see it here: fake_filesystem_vs_real_test.py. Yes, the fixture implements some complex assertions, but the test cases themselves are straightforward to understand, and the errors reported by the assertions are lovely and very helpful.
Resorting to data-driven(-ish) tests: The Go FAQ section on testing suggests that “If the amount of extra code required to write good errors seems repetitive and overwhelming, the test might work better if table-driven, iterating over a list of inputs and outputs defined in a data structure.” Despite my love for Go, this does make me shudder a little, as does the example cited in the same paragraph, fmt_test.go. However, I do see the point; if you have a pure function with a few well-defined inputs and outputs that are unlikely to ever change, it can seem like overkill to wrap each one in its own test function.
But, I propose as an alternative the style I adopted in my recent, modest algorithm_test.go. The tests are somewhat data-driven, but rather than running an array of literals through a single function, I wrote a handful of helper functions (check*()) which document the intent of the individual test case in-place. Note that these helpers make use of my own FileAndLine() so that the exact line of a failure is reported—the exact same line where a piece of test data is defined. For slobs like me who like to scan for line numbers in the error output and jump there immediately to see the context of the failure, this is much more helpful than having to search through an array of test data for the offender.
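The same trick can be approximated in other languages; here’s a hypothetical Python stand-in for a check*()-style helper that uses the traceback module to report the caller’s line, i.e. the line where the test datum lives:

```python
import traceback

failures = []

def check_square(value, want):
    """Hypothetical check*()-style helper: the assertion's intent is
    documented in-place, and a failure reports the caller's line,
    i.e. the line where the test datum is defined."""
    got = value * value
    if got != want:
        caller = traceback.extract_stack(limit=2)[0]
        failures.append("%s:%d: square(%r) = %r, want %r"
                        % (caller.filename, caller.lineno, value, got, want))

# Each call below is both the test datum and the reported failure location.
check_square(2, 4)
check_square(3, 9)
print("PASS" if not failures else "\n".join(failures))  # → PASS
```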
Resorting to golden files: On my last team in websearch, we had a ton of small and medium tests that used Google Mock, but relied on a pair of golden file tests that ran over a large set of documents to detect changes in indexing signals made by teams we relied upon but did not coordinate with closely. The tests broke with some frequency, maybe every day or two, but they weren’t flaky, and were fixed quickly after communicating with the other teams to decide on the impact the signal changes had on our system.
On a much smaller scale, see TestGoldenFile in my updaters_test.go. There are plenty of more focused tests in there, but having a single large example that can help guarantee that all the features together continue to work is a lot less annoying than writing a combinatorial number of smaller test cases. This test actually helped me find a bug in my update-post program, for which I wrote the much smaller Test FootnoteUpdater Extra Paragraphs In An Existing Footnote Are Preserved5 after the fact.
My old partner-in-crime David Plass and I had this exchange recently, regarding our small extracurricular projects. From David:
This past week I was on vacation and worked on some non-Google code (I have copyright ownership of it. Woo hoo!) It’s a web app that [DOES STUFF THAT HAS BEEN REDACTED TO PROTECT THE INNOCENT AND THEIR VICTIMS]. It’s Python/App Engine (flames to
I wrote it from the beginning with the logical separation of presentation from logic/storage in mind, and it worked out pretty well… but I still wasn’t writing tests. Until I started having bugs. So I started writing tests of the “logic” layer – because after all, that’s why I separated it in the first place! Now I’m addicted. I rarely make a change to the logic module without writing a test (not always first, but at least at the same time.) And I’m addicted to running the tests – even after I don’t know if I made a change.
I’ve also instituted some (mental) rules:
* run tests before commit (it’s svn; flames to
* write tests for new “logic” code
* as time allows, every commit should include at least one new test, even if it’s not related to the commit (cleanup)
The “logic” code also includes an authorization layer, and testing this by hand (i.e., in browsers) is very tedious. Tests to the rescue!
And my reply:
See, that sounds very similar to my own process… That Go program I wrote, at first I was just running an entire test file through it, just like the Ruby version. Then I locked down specific features with tests.
Then I added a feature, and wrote the tests first ’cause I knew exactly how I wanted it.
Then I thought about the implementation a little, and ended up making a new change—refactoring—without any new tests.
Then I realized it had a bug, wrote a test to demonstrate it, and then got it to pass.
Then I started running the big golden test file through, and realized I had a pretty serious bug, and wrote a failing test reproducing it that I finally got to pass.
Then I rolled up my sleeves and put in the golden file test, for good measure.
Not earth-shattering, not rocket science, and neither of us are talking about Google code. But, given that David and I both personally did a lot to influence Google engineering culture towards widespread adoption of automated developer testing, I think it may be interesting to note that neither of us are particularly dogmatic in our approach, but we still find value in the practice in our personal lives. In fact, some may accuse us of being practical: Once we had a good lock on our core features and interfaces, then we used automated testing to ensure new stuff didn’t break old stuff. At no point were we slowed down; in fact, applying automated testing at the right time accelerated our development once we knew what we wanted to do.
How Not to Write Tests
Thanks to Andrew Trenk and, as mentioned in my previous post, Dhanji Prasanna, I’m well-reminded to explain the flip side, or how not to write a test. Despite the best of intentions, it is possible to test not wisely but too well.
Watch out for warning signs: Andrew reminded me of the old Test Mercenaries’ Code Reviewers’ Guide, publicly available thanks to Miško Hevery. Miško and friends even had these printed and laminated to hand out to folks, to leave out on people’s desks where they’d be handy and possibly prompt conversation. It’s more about how not to write the production code itself than how not to write tests, but it still falls under the “How Not To…” rubric. It gives good concrete advice based on the principles I outlined in Object-Oriented Programming Revisited. Good stuff there; check it out.
Don’t overdesign for testability: Yes, it’s possible to overcomplicate a design for the sake of testability, producing interfaces so narrow and implementations so tiny as to be absurd. This is usually the case with code that directly manipulates external resources (as opposed to pure algorithms or business logic); medium and large tests may be the ticket for exercising such code.
Pay attention to coverage, but don’t artificially boost it: Don’t worry about reaching a coverage goal for its own sake; use it as a signal for overall testing hygiene, but don’t go through contortions that bump up coverage if it makes the code more difficult to reason about.
Think before testing: If you don’t have the necessary background knowledge to think through a problem absent testing, testing won’t necessarily help you write a correct piece of code. Tests are a tool to clarify and verify thinking, not replace it. That was the takeaway from the Norvig vs. Jeffries Sudoku showdown.
Don’t overuse/abuse mock objects: This is related to all of the above. Andrew Trenk started a conversation with me thus:
One problem I’m still seeing at Google is that although people are writing tests, very often the tests are unreadable, unmaintainable, and don’t test the right things. I think the biggest culprit is improper usage of mock objects, so I’m trying to write up a guide on how to use mock objects properly. Do you have any thoughts on this?
Such problems might have something to do with not having stable, coherent interfaces and contracts, and possibly even established programming idioms within particular domains—as opposed to, for example, websearch, where there are many well-known idioms for interacting with remote services and for event-driven processing. It may require prototyping and debating designs with teammates, thinking through the problem up-front, rather than slapping together a class and trying to test it first.
In my day, people weren’t writing smaller classes and using dependency injection so much as now; it’s possible some folks are taking it past the point of diminishing returns, separating details and responsibilities that really shouldn’t be separated. I’m mildly surprised when I hear people say that they end up stuck maintaining a bunch of mocks along with the production code; I’m hard pressed to recall using a specific mock in more than one test. Fakes or lightweight implementations, yes, but not mocks created using a framework. My suspicion would be that the class being mocked should be used to compose just one other class—meaning only one test and one mock to maintain when the class changes—and that it’s getting passed around directly to too many places. Or, maybe, the class shouldn’t be mocked at all.
However, I’d like to see specific examples in context before I could make solid assertions out of these hypotheses. It’s too easy to play armchair quarterback without concrete examples.
Yes, I think this is a big part of the problem. One pattern I see is that people use dependency injection as a way to mock out all of a class’s dependencies. And by “mock” I don’t just mean a reusable fake, I mean an object created by a mocking framework directly in the class. So you end up with a class that’s not really testing anything useful since all its dependencies are mocked out, and is hard to maintain since it’s hardcoding implementation details in its mocks (so when you make a change to an implementation detail you have to go around to lots of tests updating each mock). Then there’s also the issue of tests being harder to understand due to all the extra logic you need to set up your mocks. And a lack of understanding between state testing and behavior testing makes things even worse!
Hand-rolled “mocks” like that aren’t mocks, really. If an implementation detail change breaks a mock’s behavior, it sounds like the right behaviors haven’t been factored together—rather than factored out—behind a relatively stable interface that shields calling code from unnecessary detail. (See the Law of Demeter.) And it shouldn’t take that much setup to set up a mock. That ought to be a sign that You’re Doing It Wrong.
State vs. behavior testing—very nice way to put it! And very true. (And thanks to Andrew for pointing me to Martin Fowler’s article, with which I was unfamiliar. I’ve never read that much of Fowler’s stuff, though I’m sure he’s influenced my thinking without my awareness anyway.)
I think part of the problem is that many people think of unit testing as testing only a single class at a time (I thought that too for a while), but I think what it really means is test the smallest interface possible. So if you have a class that talks to a database, you can’t just mock out the database in the name of writing a small test, you either need to start up a local database, or have a tested fake to use in its place. If you have business logic in the same class that talks to a database, you would need to refactor the code to make the business logic independent of the database code, so you can then test the business logic without talking to the database.
Here Andrew describes the perfect case for medium tests: If you’re testing the code that talks to the database, you’re gonna need to run the database—or a well-maintained stand-in, like a “Mock” Bigtable or the Google-internal remote service mocker framework—to test that code. It’s not turtles all the way down.
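A sketch of that refactoring in Python (hypothetical names): the business logic becomes a pure function with a small test, while the thin storage layer gets a medium test against a real database or a tested fake:

```python
def apply_discount(price_cents, loyalty_years):
    """Pure business logic: small-testable with no database anywhere."""
    discount_percent = min(loyalty_years, 10)  # 1% per year, capped at 10%
    return price_cents * (100 - discount_percent) // 100

class PriceStore:
    """Thin storage layer: exercise this in a medium test against a real
    database or a tested fake, rather than mocking the database away."""
    def __init__(self, db):
        self._db = db
    def discounted_price(self, sku, loyalty_years):
        return apply_discount(self._db.get_price(sku), loyalty_years)

# The small test needs only the pure function:
print(apply_discount(1000, 5))   # → 950
print(apply_discount(1000, 25))  # → 900
```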
Again from Andrew Trenk:
I think that whole 70/20/10 thing was part of the problem, since it made it seem like a medium test was something you should only write occasionally, and only as sort of an integration test to make sure all the classes that you’ve tested with small tests are working together (I remember you didn’t like the 70/20/10 idea, maybe this is why).
I did like 70/20/10, but I hated fighting about whether it was a strict rule or a guideline. I always asserted that people should think about what made sense as the right balance for their project, and 70/20/10 was a starting point based on intuition—a back-of-the-envelope figure reflecting the more-tests-for-lower-abstraction-levels, fewer-tests-for-higher-abstraction-levels idea we were promoting to get engineers thinking of writing more smaller and fewer larger tests. I guess we didn’t make it very clear, in part because some people fought hard for a literal interpretation of the figure.
That, and even Googlers are often susceptible to over-enthusiastic under-thinking when they finally become convinced of something. Oh, the peril of promoting a public policy: Such a danger that people will agree with you, and act on that agreement, without actually thinking about what you’re suggesting, with all its qualifications, consequences, and shades of subtlety!6
1Yes, of course you knew it would be five by now
2URL canonicalization involves collapsing a web page’s original URL into a “canonical” form to allow for consistency in indexing. For instance, more than one URL can reach the Google homepage, but searching for “google” produces a result containing http://www.google.com/. There are pros and cons to either clearing a result parameter on error or to leaving it alone; I can’t remember what the actual Google canonicalization code does anymore. As an example for use in this discussion, though, the important part isn’t what the URL canonicalization error contract should be, but that small, focused, easily-understood tests are in place to alert the relevant developers to a new mismatch in expectations, preventing bugs from remaining in the code for too long.
3A nice feature of Google Test is that EXPECT assertions do not cause the entire test function to stop right away, as opposed to ASSERT assertions, so one can gain more information from a single run of the test in case of an assertion failure. I’m not aware of other test frameworks in other languages that allow this—but I’ve been semi-retired for a bit, so perhaps there are by now.
4On a related note, please stop using bool values as function parameters. Seriously, please just stop, unless your language supports keyword-style parameter specification—in which case, please always use the keyword-style for boolean parameters. Bogdan Stanescu and I had a minor debate once about boolean function parameters, in which he argued that it’s easy to look up what false refers to in the file where the function is defined. I have nothing but the utmost respect for Bogdan, but I heartily disagree with him on this subject. Though it takes a little extra typing to define in C++, an enum value with a meaningful name vastly improves clarity, readability, and maintainability with zero performance impact. As someone who comes along to understand a piece of code I didn’t write, or a piece of code I wrote months or years ago, I’d much rather deal with code containing a function call that looks something like:

ParseDocument(&document, kNewChecksumParser, kIgnoreEmbeddedContent, kLogErrorsToStdout)

than to see:

ParseDocument(&document, true, false, false)

and have to dive into other files to figure out what false means in one position vs. another, and then keep that in my head as I try to understand the larger context of what the code’s doing. Plus, as a bonus, a single enum can represent more options than just true or false, if you need that kind of flexibility.
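The same point holds beyond C++; in Python (hypothetical names), enum members make each argument self-describing at the call site:

```python
from enum import Enum

class ChecksumParser(Enum):
    NEW = 1
    LEGACY = 2

class EmbeddedContent(Enum):
    PARSE = 1
    IGNORE = 2

class ErrorLog(Enum):
    STDOUT = 1
    SILENT = 2

def parse_document(document, parser, embedded_content, error_log):
    """Hypothetical signature: every option reads meaningfully at the call
    site, and each enum could grow a third value without a redesign."""
    return (parser is ChecksumParser.NEW
            and embedded_content is EmbeddedContent.IGNORE
            and error_log is ErrorLog.STDOUT)

# The call explains itself, with no other file to consult:
print(parse_document({}, ChecksumParser.NEW,
                     EmbeddedContent.IGNORE, ErrorLog.STDOUT))  # → True
```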
5The actual name in the code has no spaces, but it looked ridiculous in this post given the formatting.
6At the end of the TAP Fixit, there was an engineer in Krakow giving the TAP Fixit volunteer (and Statistician) there, Michał Kaczmarek, grief because his project’s build had fundamental incompatibilities with TAP, and was now screwed because he got rid of his Chris/Jay continuous build system because the TAP team gave him and everybody else “really bad advice”. Michał tried to appease this guy, but it bubbled up to me at some point, and I pointed out in no uncertain terms, though in a clear tone, that the TAP Fixit folks had bent over backward to saturate the environment for months regarding TAP’s capabilities and what warranted holding onto a C/J for now. Plus, he always could’ve asked Michał or someone else on the tap-fixit mailing list. That seemed to shut him up.