Mike Bland

Instigator

Call me Ishmael

The high-level cultural challenges to the adoption of automated developer testing at Google which the Testing Grouplet worked to overcome

- Barcelona
Tags: Build Tools, Fixits, Google, grouplets, TAP, technical, Test Certified, Testing Grouplet, Testing Tech, TotT, whaling

And now, I’ma get Moby-Dick on your ass. By that I mean, it’s whaling time.

And by that I mean that, to quote Wikipedia, Moby-Dick contains large sections—most of them narrated by Ishmael—that seemingly have nothing to do with the plot but describe aspects of the whaling business. Up to this point in my recounting of Google Grouplet history, I’ve highlighted various movements and events that shifted Google engineering towards a culture of automated developer testing. I’m starting to write a small set of posts about the Fixits I organized or was otherwise involved with, which propelled the mission forward by leaps and bounds each time, and I found myself compulsively footnoting this and that detail to explain why we were trying to compel engineers to perform certain tasks. All the footnoting seemed to get a bit absurd, so now I’ve set myself the task of outlining more of the conceptual framework those of us in the Testing Grouplet were operating within.

Plus, when I talk about changing the culture to adopt automated developer testing, I don’t think I’ve filled in the blanks regarding exactly what that looked like from the point of view of practical, day-to-day coding. An earlier post, Coding and Testing at Google, 2006 vs. 2011, gives some of that perspective, but stops short of diving into the source code files themselves to point out the kind of problems we saw—and solved.

First, in this post, I’ll point out the high-level, cultural challenges and the annoying details of life-in-the-trenches that helped produce them, and how Testing Grouplet-related efforts made a difference in overcoming each issue. In the next one, I’ll delve into object-oriented programming arcana to explain the classes (wink, wink, nudge, nudge) of problems we encountered in the code itself and worked to solve through spreading knowledge of design techniques supporting testability. After that, I’ll point out a few of the specific tools we cooked up that made automated testing easier and better.

Caveat: I don’t claim that I’ll provide an absolutely exhaustive view of the relevant conceptual environment here, and my grasp on it is limited to that of a C++ specialist, primarily, though I’ll try to give the Java side a nod as best I can. JavaScript people, in advance, I’m really, really sorry; I can say nothing with authority about your situation, other than that JavaScript testing poses its own challenges with which I never really got my fingers dirty.

Fact-checks and embellishments from Googlers present and past welcome via email, Google+ comments, or separate blog responses, as always.

"I don’t have time to test"
Complexity
Flakiness
Tools
Launching, Recognition, and Promotions
Legacy Code

"I don’t have time to test"

When it comes down to it, what is the reason every person on the planet chooses, at every moment, to pursue one activity over another? Time. To a person, we choose to spend our time based on what we perceive to be the potential return on the investment of that time. However, it was only after the Testing Grouplet got questions onto the yearly engineering survey probing the reasons people didn’t write automated tests that this fundamental principle of human nature became clear enough for us to act on it.

Sure, some people would respond along the lines of “tests are useless” or “my code is too hard to test” or “my manager won’t let me write tests” or whatever else it was we listed as potential responses. But far more often, rather than saying such negative things about testing directly, people said it was time that held them back more than anything. Maybe it was just a polite way of saying one of those other things, but the distinction is important: If you say you don’t test because you don’t have time, it implies that you probably would test if you did have the time.

So what does it mean to have the time to test? Basically, it means that one has the knowledge and tools necessary to write and run tests quickly enough to provide confidence in new changes without slowing down the development process as a whole. Of course writing and executing tests will take some time no matter what, time that could otherwise be spent writing fresh new lines of code. But if a body of tests, over time, enables one to write fresh new code without an all-consuming fear of breaking the existing code, and if such tests actually do fail—quickly—when new changes introduce genuine problems, one eventually begins to appreciate the fact that good automated testing discipline actually does go a long way towards keeping the development process moving forward at high speed. You as an individual developer can stay in the flow state for longer, and the team as a whole is productively adding and integrating features rather than fighting bugs.

There is the argument that, in the very early days of a project or startup company, too much focus on testing can harm creativity and development velocity. I actually buy that argument to an extent, but with the caveat that as a project moves from prototype to demo to production, at some point there will be a core of functionality that the project or company depends on, and if there’s no solid battery of tests to ensure its integrity, fear will begin slamming on the brakes, especially as new people are added to the project—people who lack the context of those who were present ever since the product’s conception. And suffice it to say, while Google had gone very far and been very successful with its super-smart engineers and culture of rigorous code review and coding standards, it was rocketing past that threshold of fear and friction by 2005, as I described in Coding and Testing at Google, 2006 vs. 2011.

Besides, Googlers at the time were largely ignorant of how to write automated tests in the first place. For the computer science graduates out there: How many of you had to write tests for your projects? How many of you had professors that talked about automated testing concepts, development strategies, and applicable tools? This may have changed within computer science curricula in recent years, but certainly at the time the Testing Grouplet was on its mission, the vast, vast majority of Googlers, to say nothing of programmers at large, were largely left to discover the practices and tools of automated testing on their own. These engineers might know a thing or two about algorithms and data structures and even distributed systems and machine learning—clearly smart and capable of acquiring and integrating advanced computing concepts into their daily work. But that didn’t change the fact that most of them had no exposure to automated testing and designing code for testability—and given Google’s level of success, many of them felt no motivation to seek exposure independently.

Again, to paraphrase Saul Alinsky from Rules For Radicals: If people believe they have the power to do the right thing, to change their situation for the better, they’ll do it. Otherwise they won’t. In this case, the lack of time was indicative of the lack of power to do the “right thing”—or relative ignorance of what the “right thing” was. Once the Testing Grouplet knew that the perceived lack of time was the primary motivation for engineers not to adopt automated testing, we worked out ways to give engineers the power—in terms of knowledge, tools, and, eventually, social leverage—to overcome the specific challenges that seemed to demand too much of their time in the environment of the day.

Complexity

Writing software is hard, in much the same way that mastery of any skill or language is hard. Getting a small program that runs only on your own computer to run correctly can sometimes prove a humbling challenge.

Now, for those with no such experience, imagine writing software that must communicate with other running programs, either on the same machine or somewhere on the other side of the world. Imagine that it needs to process and manage enormous amounts of data, while responding to thousands of individual requests for services/responses at the same time. Imagine that there are several different requests or “events” that the program is managing simultaneously, either by queuing them up to handle one request while another is waiting, or by parts of the same program running on multiple processors at the same time, or a combination of the two approaches. Imagine hundreds of thousands of copies of the same program running on different machines, in different datacenters, communicating and coordinating with thousands of copies of other programs running on other machines in the same or other datacenters. Imagine making sure that, with thousands of programs running on thousands of machines all over the world, processing millions, even billions of requests over a nearly-unimaginably large data store, every single human user who makes a request receives a correct, successful response within milliseconds—or that the request fails gracefully to ensure a positive user experience without data corruption or loss.

Now imagine getting all of that right, as new features and improvements are added to all of these products and the systems that support them all the time, as both new programmers are brought on-board and new users come online. Imagine you’re a programmer on such a system—especially a new programmer—and you have to change something. Or imagine being an experienced programmer on the team, and asking a new hire to make a change to production code.

Given that it’s hard enough to write software in the first place, trying to figure out how to test it seems to only add more weight to the challenge, especially if one has not been trained to hold one particular thought in mind as one goes about designing and implementing a piece of code: “How am I going to test this?” The Testing Grouplet wanted Google engineers to write more tests, but we didn’t realize at first that many engineers didn’t know how to do so, and even those who did had very rudimentary tools at their disposal that seemed to introduce more friction and resistance than they were worth.

As a necessary first step, even before introducing improved tools, we had to find a way to change the way engineers thought about automated software testing and the necessary design considerations, from the interfaces of functions and classes, to the contracts and boundaries between separate libraries and subsystems, to the architecture of each product application and its operating environment. To that end, we introduced the Small, Medium, Large terminology as a starting point to generate awareness of the different scopes of testing as applied to different scopes of software design. Automated testing was not a one-size-fits-all affair, and had specific implications for every level of software development, and Small/Medium/Large was a push to drive that into everyone’s consciousness.

Small/Medium/Large also became a core concept in the Testing Grouplet’s Test Certified program, which was designed to ease the burden of teams wishing to improve their testing by giving them a set of discrete tasks to set them on the right path. When some teams who wished to adopt Test Certified practices found their challenges too great to overcome given their current level of expertise, the Test Mercenaries were formed to provide hands-on help. Testing on the Toilet, of course, helped to popularize these programs, as well as the techniques and tools that emerged from them and from other corners of the company, to support improved testability.

Flakiness

A “flaky” test intermittently passes or fails given no change in controlled input data, implying at least one uncontrolled input somewhere in the test. The fix involves eliminating the dependency on the uncontrolled input via a change to the code under test, or a change to the test that brings the input under control (which may also require a corresponding code change, though not always). Flaky tests are particularly harmful in that their results are often not taken seriously, even when they begin to produce consistent failures that point to a real problem, like the boy who cried wolf.

Plus, the presence of just one flaky test may mask the results of failing non-flaky tests until someone finally notices. All it takes is one flaky test to desensitize an entire team to the passing-or-failing status of an entire test suite. The perceived value of automated testing in general plummets to zero, while the cost of identifying and fixing the source of legitimate failures grows ever larger.

Flakiness is often the consequence of trying to write a test for a complex piece of software, as described above, with inadequate design and testing knowledge or tools. Fixing test flakiness amounts to managing the complexity of the software under test. A big part of the Testing Grouplet’s challenge wasn’t just getting people to write tests for complex software in the first place, but helping to get existing flaky tests back into a stable, passing state. This was important not just for testing a particular piece of software, but also for making inroads towards reversing the perception that automated tests are of dubious value, and for developing patterns and strategies for managing software complexity across projects of similar functions and architectures facing common testing challenges.

Consequently, the Testing Grouplet worked with Testing Technology and Build Tools to make it possible to mark known-flaky tests as part of the BUILD language; and the Chris/Jay Continuous Build system was designed with the concept of a “stable” test suite that would execute first, followed by a “golden” test suite that might contain flaky (or just very slow) tests, which mitigated the effects of flaky tests somewhat. One of the components of Test Certified Level One was marking known-flaky tests in some way, as a means of isolating them from stable tests in the short-term—enabling the stable tests to demonstrate their value absent interference from the flaky tests—and identifying them for repair or replacement in the longer-term. Knowing which tests are flaky is half the battle.
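That BUILD-language marking survives in Bazel, the open-source descendant of Google’s internal build system, as the `flaky` attribute on test rules. A sketch of what such entries look like—target and file names here are made up for illustration:

```python
# Hypothetical BUILD targets; names are illustrative, not from Google's codebase.
cc_test(
    name = "frontend_cache_test",
    srcs = ["frontend_cache_test.cc"],
    deps = [":frontend_cache"],
)

cc_test(
    name = "datastore_integration_test",
    srcs = ["datastore_integration_test.cc"],
    deps = [":datastore_client"],
    # Marks the test as known-flaky: Bazel reruns it up to three times and
    # reports success if any attempt passes, isolating its noise from the
    # stable suite until the underlying flakiness is fixed.
    flaky = True,
)
```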

Tools

The right tool for the job can often make the difference between the natural and the impossible, and in the Google of yore, writing good tests that provided useful feedback in a reasonable amount of time was a challenge given the toolset of the day. Again, Coding and Testing at Google, 2006 vs. 2011 describes the overall feeling of friction and futility that accompanied the practice of automated testing for much of Google engineering at the time, compared against the opposite feeling of security and productivity that the new tools afforded when combined with company-wide automated testing and design knowledge. This revolution in development and testing tools was largely the result of extensive collaboration between the Testing Grouplet, Testing Technology, and Build Tools over the years. Testing on the Toilet and Fixits went a long way towards putting the new-and-improved tools in everyone’s hands as quickly as possible, which in turn made it easier and easier for teams to adopt Test Certified practices—whether or not a team actually enrolled in the Test Certified program.

Launching, Recognition, and Promotions

One of the biggest contributing factors to the “I don’t have time to test” perception was that testing was not only inessential to launching a product or a feature, but often only threw obstacles in the way. Launching is one of the most important pieces of evidence an engineer can provide in making the case for a promotion, as are public forms of recognition by management or peers, and promotions are of extreme cultural value within Google engineering. Consequently, writing tests was not only seen as a luxury for those fortunate enough to have the time, but was also perceived, to various degrees amongst teams and individuals, as an activity that opposed core Google cultural values. Even the success story of former Testing Grouplet leader Bharat Mediratta’s Google Web Server team, which has been broadly recounted many times within Google over the years—as well as Bharat’s own substantial personal success within Google—proved a relatively ineffective form of propaganda in reversing this perception during the first few years of Testing Grouplet activity.

There was no clear, direct solution to the problem of launching and promotions; just the brute force of unrelenting effort dedicated to improving knowledge and tools all-around until hearts and minds were won on other merits. Still, we were at least aware of the perceived conflict, particularly as it applied to managers who did little-if-any coding and wanted to focus their teams on “productive” effort. We just had to keep doing what we could to educate and empower engineers in the trenches, so that the results of improved testing would become apparent despite a lack of widespread managerial support.

We did, however, play the recognition card to the extent we were able. The Test Certified Ladder, a parallel to the Google Engineering (promotions) Ladder, made public which teams were participating, and at which level. Test Certified teams were rewarded with fun little prizes, like build-monitoring orbs, Star Wars-themed Potato Head toys, and lightbulb-shaped containers of red and green M&M candy. The Test Certified levels and tasks were explicitly designed to fit in with the Objectives and Key Results format that individuals and teams, as well as the whole company, use to set quarterly goals and evaluate progress. And, of course, Fixits were all about prizes, from small tokens for everyone who participated all the way up to rather expensive rewards for the top performers. Plus, after the development of Fixit-tracking prototypes that eventually evolved into David Plass’s Gladwell platform, it was easy to see who was participating in a Fixit all around the world, what specific tasks they were tackling and the code changes they made, and who was in the running for top honors. Pavlov lives.

Also, the aspect of peer pressure as it applies to team dynamics and the peer review/promotions process was not lost on us. The core component of Test Certified Level Two was the requirement that teams adopt a testing policy requiring that all new “nontrivial” changes to the project’s code be accompanied by automated tests. While it’s hard for a single actor to stand up in the midst of a team and push against his peers to improve their testing discipline, particularly if that actor is not a senior engineer or tech lead on the project, once the team as a whole adopts a policy, such activists have a leg to stand on when insisting on better testing, and when justifying their own effort spent towards testing. Even more so, if a project receives a number of changes from engineers outside the team, the policy provides the grounds on which such changes can be accepted or rejected; and since code contributions to other projects are also a nice detail in one’s peer review portfolio, outside contributors have the incentive to comply with the policy to ensure their changes are accepted rather than fight against it and risk having their changes rejected. People could be rewarded or reprimanded at peer review time based on their adherence to, or rejection of, the team’s testing policy.

As an increasing number of teams adopted Chris/Jay Continuous Builds and testing policies, largely thanks to the push behind Test Certified, even engineers who were steadfastly against testing could no longer resist the pressure coming from other teams when they reported breakages due to such engineers’ changes. After all, it’s one thing to eschew testing on your own project, but it looks Really Bad if engineers from completely different teams start pointing their fingers at you come peer review time as a reason they were less productive than they might’ve been. By the time the Test Automation Platform rolled out, practically every team at Google had at least one continuous TAP build, practically all of them were committed to keeping it/them green, and it was so fast and so trivial to pinpoint and report breakages across projects that such breakages were often fixed before most teams even noticed.

Legacy Code

Though Google was only founded in September 1998, by the mid-2000s, there were a large number of very large systems that depended on what could be called “legacy code”. It’s not that the code was unsupported; in fact, it was imperative that a very large percentage of the code base continued to evolve via bugfixes and new features to support an ever-expanding portfolio of projects. But there were significant sections of code that people were very reluctant to change, that just weren’t all that fun to work with, because of the way the code was written. Such code was difficult to understand, in terms of both its logic and the extent of its effect on other code and systems, which made it difficult to change—the most nerve-wracking changes of all involving any code that Ads depended on.

A lot of the Testing Grouplet’s effort was not only directed towards educating engineers on how to write more testable code in the first place, but educating them on how to tease apart the pieces of existing systems and slowly, over time, get them under test. We relied to some extent on publicizing the methods advocated by Michael Feathers’s Working Effectively with Legacy Code; Nick Lesiecki created a phenomenal Codelab (i.e. internal training document) applying some of Feathers’s techniques to a real piece of Google code. As mentioned earlier, the Test Mercenaries were formed to give hands-on assistance to a number of large product teams to help get existing code under test. Improved build tools not only made it faster and easier to write and run tests for such code, but to experiment with refactoring it into smaller pieces and applying good testing practices, using techniques and tools largely popularized by Testing on the Toilet.

At the level of actual source code, some of the biggest problems people would run into were static objects and methods, too-large classes and functions, and overuse of implementation inheritance. From my perspective, nearly all of the specific cases of “hard-to-test” code boiled down to these three core issues, compounded by a lack of efficient build tools and adequate testing tools at the time. (Multithreading issues also made testing difficult in some cases, which we didn’t really address to the same degree, though improving testability via more modular designs and writing smaller, focused tests tended to improve reasoning about multithreading as well.) I’ll describe these code-level issues in the next post, starting off with a word about the Good News that is dependency injection, which went a long way to curing each of these ills.