Mike Bland

Music student, semi-retired programmer, and former Googler

Test Certified

The Testing Grouplet's program for promoting good automated developer testing practices throughout Google Engineering

- New York
Tags: CJ, Google, grouplets, Jimi, SML, technical, Test Certified, Test Mercenaries, Testing Grouplet

Test Certified (aka “TC”) was kind of a twelve-step program (without twelve steps, exactly) for getting teams to improve their developer testing practices and test coverage, developed and promoted by the Testing Grouplet. It helped teams establish the measurements, early goals, and commitment necessary to achieve noticeable results quickly, and then to follow through on habits and policies to promote longer-term code health. As hinted at in my earlier Testing Grouplet post, getting all engineering teams on the “TC Ladder” became the concrete focus of both the Testing Grouplet and Test Mercenaries starting in mid-2007, and eventually Test Engineers (TEs) and Software Engineers in Test (SETs) throughout the Engineering Productivity focus area used it to drive improvements within their product teams.

Test Certified was the brainchild of Bharat Mediratta and Nick Lesiecki, the original leaders of the Testing Grouplet, sometime in 2006. Mamie Rheingold was TC’s patron saint at the time I ran the Testing Grouplet, nearly single-handedly sustaining its momentum until Bella Kazwell (née Voldman) assumed ownership of the TC program and grew it enormously during her tenure. Patrick Doyle then ran the program as it continued to gain traction throughout the company. Tracy Bialik and Russ Rufer administered the program up until the time I left Google; I assume they still do.

Something to keep in mind as you’re reading the details: Recently I happened across the book Willpower: Rediscovering the Greatest Human Strength (thanks to a Google+ post by Brian Slesinsky), and it underscores the importance of taking measurements and setting achievable goals to develop the self-control necessary to improve a situation. This may seem like basic common sense, but it’s the kind of thing that we easily take for granted until someone points out the obvious within plain sight, and the book describes all the social science that’s been done to reinforce the point. With Test Certified, this set of goals and measurements worked both ways: Teams could get a handle on their testing practices, and the Testing Grouplet—and, eventually, the Test Mercenaries, and to some extent Test Engineering/Engineering Productivity—could measure its impact on the broader engineering organization, via tracking how many teams were on the ladder, at which phase, the difference quarter-to-quarter, etc.

Test Certified originally comprised three levels, each of which consisted of smaller tasks. Some tasks could be completed very rapidly, while others took more time and commitment. As one might expect, each level took progressively more time and effort to reach than the one before.

Level One was about setting up the tools necessary to measure the current state of affairs, such as: setting up a continuous build machine and test coverage metrics; classifying tests as “small”, “medium”, or “large” in size (which I’ll explain in a future post; think “unit”, “integration”, and “system” for now); and identifying “flaky” tests—aka “nondeterministic” tests, whereby a test with the same input data for a piece of code could produce different results, due to some uncontrolled, “external” dependency such as the system clock or an external service. This level was designed to be easy to achieve within a day or five of effort, by one or two engineers. This was the most critically important set of tools and practices for every team to adopt, insofar as when folks get the proper measurements in place, they often become self-motivated to improve their testing habits substantially even without explicitly running the gamut of TC levels. (This is akin to the observation in the Willpower book that buying a bathroom scale and taking daily measurements can motivate one to achieve weight loss goals and keep one’s weight under control from then on.)
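To make the “flaky” idea concrete, here is a minimal Python sketch of a test that depends on the system clock, next to a deterministic version of the same test. The `greeting` function, its `now` parameter, and both test names are hypothetical, invented for illustration; they are not taken from any Google code or from the TC materials.

```python
import datetime
from typing import Callable


def greeting(now: Callable[[], datetime.datetime] = datetime.datetime.now) -> str:
    """Return a greeting that depends on the hour of day (hypothetical code under test)."""
    return "Good morning" if now().hour < 12 else "Good afternoon"


def flaky_test() -> bool:
    # Uses the real system clock, so the result depends on when the test runs:
    # it passes before noon and fails afterwards, with no change to the code.
    return greeting() == "Good morning"


def deterministic_test() -> bool:
    # Injects a fixed clock, so the same input always gives the same result.
    fixed_clock = lambda: datetime.datetime(2007, 6, 1, 9, 0, 0)
    return greeting(now=fixed_clock) == "Good morning"
```

Passing the clock in as a parameter is only one way to bring an external dependency under the test's control; the point is that the second test gives the same answer at any time of day.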

Level Two consisted of establishing a written policy that essentially forbade anyone from submitting untested code, as well as goals for test coverage and a balance between small, medium, and large tests. These concrete numeric goals were a bit more contentious, but they were only suggestions, guideposts that teams could use and later modify as necessary. Depending on the project, based on the structure of the code or the application as a whole and what proved reasonable, a team could be “certified” at Level Two without coming anywhere near the stated coverage levels or test size ratio. We anticipated that teams could draft and accept a policy right away—I drafted a policy template stolen from Bharat’s team’s policy, which became the basis for hundreds of individual team policies—and spend a couple weeks or months working towards the coverage and test-balance goals.
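As a rough illustration of what checking a test-size balance might look like, here is a small Python sketch. The target shares and the tolerance are made-up numbers chosen for the example, not the actual TC goals, and `size_shares` and `within_targets` are hypothetical helper names.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical targets, invented for illustration; not the real TC numbers.
TARGET_SHARE = {"small": 0.70, "medium": 0.20, "large": 0.10}


def size_shares(sizes: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of tests labelled with each size."""
    counts = Counter(sizes)
    total = sum(counts.values())
    if total == 0:
        return {size: 0.0 for size in TARGET_SHARE}
    return {size: counts.get(size, 0) / total for size in TARGET_SHARE}


def within_targets(sizes: Iterable[str], tolerance: float = 0.10) -> Dict[str, bool]:
    """Report, per size, whether the actual share is within tolerance of the target."""
    shares = size_shares(sizes)
    return {
        size: abs(shares[size] - TARGET_SHARE[size]) <= tolerance
        for size in TARGET_SHARE
    }
```

In keeping with the goals being guideposts rather than hard rules, the sketch only reports where a team stands against each target; it does not pass or fail anything.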

Level Three was the model of a smooth-running testing process, as best we understood and could express it. It was basically about sustained high coverage for all types of developer tests across the board, with coverage goals for each group of tests of a certain “size”, as well as low tolerance for broken or flaky tests. This level represented an on-going, long-term commitment that a team would strive to maintain in perpetuity.

Eventually a Level Four and a Level Five were added, with more stringent coverage goals and tasks incorporating other tools such as static analyzers. I didn’t really participate in these conversations; Level Three was enough for me. I’ll explain a little further below.

The “certification” process was a somewhat informal affair, later codified slightly into the following: A team would request a Test Certified “mentor”, a volunteer who had been indoctrinated into the ways and lore of the TC Ladder and who was entrusted with guiding others along the way. The team would be assigned a mentor, who would personally confer with the team and get them started with tools, measurements, and policies. The team would occasionally check in with the mentor, or vice versa, and once both believed that a certain level’s requirements had been fulfilled, the mentor would request a review on the TC mentors’ mailing list. If, after some peer review, the consensus was that the team had met at least the spirit, if not the exact letter of the requirements, the team would be awarded Level X Certification.

Doesn’t sound scalable, does it? Well, as with Google’s build system, we developed some tools and protocols to deal with the administrative load. Beyond that, after the Testing Grouplet had decided on getting all engineering teams on the TC Ladder as its primary mission, I lobbied folks in Engineering Productivity to use Test Certified as the concrete vehicle by which they could encourage their client teams to improve their code quality and make better use of the TEs’ and SETs’ time, by allowing fewer low-level bugs to slip downstream. The idea eventually caught on, and Engineering Productivity became a huge force behind TC’s growth and impact, by providing a large pool of volunteer mentors (every new mentor takes a bit of load off of every existing mentor) and by focusing tool development such that achieving Level One Test Certification and beyond became easier and easier, eventually culminating in the Test Automation Platform (TAP). Of course, we had Testing on the Toilet (TotT) as a vehicle to promote TC itself, as well as advances in testing techniques and tools, so that mentors and teams throughout the company could stay in the loop and share a common vocabulary, reducing friction that much further. The Test Mercenaries provided hands-on guidance for teams that needed the most help (in fact, that’s why they were started; more in a future post), as well as invaluable in-the-trenches feedback regarding how achievable and relevant the goals were and the condition of the tools available to help reach them.

Plus, we made it fun, of course. There is no great work without fun, because there is no way anyone can follow through on such actions, attract others to the cause, and maintain their sanity without taking creative liberties and laughing with one another. Even then, I didn’t always achieve my personal sanity goals. We came up with a Test Certified logo based on the Testing Grouplet logo and made T-shirts and coffee mugs with the slogan “Arrre you Test Certified?” (Yes, a pirate theme. The Testing Grouplet community had a thing for romanticized outlaws.) We filled plastic lightbulbs full of green and red M&Ms to give out as little rewards. There was the Test Certified Challenge, organized by Matt Vail and Tayeb Karim, which was a months-long Fixit that encouraged tons of individuals, teams, and entire offices to climb the TC Ladder for points, prizes, and glory. In New York, David Plass, Prakash Barathan, Catherine Ye, Tony Aiuto, and others built a Statue of Liberty build monitoring lamp as a reward for New York teams that achieved TC Level One.

First, the great news: Test Certified more than did its job, as far as I’m concerned. In addition to a very large number of projects signing up to participate over the course of about four years, TC drove so much discussion, process change and tool development that its policies and practices became essentially standard procedure for most Google engineering teams, whether they were officially on the ladder at a certain level or not. Even though we didn’t literally get every single team to reach Level Three according to the exactly-as-stated coverage and test size goals [1][2], effectively we got every engineering team to pay attention to its testing hygiene and code health, and to expect higher standards from every other team in the company as well. In my mind, that was all that mattered. The ends justified the means, regardless of the specific shape of the outcome compared against explicitly stated goals.

What’s more, Test Certified was developed and promoted bottom-up by a group that had zero authority to dictate process and policy to anyone, and as a consequence, it helped effect real, significant, lasting change. The very afternoon after I first lobbied a gathering of Test Engineering folk to adopt Test Certified as a means to have a meaningful dialogue with product teams, I met Patrick Copeland, the Director of Engineering Productivity, for the first time. I lamented that we still needed then-CEO Eric Schmidt and other high-level executives to openly and forcefully endorse Test Certified, and developer testing in general. Patrick then related that he’d come from Microsoft, Eric from Novell and Sun, Alan Eustace from Digital, Bill Coughran from Bell Labs, etc., and at each of those companies, these guys had seen corporate mandates fall flat with engineers and fail time after time. They were waiting for good, well-implemented, and street-proven ideas to bubble up for them to get behind when the time was right. That totally changed my perspective and thought processes from then on, and I never gave another thought to securing such an endorsement. In the end, while we received a few kind nods from high-up execs on rare occasions, we managed to achieve a sweeping, permanent culture change on our own, without help (or, thankfully, interference or deterrence) from on high.

Now, the dark side: Some engineers love to get mired in the niggling details, what’s effectively measurable and what’s meaningful and what’s important vs. what’s not, yada yada, blah blah woof woof, ad nauseam, etc. Three TC Levels plus a spirit-of-the-law interpretation of the coverage and test size-balance goals seemed plenty to me to accomplish the mission of improving developer testing discipline throughout Google, but such debates raged on anyway and extra levels were created. Even when I was leading the Testing Grouplet, these discussions exhausted my patience such that I eventually withdrew from them and became less and less involved as a TC mentor. Thankfully, there were more folks with the energy, enthusiasm, patience, motivation, and self-control to deal with these debates, to pick up my slack and keep everything on course.

The only time I would actively step in to kill a topic of discussion was when someone proposed changing the “Test Certified” name to something else, again. We knew the name sucked, but we had already spent months mulling over less-than-ideal alternatives. The policy, and its justification, was straightforward: If anyone had a better name that a critical mass of volunteers could immediately agree to, then we’d adopt it posthaste; otherwise, discussion was not to continue. Precious volunteer time, attention, and energy were not to be wasted on such fruitless, repeated bikeshedding. The name never changed, and ultimately never seemed to matter.

That being said, despite my distaste for discussions that went off the rails in terms of usefulness or bureaucratic hair-splitting, there’s no denying that many more discussions regarding Test Certified’s criteria and goals—discussions often alternately provoked and summarized by TotT—raised the community’s level of understanding of the issues surrounding developer testing; awareness of the available techniques and tools; and sophistication in terms of choosing appropriate goals, processes, and policies. Engineers themselves weighed and debated the relative merits of every piece of the program, and made conscious decisions to adopt, adapt, or reject each piece, resulting ultimately in a much stronger program, and a much more pervasive culture of maintaining very high code quality throughout Google.

[1] In fact, Bharat’s own project, after which TC was modeled, never was Level Two-certified, to the best of my knowledge, but no one in the entire company doubts their dedication to good test hygiene and code health.

[2] Well, there was one Noogler in Sydney who instant-messaged me on a tip that I was a person-who-knew-people, asking me to talk to that team to persuade them to allow him to submit a change to improve the testability of a particular component. He decried their lack of testing discipline rather desperately, until I told him that the team lead used to run the Testing Grouplet and that TC was modeled after that team’s process and policy. Humbled, he resumed the conversation with the team, resulting in a satisfactory conclusion.