Coding and Testing at Google, 2006 vs. 2011

The before-and-after picture of the Testing Grouplet et al.'s impact on Google Engineering

02 Dec 2011 - New York
Tags: Build Tools, Eng Prod, Fixit Grouplet, Fixits, Google, TAP, Test Certified, Testing Grouplet, Testing Tech, TotT, grouplets, technical

In which we come down from our 50k-foot overview of this [Grouplet][]-related program and that, and get a feel for what it was like in the trenches as a Google engineer with regard to coding and testing.

Summer 2006
Summer 2011
Afterword

This should provide some perspective on the problems that the Testing Grouplet, Testing Technology, Build Tools, the Fixit Grouplet, and Engineering Productivity were tackling as a joint effort from the years 2005-2010. It is an amalgam of my own experiences, and the experiences of others that I happened to observe. Any distinct similarity to specific events or individuals is purely coincidental.

Googlers: Feel free to fact-check me, and I’ll make corrections as necessary.

Summer 2006

The Engineer slides his legs under his desk with a non-fat, double-shot latte from the microkitchen in-hand, his belly full from a full-sized breakfast of a custom-made omelette, buttered toast and fresh, locally-harvested fruit and orange juice from the cafe. Some of the more socially-inclined morning people in the office mutter a barely-audible, but sincere “Mornin’” under their breath in between lines of code, and the Engineer unlocks his Linux workstation to pick up from where he left off yesterday.

Maybe today I’ll finally get to integrate my new feature, he hopes. I’d hate to miss tonight’s cutoff before the monthly release.

Perforce

At 9:47am, Pacific Daylight time, the Engineer lays his fingers on the keyboard and issues the commands to synchronize his three most important source code working directories, or “clients”, with the most recent source code versions stored within the depot on the main Perforce server. There’s a good chance that nearly 50% of the other engineers in the company are initiating the same ritual at this moment: Sync to pick up changes from yesterday, especially from the engineers in far-off lands who work their hardest while Mountain View dreams of electric sheep; check email, and maybe jump into the middle of a few contentious technical, cultural, economic, or political discussions; go through some code reviews pending in one’s inbox; maybe surf Slashdot for the daily injection of life outside the ’Plex—and then check to see if the sync is finished. If so, hooray! On to the next step. Otherwise, back to the info-trough.

Certainly this new view of the single source code depot (the very same shared across websearch, Gmail, ads, and the rest) will contain dozens of bug fixes, performance improvements, and nifty features. The time-honored, cultural gating mechanism of code reviews, combined with per-language style guides applied throughout Google (pragmatic development traditions maintained by the Readability Grouplet, one of the most hard-working, immensely valuable, underrecognized and underappreciated teams in the whole company), has enabled this single-depot model to scale beyond everyone’s expectations. By and large, anyone can look at any code in the Google source depot, and make a change, as the code looks nearly as familiar as that of one’s immediate project from a formatting and idiomatic point of view. Right things look right, and wrong things look wrong in the same way across the company.

But as every engineer who survived Noogler training knows, code reviews and standard style guides go only so far, and there are probably a few dozen new bugs, performance degradations, and features that don’t quite work as-advertised yet. Everybody wants to leave a thumbprint on google.com or one of the related properties, both for bragging rights and as concrete performance review input, but these well-intentioned “improvements” sometimes leave something so broken that projects depending on them will no longer build, or potentially worse: Subtly broken in a way no one will notice for some time to come.

The extended silence on the command-line rings out like a cry for mercy from the Perforce server. The internal BUILD language, which defines the build targets and dependency structure for Google projects, prevents any individual user from checking out all of the code in the depot, but given the very broad and deep degree of code reuse across projects, a typical project that depends on Bigtable, Chubby, GFS, possibly MapReduce, and more ends up depending on a significant fraction of all the code in the company. The Perforce replica servers don’t lighten the load on the main server and diminish the wait time as much as everyone would like.

The synchronization finishes and the command line prompt returns. 10:09am. The Engineer sighs; there are a couple of conflicts with the changes he’s been preparing, including the most important change to integrate his new feature. But it seems there are no major problems; he edits the files to resolve the conflicts in less than five minutes.

`BUILD` language and `make`

The Engineer builds the code in his most important client to find out what broke since yesterday. Fortunately, not all of the code in his client needs to be built; many directories of source code, known as “packages”, are present in the Engineer’s client from which only a couple files are required—say, a single C++ library comprised of a single header and implementation file, or maybe a protocol buffer definition or two which many times don’t depend on much else—but the BUILD language is not savvy enough to determine dependencies at this granularity, and unlike code style standards, there is no engineering-wide notion of a package or component structure to guide the organization of source code directories to minimize dependencies and the amount of code that is checked out by dependents. Not every package in the depot is pulled into the Engineer’s clients, but he would swear such was the case after the fifteenth minute of waiting for his Perforce synchronization to complete. Code continues to get added to each package, along with new dependencies on packages needed for new features, day after day, multiplied across thousands of clients depending on such packages, while Perforce gently weeps.

The build happens in two steps: First, since the BUILD language is not interpreted directly as part of the build process, the Engineer has to run a fairly complex Python script to compile it, by building the full dependency graph from the transitive closure of BUILD files in his project, determining only the targets that are needed by the project, and generating a megabytes-large Makefile. Whenever the Engineer syncs his client, or changes a BUILD file himself, this process must be repeated to generate a new Makefile for the project. Otherwise, Bad Things Happen.
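To make that first step concrete, here is a minimal sketch of the idea, not the actual script: walk the transitive closure of dependencies from one target and emit Makefile rules only for what is reachable. The target names, the `BUILD_GRAPH` dictionary standing in for parsed BUILD files, and the placeholder `echo` recipes are all hypothetical.

```python
from collections import deque

# Hypothetical, simplified stand-in for parsed BUILD files:
# each target maps to (source files, direct dependencies).
BUILD_GRAPH = {
    "search/frontend": (["frontend.cc"], ["base/strings", "storage/table"]),
    "storage/table": (["table.cc"], ["base/strings"]),
    "base/strings": (["strings.cc"], []),
    "ads/unrelated": (["ads.cc"], ["base/strings"]),
}


def transitive_closure(graph, root):
    """Return every target reachable from root, including root itself."""
    seen = {root}
    queue = deque([root])
    while queue:
        target = queue.popleft()
        _, deps = graph[target]
        for dep in deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def generate_makefile(graph, root):
    """Emit one rule per needed target; unreachable targets are left out."""
    lines = []
    for target in sorted(transitive_closure(graph, root)):
        srcs, deps = graph[target]
        prereqs = " ".join(srcs + sorted(deps))
        lines.append(f"{target}: {prereqs}")
        # Placeholder recipe; a real rule would compile and link.
        lines.append(f"\t@echo building {target}")
    return "\n".join(lines) + "\n"


makefile_text = generate_makefile(BUILD_GRAPH, "search/frontend")
```

Even in this toy, the generated text depends on every BUILD entry in the closure, which is why a sync or a BUILD edit anywhere in that closure means regenerating the whole thing.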

Then comes the make command. Nothing fancy, just your stock GNU Make. make launches, parses the megabytes-long Makefile to build a dependency graph, checks the timestamps of all of the input objects and build products—thousands of them—determines an order for the build actions, and begins compiling, linking, etc. Then it exits. That’s how make works; build dependency state, check dependencies, execute, exit. This cycle usually takes on the order of minutes at a time, even for a change to a single file. Fortunately compiles are distributed to a shared pool of machines using distcc, but all preprocessing, linking, and test execution happens locally, and the distcc pool is relatively small and can get bogged down easily during peak hours.
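The timestamp check at the heart of that cycle can be sketched in a few lines. This is a simplified model of the rule GNU Make applies, not its implementation: a target is rebuilt if it is missing, if any prerequisite is newer than it, or if any prerequisite is itself being rebuilt. The file names and modification times are made up, and the sketch assumes the graph has no cycles.

```python
def plan_rebuild(rules, mtimes):
    """Return the set of targets a make-style timestamp check would rebuild.

    rules maps each target to its prerequisites; names absent from rules are
    treated as plain source files. mtimes maps existing files to their
    modification times.
    """
    decided = {}

    def is_stale(name):
        if name not in rules:  # a source file is never rebuilt
            return False
        if name not in decided:
            prereqs = rules[name]
            # Visit every prerequisite first, as a depth-first build would.
            prereq_stale = [is_stale(p) for p in prereqs]
            missing = name not in mtimes
            older = any(
                mtimes[p] > mtimes[name]
                for p in prereqs
                if not missing and p in mtimes
            )
            decided[name] = missing or older or any(prereq_stale)
        return decided[name]

    return {target for target in rules if is_stale(target)}


# Hypothetical three-rule project: only a.cc was edited since the last build.
rules = {
    "app": ["a.o", "b.o"],
    "a.o": ["a.cc", "common.h"],
    "b.o": ["b.cc", "common.h"],
}
mtimes = {"a.cc": 200, "b.cc": 100, "common.h": 100,
          "a.o": 150, "b.o": 150, "app": 160}
to_rebuild = plan_rebuild(rules, mtimes)
```

Note that every rule is visited even though one file changed. With thousands of rules and a process that exits after each run, that walk is repeated from scratch on every invocation.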

While that build runs, the Engineer switches to work in another client, on a bug fix for an issue that had just been reported the day before. He pores over code and a bit of documentation, eventually diagnosing the problem. He begins to write a fix, but he doesn’t dare try to build it until the other build finishes first.

Dependencies

The first time, the build for the new feature isn’t successful. The Engineer isn’t surprised. It seems that there was a package midway in the dependency hierarchy that cut a dependency that the Engineer’s project actually needed. The source code from one of the Engineer’s packages had a direct reference to code from this other package, but the BUILD language wasn’t strict enough to catch this undeclared direct dependency when it first appeared. Adding that logic to the language would probably complicate and slow down the BUILD compilation script, the generated Makefile, or both. Most things get by without it, anyway.
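The breakage is easier to see in miniature. The sketch below uses hypothetical package names (`feature`, `middle`, `util`) and contrasts a lax check, which accepts any reference that is reachable transitively, with a strict check, which demands that every direct reference be declared directly. It illustrates the idea only and says nothing about how the real tools implement it.

```python
def reachable(declared_deps, package):
    """Return every package reachable from package via declared dependencies."""
    seen = set()
    stack = list(declared_deps.get(package, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(declared_deps.get(dep, []))
    return seen


def lax_check_passes(declared_deps, direct_references, package):
    """Lax: a reference is fine as long as something pulls it in transitively."""
    return set(direct_references[package]) <= reachable(declared_deps, package)


def undeclared_dependencies(declared_deps, direct_references):
    """Strict: report direct references that are not declared directly."""
    problems = {}
    for package, referenced in direct_references.items():
        declared = set(declared_deps.get(package, []))
        missing = sorted(set(referenced) - declared - {package})
        if missing:
            problems[package] = missing
    return problems


# feature's source refers to util, but its BUILD entry only declares middle.
references = {"feature": ["middle", "util"], "middle": [], "util": []}
before = {"feature": ["middle"], "middle": ["util"], "util": []}
after = {"feature": ["middle"], "middle": [], "util": []}  # middle drops util

passes_before = lax_check_passes(before, references, "feature")
passes_after = lax_check_passes(after, references, "feature")
strict_report = undeclared_dependencies(before, references)
```

Under the lax rule the problem stays invisible until someone else edits `middle`; the strict rule would have flagged `feature` on the day the reference was written.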

Looking through the change history, it seems that someone in Zürich had made the breaking change twelve hours ago, and given the nine-hour time difference, it’s not surprising that his chat status is “offline”. Fortunately, the Engineer’s project was one of the few directly depending on the offending package at this point, and declaring the direct dependency is the right thing to do anyway.

If this had happened at a lower level of the dependency hierarchy, the change would certainly have been rolled back posthaste; it probably would’ve broken nearly every Google project, or at least a huge percentage of them. If every broken project had its dependencies fixed separately, Perforce would choke on the load of thousands of files being re-synced thousands of times once engineers try to integrate the update at the same time, and hundreds or thousands of engineers would have to re-compile the BUILD graph. It comes down to deciding on the lesser evil: A single change to put things back as they were vs. slamming on the brakes and breaking the flow of a significant fraction of Google engineering. No surprise the web of BUILD dependencies was broad as the horizon, yet brittle as thin ice.

The Engineer adds the missing dependency to the broken BUILD target, then recompiles the BUILD graph and launches make in a single command-line invocation. He then goes to the microkitchen for a Diet Coke, as his latte cup had grown cold and empty.

Microkitchens

A year from now, every engineer on the way to the microkitchen will glance knowingly at the copy of xkcd 303 posted on the announcement board along the way. The episode will show two engineers swordfighting in the hallway, using “My code’s compiling!” as an excuse. This comic will appear in multiple places in every Google engineering office. But at Google, no matter the time or location, one does not swordfight. One makes lattes.

Though the Engineer was only going to grab a Diet Coke and go, there was inevitably a small congress of engineers from nearby projects huddled around the espresso machine and other tables. Before he arrived, there was some chatter about the past weekend at Tahoe; about the paging performance of the memory manager in the latest Linux kernel; about how the robots that the MIT AI lab is developing now compare to the ones one of the engineers worked on during her time there; about Microsoft and Yahoo! employees caught trying to sneak into Charlie’s Cafe. As the Engineer approached, one of his colleagues from this cluster of conversationalists perked up.

“Hey man, how’s it goin’?”

“Alright. Just kicked off a build, thought I’d get up and get another drink to avoid getting sucked into another discussion thread.”

“I hear ya. I’m trying to sync a client right now. Good time for coffee.”

“Yeah. Amazing that we can find any piece of information in the world in microseconds, but we have to wait half an hour for something to build after changing a line of code.”

“Or sync a one-package project that depends on more than the base libraries.”

As the Engineer later returns from the microkitchen with a half-full Diet Coke—one does not simply walk into the microkitchen and walk right back out at moments like these—it seems that the build has finished, but in addition, his calendar has popped up a notice that his 11:30am meeting is starting in ten minutes. This had popped up thirteen minutes ago. No problem; the meeting happens to be in the building next door, and Google time doesn’t start ’til seven after anyway. The Engineer grabs his PowerBook and leaves.

Test Slowness and Flakiness

After the meeting, the Engineer and his teammates go straight to the cafe, and indulge in a delicious gourmet meal, just like any other day at the office, and upon returning to his desk he examines the results of the build that had finished while he was in his meeting. The build appears to have completed successfully. Now comes the part he really dreads: Executing the unit test suite.

Unlike compile steps, tests cannot be remotely executed by distcc. The Engineer launches the test command and waits as the tests run one-at-a-time on his workstation and consume nearly all of its available CPU cycles. There was some recent relief in the form of an internally-patched dynamic loader; tests that used to take 45 seconds just to resolve symbols before executing main() now took less than one second to reach that point, but many tests were still overly-broad and took on the order of 30 seconds or more to run.

Most engineers never even bother with writing tests for this reason; they see it as a time sink for little tangible benefit, especially when just checking out and building the code takes so long in the first place. They figure that most errors will probably get caught in code review or by the manual testing done by the Test Engineers anyway. On the contrary, the Engineer and his team hold to the belief that unit tests are vital to having some degree of confidence in the system, but are frustrated that the friction imposed by the slow build tools leads most engineers, himself included, to produce large test programs that execute slowly and exercise too much code. They recognize at least the theoretical value of unit tests, but they have time to make them only so good.

Seven minutes into the test run, the Engineer switches back from processing email and a code review to check on progress, and two of the tests have failed. Neither of them was the test he had written for his new feature; he finds it difficult to believe his change was the cause of any of the failures. In fact, one of the failing tests is chronically flaky, meaning that it would pass or fail somewhat randomly even if you ran it ten times in a row without changing anything. This is a sign that the test does not explicitly control all of the inputs necessary to produce a meaningful execution context for the test, or allows too many unnecessary inputs to influence it. Somebody will eventually get around to figuring out the exact root cause of the problem and fixing it, some day, but the world hasn’t ended yet, and there is plenty of other work to do, so it remains a non-priority for the team. The failure from that test looks typical, so the Engineer ignores it for now.
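The Engineer's flaky test is never shown, so the following is only an illustration of the general principle, using the system clock as the uncontrolled input. The function name `is_cache_entry_fresh` and its parameters are invented for the example. The production code accepts the current time as an optional argument, and the test always supplies it, so the outcome no longer depends on when or how slowly the test happens to run.

```python
import datetime
import unittest


def is_cache_entry_fresh(created_at, max_age_seconds, now=None):
    """Return True if the entry is younger than max_age_seconds.

    now is injectable so that tests never need to read the real clock.
    """
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    return (now - created_at).total_seconds() < max_age_seconds


class FreshnessTest(unittest.TestCase):
    def test_entry_expires_at_max_age(self):
        # Every input is fixed, so ten runs in a row give ten identical results.
        created = datetime.datetime(2006, 7, 1, 12, 0, 0,
                                    tzinfo=datetime.timezone.utc)
        just_before = created + datetime.timedelta(seconds=59)
        at_limit = created + datetime.timedelta(seconds=60)
        self.assertTrue(is_cache_entry_fresh(created, 60, now=just_before))
        self.assertFalse(is_cache_entry_fresh(created, 60, now=at_limit))
```

A version of this test that created the entry, slept, and then consulted the real clock would pass on an idle workstation and fail on a loaded one, which is exactly the pattern described above.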

Test Granularity and Lack of Hermeticism

Scrolling back through the output of the second test, it is hard to tell exactly where the problems are, but it definitely doesn’t smell like a mess the Engineer himself has made. But, the failure is indeed troubling; this is a test that checks the output of a major component of the system against a “golden file”, basically a snapshot of output from a previous run that contains all manner of system-wide output that is good for detecting when something changes, but not exactly what has changed, nor whether the change is good or bad. Other components that the project depends on change all the time; often people from those projects have to be pulled in to decide whether a golden file test breakage was good, in which case the golden file just needs to be updated; or bad, in which case the other team has to roll back a change or submit a quick fix. Sometimes the breakages are for other stupid reasons, and rarely is the breakage due to an error in the code that the Engineer is attempting to test. It also takes ten to fifteen minutes to run. Still, it’s better than nothing, and nobody has time to make the test any faster or better.
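For readers who have not met one, a golden file test boils down to something like the sketch below. The helper name `check_against_golden`, the `regenerate` flag, and the report text are assumptions made for illustration; the team's real test is not described in that much detail.

```python
import difflib
import tempfile
from pathlib import Path


def check_against_golden(actual_output, golden_path, regenerate=False):
    """Compare output with the stored snapshot.

    Returns a unified diff, or an empty string if nothing changed. With
    regenerate=True the snapshot is overwritten instead of compared.
    """
    golden_path = Path(golden_path)
    if regenerate:
        golden_path.write_text(actual_output)
        return ""
    expected = golden_path.read_text()
    diff = difflib.unified_diff(
        expected.splitlines(keepends=True),
        actual_output.splitlines(keepends=True),
        fromfile="golden",
        tofile="actual",
    )
    return "".join(diff)


# Demonstration in a throwaway directory; the report contents are made up.
with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp) / "report.golden"
    check_against_golden("rows: 10\nscore: 0.91\n", golden, regenerate=True)
    unchanged = check_against_golden("rows: 10\nscore: 0.91\n", golden)
    changed = check_against_golden("rows: 10\nscore: 0.87\n", golden)
```

The diff says that `score` moved, and nothing more. Whether that is a regression or an intended improvement from some upstream component is a question only a person can answer, which is why these failures so often end in an email to another team.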

After about ten minutes of poring over the test logs, something finally jumps out at the Engineer. Indeed, his change doesn’t seem to be at fault at all; there was a change in output that looks like it depends on a change to a data file that isn’t checked into the Perforce depot, but which is stored on the Network File System share. The build system is very permissive when it comes to build input sources; not all of them have to be safely version-controlled within Perforce. Indeed, given the exorbitant amount of time it already takes to check out and synchronize Perforce clients, a lot of times it makes sense to store large or frequently-changing data files or compiled programs on the globally-accessible NFS share. The upside is a big time savings when using Perforce; the downside is a lack of hermeticism, whereby the product of a build or the result of a test is not purely a function of the inputs managed by Perforce, making it difficult to account for changes and breakages.

OK, time to shoot off an email to the backend team to see if they can take a look at this output to make sure the change is expected, he laments. There’s a tech talk on the contribution of virtual machine-based and interpreted languages to datacenter inefficiency and global warming at 2:30 I want to check out anyway. They should get back to me by the time I get back.

Testing on the Toilet

On the way to the tech talk, the Engineer makes a necessary stop in the men’s facilities. This is a more interesting experience than it used to be, since some funny volunteer group crazy about testing started posting these flyers in the stalls and over the urinals. They’ve been going on for a couple of months by now, saying things about how to write proper unit tests, controlling time, writing your code a certain way, writing “small” tests, and so on. Cute, and definitely food for thought, but what good will it do?

This week they’re advertising something about a “Testing Fixit” coming in a few weeks, at the beginning of August. Now, Fixits, those are fun, he thinks. Those are the few times when it seems like working on a broken test may actually get you somewhere. Good luck to those guys. Maybe I’ll do something for that day, especially since they usually have T-shirts. I love T-shirts. Have to talk about it with my team a little bit.

Bugs and Coverage

Checking email after the tech talk, the Engineer receives confirmation that, indeed, the change in golden file output does look consistent with the update to the data file on NFS. So he kicks off a new run of the test binary in golden file-regeneration mode, and goes back to his other client to work on fixing the bug for a few minutes.

At 2:57pm, the fix is essentially done. Does he really need a test for this? No, he decides; it’s very small and very obvious. Testing it might require writing a new function or a new class, but there’s no point in creating new classes or possibly even new files just to test something. More classes and files just add complexity.

The Engineer begins to recall that he had reported a bug a month ago where an engineer didn’t write a test for a “small and obvious” case. It took the better part of a day to track it down, and another day to submit the fix for review, get an approval, and submit. That was beyond irritating. Oh, and there was the one last week. And two or three of his teammates also mentioned similar emergency bug hunts, one of which involved a production crash that proved difficult to isolate and reproduce; that one took a week and a half to totally resolve. Still, this change is OK, even if it is too hard to test. Maximizing test coverage for its own sake is a stupid goal anyway. He sends the change to his tech lead for a code review.

The Engineer remembers that he had regenerated the golden file for the other change; he runs the golden file test again to make sure that everything passes. In the meanwhile, he catches up on a couple of code reviews he owes his teammates, then switches to his third client to work on another new feature scheduled for release the following month.

At 3:15pm, he notices that the golden file test passed; no more work to be done there until his change gets approved. He turns around to get his teammate’s attention, and politely asks him to take a look at the last round of the code review that the Engineer had bounced back to him yesterday. His colleague agrees to get to it in just a few minutes, and the Engineer goes back to focusing on the other new feature.

Submitting the change

The Engineer catches an update in the Gmail tab of his browser. Switching to check his inbox, he sees that his teammate just approved his change. “LGTM; just grammar and style nitpicks. Please submit after addressing.” The Engineer fixes a subject/verb agreement in a comment, removes a heretofore undetected trailing whitespace, and since he’d just synchronized the client containing the change, built everything, and ran the few horrible tests, he’s confident the change is ready.

He executes the command to submit the change to the Perforce source code depot. The familiar silence of the command-line returns while Perforce automatically synchronizes his client before committing the change. The Engineer checks the clock; it’s 4:27pm. A lot of engineers are likely trying to submit right now, during that sweet time after all the email, all the meetings, all the code reviews, and finally, some real work has been done and it’s time to get it checked in before going home in an hour or two. This one is going to take a while.

The Engineer takes another trip to the microkitchen, and decides to squeeze in a round of billiards with his compatriots who are also waiting for either a build to complete, tests to run, or Perforce to finish syncing or submitting. The first round goes quickly; these folks have had plenty of practice. So they go in for another. Then just one more, to be sure the command line will be ready to accept new input before returning.

During each game, in between rants about the river of molasses they have to cross to get code written, updated, and submitted, and light banter with the occasional sprinkle of geekier-than-thou one-upmanship, they do manage to talk a little about what they’re working on, the design issues and use cases, ways the language they’re using helps or hinders certain implementation details, etc. So these intermissions have some productive value; just proportionately lower than the value to be gained from remaining in the flow and moving from the most recently completed work item to the next.

The Engineer returns to his desk. He does not find a fresh command-line prompt eagerly awaiting his input. Instead, Perforce has detected merge conflicts; incompatible edits to the same files that the Engineer has changed and is trying to submit. One of his teammates had submitted new changes to these files just half an hour earlier. These conflicts are much more numerous and tricky than the ones he dealt with in the morning. He needs to manually address each conflict and edit as necessary, just as before, but these will take significantly more care. Then he needs to build the code—again. And run the tests—again. And try to submit—again, though by 6:13 in the evening, he clearly won’t get this done in time to submit the change tonight.

Of course, he could submit the updated change with the existing approval, but that’s bad karma when dealing with a conflict of this magnitude; it really is best to bounce the code review back to his teammate again for another approval. Maybe he’ll get lucky and be able to submit first thing in the morning. But so what? The release engineer has already launched a Perforce synchronization command for this month’s release before going home for the night. At least now it’s late enough that he can focus on fixing the problems with relatively little distraction from emails, meetings, tech talks, coworkers and code reviews.

By 7:19pm, the Engineer has resolved the problems with the code and run a few initial tests to make sure things mostly work. He kicks off a full build of the entire project and its tests to make sure he didn’t break anything else for his team, and uses the wait time to read some of the emails backed up in his inbox and tinker with the 20% project he’s woefully neglected as of late.

The build finishes. The tests pass. The Engineer pings the code review email thread to ask his now-absent teammate for another look.

The End of the Day

Code reviews and readability standards, both incredibly powerful and virtuous instruments, eventually allowed the single-depot source code management model to scale intellectually and culturally beyond the capacity of the tools to manage and build the code efficiently; a rapid checkout-edit-compile-test workflow was a dream of the distant past. No single engineer, not even the Engineer, could hope to push back against the weight of all of the productivity sinks and hope to get anything done in time for the end of the quarter—or for his next performance review.

Code gets added. Perforce gets slower. Builds take longer. Tests take forever. Code goes untested. Dependency cruft builds.

Hundreds of balloons welcoming Nooglers to their first Google desk continue to sprout like mushrooms throughout every floor of every building of every engineering office. The balloons attached to the two desks just outside the Engineer’s office are sagging a bit, but week after next there will be a fresh bouquet decorating the currently-empty desk next to him. The half-inflated balloons sway as the Engineer glides past, eyes straight down, on his way to the exit at 9:06pm.

Summer 2011

Maybe today I’ll finally get to integrate my new feature, he hopes. I’d hate to miss tonight’s cutoff before the weekly release.

Perforce

At 9:47am, Pacific Daylight time, the Engineer lays his fingers on the keyboard and issues the commands to synchronize his three most important source code working directories, or “clients”, with the most recent source code versions stored within the depot on the main Perforce server. There’s a good chance that nearly 50% of the other engineers in the company are initiating the same ritual at this moment: Sync to pick up changes from yesterday, especially from the engineers in far-off lands who work their hardest while Mountain View dreams of electric sheep; check email and Google+, and maybe jump into the middle of a few contentious technical, cultural, economic, or political discussions; go through some code reviews pending in one’s inbox; maybe surf Reddit for the daily injection of life outside the ’Plex.

It used to be that the Engineer could do all of this well before his clients finished synchronizing. Since the introduction of SrcFS, however, only the source files within the specific packages which he is editing are in his Perforce client. All of the other code necessary to build each of his projects is cached in a Bigtable, exposed through a user space file system that only downloads the contents of a file when it is opened. Sure there are still plenty more dependencies between BUILD packages than he’d like, but engineers throughout Google are no longer paying the cost of checking out code that they don’t need to change—or even build—while still getting the benefits of rapid updates across the code base.
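The property that matters here, paying for a file only when it is opened, can be modeled in a few lines. This toy is emphatically not SrcFS: the class name `LazySourceTree`, the in-memory dictionary standing in for the remote cache, and the file paths are all invented for illustration.

```python
class LazySourceTree:
    """Lists every path up front but fetches contents only on first read."""

    def __init__(self, remote_store):
        self._remote = remote_store  # path -> bytes; stands in for the cache
        self._local = {}
        self.fetch_count = 0

    def listdir(self):
        """Listing is cheap: no file contents are transferred."""
        return sorted(self._remote)

    def read(self, path):
        """Fetch the contents the first time a file is opened, then reuse them."""
        if path not in self._local:
            self._local[path] = self._remote[path]
            self.fetch_count += 1
        return self._local[path]


tree = LazySourceTree({
    "base/strings.cc": b"// strings",
    "storage/table.cc": b"// table",
    "ads/unrelated.cc": b"// never opened by this build",
})
paths = tree.listdir()
tree.read("base/strings.cc")
tree.read("base/strings.cc")  # second read is served locally
```

After this run the tree knows about three files but has fetched one, which is the whole point: the cost tracks what a build actually touches rather than what it could touch.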

The synchronization finishes and the command line prompt returns. 9:48am. The Engineer sighs; there are a couple of conflicts with the changes he’s been preparing, including the most important change to integrate his new feature. But it seems there are no major problems; he edits the files to resolve the conflicts in less than five minutes. Reddit and the internal discussion lists will have to wait.

`BUILD` language and `blaze`

The Engineer builds the code and runs the full test suites in all of his clients to find out what broke since yesterday. Each build and test invocation happens in one step: the in-house blaze build tool parses the BUILD language according to a strictly-defined grammar, determines the build actions required by comparing the checksums of input objects (made faster by SrcFS, which exports the checksum as an extended file system attribute for files not directly in the working directory), and executes the build. It constructs the full dependency graph from the transitive closure of BUILD files in his project, but executes the majority of the build actions in parallel using the custom infrastructure that executes build actions in the datacenter, rather than on the local machine or a small distcc pool. It also takes advantage of objects cached from previous builds by other engineers, effectively reducing the time it takes to build the entire project such that it’s dominated by the time needed to recompile the files actually changed in each client. All tests are executed in parallel using the same framework, meaning that the majority of the tests finish running within seconds, and the overall time is dominated only by the longest-running test.
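The combination of checksums and shared cached objects can be sketched as a cache keyed on a digest of each action's command and input contents. Everything here is hypothetical, including the names `ActionCache` and `action_key`, the use of SHA-256, and the fake compile step; the point is only that identical inputs produce an identical key, so work done for one engineer can be reused by the next.

```python
import hashlib


def action_key(command, input_contents):
    """Digest the command plus every input's name and content checksum."""
    digest = hashlib.sha256()
    digest.update(command.encode())
    for name in sorted(input_contents):
        digest.update(b"\0" + name.encode() + b"\0")
        digest.update(hashlib.sha256(input_contents[name]).digest())
    return digest.hexdigest()


class ActionCache:
    """Maps the digest of an action's inputs to its previously built output."""

    def __init__(self):
        self._outputs = {}
        self.hits = 0
        self.misses = 0

    def build(self, command, input_contents, run_action):
        key = action_key(command, input_contents)
        if key in self._outputs:
            self.hits += 1
        else:
            self.misses += 1
            self._outputs[key] = run_action(command, input_contents)
        return self._outputs[key]


cache = ActionCache()
runs = []


def fake_compile(command, input_contents):
    """Stand-in for a real compile step; records that it actually ran."""
    runs.append(command)
    return b"object for " + input_contents["a.cc"]


sources = {"a.cc": b"int main() { return 0; }", "a.h": b""}
first = cache.build("compile a.cc", sources, fake_compile)   # built once
second = cache.build("compile a.cc", sources, fake_compile)  # reused
edited = {**sources, "a.cc": b"int main() { return 1; }"}
third = cache.build("compile a.cc", edited, fake_compile)    # input changed
```

Unlike a timestamp comparison, the key changes only when content changes, so a fresh sync that rewrites files without altering them does not trigger any rebuilding.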

The more powerful build tools were initially rolled out during the Revolution Fixit in January 2008; after the Forgeability Fixit in November 2009 and TAP Fixit in March 2010, nearly every project in the company is built using these tools, meaning everybody checks out much less code, and syncs, builds, and tests much faster. The new tools are far more strict about enforcing proper dependency declarations and the requirement that all remotely-built targets have no dependencies on artifacts stored outside of Perforce. Entire classes of problems that used to stop everybody in their tracks (long sync times, long build times, long test times, undeclared dependencies, breakages untraceable within Perforce history) practically no longer exist.

By 10:02, all of the Engineer’s builds are done. None are broken, except for one: Three of the tests for the feature the Engineer hopes to submit before the end of the day for the weekly release are failing.

Test Speed, Focus, Stability and Hermeticism

The Engineer looks at the failing test results in Sponge, the centralized repository for test results from all test runs done in the company. (Yes, all, every single one. Compared to the size of the Web, however, it’s not that much data.) None of the failing tests is the test he had written for his new feature. He then checks the current status of his project on the Test Automation Platform, the centralized platform for continuous integration builds for nearly all Google projects. He can see two of the same tests passing up until the change at which he’d synchronized his client; clearly his change must be at fault somehow.

The third breakage is the one remaining golden file test. Golden files are still a pain, but far less so than before, now that all of the artifacts are version-controlled using Perforce. They’re not the Engineer’s first choice for an integration test, though; the team has an action item for this quarter to look for ways to replace it with a more meaningful test. But for now, TAP shows that the failure was due to a bad data push checked in by the backend team five changes before the one at which he’d synchronized, which was rolled back just two changes after. He quickly synchronizes his client again, as well as the other two clients from earlier, to make sure he picks up the reverted golden file, and launches builds in each of them to make sure the golden file test passes once again.

He takes another look at the two failing tests. These are good tests: quick, focused on specific pieces of code, and well-written using the latest C++ unit testing and C++ mocking frameworks. (If it weren’t for Google “Keeping It Legal” harassment guidelines and his own orientation, he would arrange a business trip to Kirkland to personally kiss Zhanyong Wan’s…hand.) All of the flakiness had been eliminated from his team’s tests a long time ago as well; Testing on the Toilet was still going strong, and it had gotten everybody used to the idea of small, medium, and large tests, spread the word about all the latest testing and design techniques, advertised the latest and greatest build and test tools, and helped pull the whole company together on Fixit days that encouraged everyone to fix broken tests and adopt new tools.

On top of that, the Engineer’s team and practically every other team had adopted the Test Certified practices, either by explicitly participating in the “TC Ladder” or implicitly by just adopting all of its component tools, policies, and practices without enrolling in the program. Like many teams, the Engineer’s team has all tests scoped to a specific size, reasonably high test coverage, no flaky tests, and a policy that everybody update and run the full test suite when making a change. Beyond that, nearly every team now has a TAP build, everything building and testing in parallel in the cloud, and a commitment to keeping their own build passing, which has, after a certain tipping point, translated into keeping everybody’s build passing.

Thanks to the focused, stable nature of the tests and the clear, timely output of all the tools, the Engineer rapidly diagnoses the problem: Part of his new change requires updating a field in an existing protocol buffer, but he missed a couple of places where this protocol buffer field is used by other modules in his project. The code only needs to have a pair of existing if statements updated, so the new values are recognized rather than causing an error. An embarrassing mistake, but thankfully a small one. He quickly adds the necessary fixes, re-runs his build and tests, and breathes a small sigh of relief as the tests pass less than seven seconds later.
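For readers who want to picture this kind of mistake, here is a minimal Python sketch. Everything in it is hypothetical: the `Status` values and the two `describe_status` functions are invented for illustration, and a plain enum stands in for the protocol buffer field rather than any real generated code.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical stand-in for the values of a protocol buffer field."""
    ACTIVE = 1
    INACTIVE = 2
    PENDING = 3  # assumed to be the value newly added by the change


def describe_status_before(status: Status) -> str:
    """Code in another module, written before PENDING existed."""
    if status == Status.ACTIVE:
        return "active"
    if status == Status.INACTIVE:
        return "inactive"
    # Any value this module does not know about is treated as an error.
    raise ValueError(f"unrecognized status: {status.name}")


def describe_status_after(status: Status) -> str:
    """The same code with an added branch that recognizes the new value."""
    if status == Status.ACTIVE:
        return "active"
    if status == Status.INACTIVE:
        return "inactive"
    if status == Status.PENDING:
        return "pending"
    raise ValueError(f"unrecognized status: {status.name}")
```

A small, focused test that passes the new value to each caller is the sort of check that would flag the first version right away.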

The Engineer turns around to get his teammate’s attention, and politely asks him to take a look at the last round of the code review that the Engineer had bounced back to him yesterday. His colleague agrees to get to it in just a few minutes, and the Engineer turns back to focus on two other tasks: a bug fix for an issue that had just been reported the day before; and another new feature scheduled for two releases from now.

External Breakage

About ten minutes later, the build status plugin in his browser switches from green to red; somebody has broken the project’s build. He clicks the little red lightbulb icon to go to the TAP project page, and sees that a third of the tests broke at one particular change. He doesn’t see any sign yet that a rollback has been submitted. He clicks through the TAP pages until he reaches the code review page for the offending change.

Turns out there was a change to a component that the Engineer’s project depends on, that a large number of other projects also depend on. The code review thread, after the approval message, is filled with messages from other engineers across several different projects pointing out that the change had broken their tests, complete with Sponge and TAP links. And at the end, the author of the change has posted that a rollback has been submitted in a new change, with a link to that change. No need for the Engineer to add his own posting to the thread at this point.

Switching back to TAP, the Engineer now sees that the rollback has been picked up by TAP, and all of the affected tests—and only those tests affected by the change—have been run, and most of them have already finished in a passing state. Satisfied, the Engineer turns back to his work.
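This post doesn’t describe how TAP decides which tests a change affects. One plausible approach, sketched here purely as an assumption, is to walk the build dependency graph in reverse from the changed targets and run only the tests reached along the way. The `affected_tests` function and all target names are hypothetical.

```python
from collections import defaultdict, deque


def affected_tests(deps, changed, tests):
    """Return the tests that transitively depend on any changed target.

    deps maps each target to the list of targets it depends on directly.
    """
    # Invert the graph: for each target, record which targets depend on it.
    dependents = defaultdict(set)
    for target, direct_deps in deps.items():
        for dep in direct_deps:
            dependents[dep].add(target)

    # Breadth-first walk outward from the changed targets.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)

    return sorted(t for t in tests if t in seen)


# Hypothetical targets, invented for this example.
deps = {
    "frontend_test": ["frontend"],
    "frontend": ["common"],
    "backend_test": ["backend"],
    "backend": ["storage"],
    "common_test": ["common"],
}
tests = ["frontend_test", "backend_test", "common_test"]

print(affected_tests(deps, ["common"], tests))
# → ['common_test', 'frontend_test']
```

In this toy graph a change to `common` leaves `backend_test` alone, which is the property the Engineer sees on the TAP page: the tests affected by the change, and only those tests.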

Submitting the change

The Engineer catches an update in the Gmail tab of his browser. Switching to check his inbox, he sees that his teammate just approved his change. “LGTM; just grammar and style nitpicks. Please submit after addressing.” The Engineer fixes a subject/verb agreement error in a comment and removes a heretofore undetected trailing whitespace. He builds and runs all the tests one last time, just to be safe; they all pass, and he’s confident the change is ready.

He executes the command to submit the change to the Perforce source code depot. The Engineer checks the clock; it’s 11:04am. Perforce has detected merge conflicts; incompatible edits to the same files that the Engineer has changed and is trying to submit. One of his teammates had submitted new changes to these files just five minutes earlier. These conflicts are much more numerous and tricky than the ones he dealt with in the morning. He needs to manually address each conflict and edit as necessary, just as before, but these will take significantly more care.

While he works on the updates, Google Calendar pops up with a notice that his 11:30am meeting is starting in ten minutes. Larry, though not expected to attend this particular meeting, would not approve of him being late. It happens to be in the building next door, thankfully, so he doesn’t have to scavenge for a GBike and navigate through traffic. The Engineer grabs his MacBook and leaves.

After the meeting, the Engineer and his teammates go straight to the cafe, and indulge in a delicious gourmet meal, just like any other day at the office, and upon returning to his desk he resumes fixing the merge conflicts. By 1:59pm, the Engineer has resolved the problems with the code and run all the tests again to make sure everything works. All tests pass with no further changes. He asks his teammate to take just one last look at the change; his teammate does so immediately, and seeing that the conflict resolution didn’t fundamentally alter any of the prior code’s semantics, approves the code review one last time.

The Engineer submits his change. By 2:19pm, his project’s TAP build is still passing, and he’s received no angry emails from other projects complaining that his change broke their builds. He decides to go to a tech talk at 2:30pm on the false economy of the performance of fully-compiled languages vs. engineering productivity.

Microkitchens

The Engineer spends the afternoon after his tech talk doing what he loves best: Getting code changed, reviewed, and submitted, with the occasional microkitchen break. At 3:49pm, it is time for the second latte of the day.

There is inevitably a small congress of engineers from nearby projects huddled around the espresso machine and other tables. Before he arrives, there has been some chatter about the past weekend at Tahoe; about the paging performance of the memory manager in the latest Linux kernel; about how the robots that the MIT AI lab is developing now compare to the ones one of the engineers worked on during her time there; about Google employees caught trying to sneak into the Facebook cafe. As the Engineer approaches and starts making his latte, one of his colleagues from this cluster of conversationalists perks up.

“Hey man, how’s it goin’?”

“Pretty good. Got that feature I’ve been working on the past two months submitted.”

“Awesome! No rollback?”

“Nope. All seems to work as well as we could hope so far. We’ll push the next release to our staging cluster first, of course, run some real-world data through it. But other than that, it’s good to go.”

“Sweeet. Oughta look good on your perf this go-round.”

“How’s your project goin’?”

“Pretty good! We’ve got a new push next week with a few cool features I’m looking forward to seeing announced in the official blog.”

“Nice! Your whole team’s been working pretty hard on that whole redesign, right?”

“Yeah. Don’t know how we would’ve done it before TAP rolled out. It’s sure saved our bacon a number of times.”

“I hear ya. Same here. I lost track of the number of times I didn’t have to go track down a test failure, or roll it back myself. But hey, I gotta run; got a couple more changes I’m hoping to get in by the end of the day.”

“Cool, man, see ya!”

The Engineer carries his warm, full latte back to his desk, slides into his chair, and is back into his work well before he takes the first sip of his beverage. He approves four code reviews and submits five more changes that day: the bug fix reported a day earlier, which required extracting a new strategy class and testing it in isolation; two changes working towards the feature he hopes to submit for the release two weeks from now; an update to a utility script; and a rollback for a teammate’s change that broke a test he didn’t run—ah, gotta love Nooglers.

The End of the Day

Code reviews and readability standards, both incredibly powerful and virtuous instruments, are now augmented by an amazing new set of build tools that can scale with the amount of code being produced, modified, built, and tested. A fairly rapid checkout-edit-compile-test workflow is now standard. Of course, there are still points of friction and issues of scale and communication complexity, but building and testing code, the core activity of the Engineer’s day-to-day work life, are no longer drags on his productivity.

Code gets added. Perforce keeps up. Builds are quick. Tests are quick. Code gets tested. Dependency cruft…well, it still builds.

Afterword

This is a very limited perspective on Google engineering life and culture, but I hope it’s clear that before the Testing Grouplet, Testing Technology, Build Tools, the Fixit Grouplet, and Engineering Productivity really started to ramp up their efforts, there was a ton of friction directly related to the inability of the build and test tools to scale, as well as limited knowledge and practice of automated testing and design techniques, making it difficult to develop and deliver new features rapidly. Eliminating these obstacles over the course of the past five years or so—providing first-rate tools, setting clear and meaningful goals, driving discussion and sharing knowledge—has enabled Google engineering to sustain its rapid growth, expanding portfolio of projects and products, and remarkably high code quality.

These solutions were not developed by the guys at the top, or anyone anywhere near the top. They were not accomplished by any one person or elite clique. They were the products of an ad-hoc, yet close-knit community of passionate, highly-intelligent engineers in an environment that valued and supported organizational, tooling, and process experimentation and innovation.

However, relatively few folks within and outside Google are aware of all the work that was done, and the role played by the Grouplets in particular to bring about the current state of affairs. This series of posts aims to bring these issues, their solutions, and the relevant details into clear focus, for whatever it’s worth.
