This is the fourth of five posts in my “whaling” series about the high-level conceptual and cultural challenges the Testing Grouplet and its allies faced, and the knowledge and tools that eventually spread throughout Google engineering that removed the infamous “I don’t have time to test” excuse. This post describes the collection of processes employed by Google for ensuring software quality—including, but not limited to, automated testing.
The first post in this series focused on high-level cultural challenges to the adoption of automated developer testing that emerged as a result of day-to-day development reality. The second post in this series focused on the fundamental object-oriented programming issues which formed the core of most of Google’s testability challenges, and their solutions. The third post in this series covered the basics of how automated tests should, and should not, be written. The final post will discuss the specific tools the Testing Grouplet, Testing Tech, Build Tools and others developed to improve development and testing efficiency and effectiveness.
This post may not contain a complete inventory of processes and roles people play within Google to ensure overall software development and business productivity; for example, I don’t discuss the topics of hardware design and procurement, network security, facilities management, or cafe and microkitchen services at all. Still, I’ve done my best to highlight those processes and roles that, from my viewpoint, most directly impact the productivity of the development process and the quality, reliability, and performance of Google’s software systems and products.
Fact-checks and embellishments from Googlers present and past welcome via email or Google+ comments, as always.
Bug and Feature Tracking
Continuous Builds and Code Coverage
Monitoring and Alerts
Site Reliability Engineering
One Rough Beast
A House Divided
Post Mortems and Retrospectives
Again drawing on my military-town roots, a useful concept from United States Air Force Colonel John Boyd: the OODA Loop, short for “Observe, Orient, Decide, Act”.1 This is specifically a model for the decision-making process in the heat of combat—a life-or-death situation far removed from the cubicle farm, but I find it generalizes well to the process of software development in particular, and the business cycle in general. The idea is that one makes use of available tools and information to assess the present situation, make decisions, and execute—only to repeat the process continuously.2
A more common example of the OODA Loop at work is the act of driving. One is constantly observing external conditions via the windshield, and internal conditions via the speedometer, the presence or absence of engine lights, and various gauges—fuel, RPMs, engine temperature, and the like. Depending on all these inputs, we turn the steering wheel slightly this way, or that; press on the gas a little more, or let up a bit; make constant small adjustments to keep the vehicle moving more or less straight-ahead most of the time, stepping on the brakes and making hard turns relatively infrequently. And, if we aren’t jerks, we use our turn signals to help other drivers make appropriate adjustments in the context of their own OODA Loops.
To me, the value of a software process seems more apparent in the OODA Loop context than in any other. Impact on the bottom line? Hard to draw a straight line to it and measure. Visible improvement in code quality? Though such improvement is tangible, and critical for long-term software development productivity, it’s far from objective. But in terms of overall business productivity, early detection of defects, and rapid rectification of defects discovered later and preventing their recurrence (without introducing a slew of new defects in the process), a good battery of regular development practices proves invaluable in tightening the OODA Loop and staying nimble. From a single, lone-wolf developer to a worldwide, distributed engineering enterprise, the right kind and right amount of process at the right time can make the difference between getting things done and getting bogged down by excessive procedures or excessive defects.
No single practice is sufficient, but given the right balance of processes, each one is necessary to achieve the optimum balance to keep the OODA Loop clamped down to its minimum possible circumference for a given development situation. Most of what I’ve written about has been automated developer testing at Google, as that tool was missing from the balance and it was starting to show. For completeness’s sake, I’ll now talk a little about all of the other necessary practices that Google instituted before widespread adoption of automated testing, and which it still can’t do without to this day.
Of course, a lot of the processes I’m about to talk about sound like a lot of overhead, and overkill for a great many software projects. Indeed, it is all overhead; the point, though, is that it’s often necessary overhead, the cost of doing business in a certain way, to ensure that the individual actors in this environment are able to get the information they need, and to transmit their own information to others who need it in as clear and efficient a manner as possible. None of this overhead is for its own sake, for the purpose of getting warm fuzzies by checking off items on a finite list of issues and responsibilities.
There is, of course, the risk that any of these precautionary measures can be overdone. Yes, they sometimes are. Yes, there’s probably a lot of fat that could be trimmed off of any number of the below processes as Google currently implements them. But as a whole, Google is known for running a pretty tight and reliable ship, and erring slightly on the side of precaution hasn’t hurt its business—in fact, I argue that this combination of practices (including, but not limited to, automated developer testing) is precisely what allows Google to continue to grow and experiment and scale without endangering user trust and the revenue stream.
None of this, not automated developer testing, nor any of the other practices I describe below, should be interpreted as a prescription for a development team or software company, to be followed to the letter. But I do aim to highlight many of the unsung practices that have made Google engineering as successful as it’s been, for the sake of imparting knowledge and inspiration to others who may be facing issues with communication, efficiency, productivity, and code or product quality. Take away what you feel would be helpful; leave the rest.
This seems almost too obvious to mention, but having a system for managing changes to a software system and browsing their history is a must for any significant development effort, even for a single developer. Google’s use of Perforce for source control management is well-documented; it keeps the entire history of Google’s (mainly) single source repository model in a nice straight line, with individual changes pegged to strictly increasing changelist numbers, which are tied to code review threads (discussed below), feature/bug issues (also discussed below), and test results (to be discussed in the next post).3 This history goes all the way back to the very beginning of the company—that’s not strictly useful in a day-to-day context, but it’s pretty useful in diagnosing issues, gaining insight into design and implementation decisions, and doing research into why a particular piece of code evolved the way it did in the hopes of repeating the successes and avoiding the mistakes of the past.
Of course, Google has an internal tool for easily navigating the Perforce repository via a web browser interface. This is critical in navigating through the latest version of the code, in addition to navigating through its history, and having open discussions about it—or just satisfying one’s natural curiosity.
A happy development from a couple of years before I left was the marriage of Perforce with git, whereby an internal tool allowed one to manage a personal git repository based on a Perforce changelist. It had a bit of a learning curve, but once I got used to it, it was invaluable in setting up a series of related changes in parallel, making progress on the entire series while waiting for the code reviews to finish one-at-a-time. With the first code review done, the next flies out immediately. One could manage the same thing with multiple Perforce clients and patching between them, but using git made it so much easier, and put less load on the Perforce server.
This marriage of the two systems was bliss, as far as I was concerned. There’s no way git could scale to the size of Google’s Perforce repository, and though there might be a way of effectively managing a distributed repository internally, having increasing changelist numbers to refer to made it so easy to grasp the relative timeline of changes; to check if your working copy was the latest-and-greatest or out-of-date; and to see if your binaries had a critical feature, bug, or fix. Yet at the same time, just for tracking your own changes, git unlocked you from the strict sequential progress imposed by the Perforce model.4 When your own personal OODA Loop tightens, that of your entire team does, too. And when that happens, the entire company benefits.
Code reviews aid code quality immensely, for several reasons. We’re likely to do a better job the first time if we know someone else will see the code before we check it in. Reviewing each other’s code cultivates knowledge sharing across a team or company, enabling new members to get up to speed quickly (thanks to the implied mentoring) and reducing the risk inherent in a single person being familiar with the bulk of a feature’s or a project’s knowledge. Given systems such as Google’s Mondrian (or Guido van Rossum’s open-source version, Rietveld), code reviews leave an audit trail for the logical and physical history of the code, which can be extraordinarily helpful for reviewing design decisions and diagnosing bugs, or for writing post-mortems (root cause analysis of serious system failures, described below).
Granted, code reviews introduce a delay into the development process that would seem to fight against OODA Loop optimization. But that’s looking at things short-term and small-scale. Code reviews are about knowledge transfer at least as much as they are quality assurance, and encouraging rapid and widespread knowledge transfer across a team or company encourages long-term, large-scale OODA Loop optimization since more individuals become capable of making deep changes to the code. That, and having a documented history means that making changes to old code, either to add new features or fix bugs, can take dramatically less time, since the engineer tasked to make the change can come up-to-speed quickly regarding particular design and implementation decisions that helped shape the existing code or system, before directly engaging any other code authors/owners in the process.
Many Agile software development enthusiasts prefer to practice pair programming, whereby there are always two programmers working on the same piece of code at the same time, to achieve many of the same benefits. I’ve tried pair programming, and liked it a lot, but haven’t been a steady practitioner or advocate, and I don’t think it’s a complete substitute for documented code reviews. The combination of the two, however, can be very powerful, and some Google teams indeed do both. As for me, just sitting in close proximity with my teammates, where we can easily and frequently turn around to ask questions and stare at code together when we need to, has proved beneficial enough.5
If pair programming is your thing, though, awesome! Do it! But I’d suggest giving documented code reviews a try, too, if you ever reach a point where memories about important design decisions become fuzzy, and that begins to hurt your quality and/or productivity.
Coding standards help short-circuit a lot of unproductive debate and ensure that everyone’s eyes are tuned to the same idioms, which further streamlines development and discussion. A number of Google’s own language-specific coding standards are openly available.6 Herb Sutter and Andrei Alexandrescu also published a very helpful C++-specific book in 2004, C++ Coding Standards: 101 Rules, Guidelines, and Best Practices.
Within Google, the coding style guides and the code review process go hand-in-hand, and have existed within the company from nearly the very beginning. I’m willing to hypothesize that, absent widespread automated testing, style guides and code reviews were the elements of Google engineering culture that allowed its single source code repository-based, multiple-site development model to scale as much as it could before automated testing became widespread, as described in Coding and Testing at Google, 2006 vs. 2011. Even with the advent of widespread automated testing, Google engineering without style guide standards and code reviews today would be a disaster.
The language-specific style guides are maintained by a handful of experts in each language. Before a Google engineer can submit changes to code written in a specific language, that engineer must either have passed his/her “readability review” for that language, or have his/her changes approved by someone who already has “readability” in that language. The readability reviews are managed and executed by volunteer members of the Readability Grouplet: as I’ve said before, one of the most hard-working, immensely valuable, underrecognized and underappreciated teams in the whole company.7
In the case of the Google C++ style guide, there are a couple of details that folks might find peculiar. For example, only const references are allowed; variables passed as parameters to functions which will modify those variables must be passed as pointers. There are also strict rules about function comments that require memory management details be spelled out. Features such as these are Google’s way of expressing memory access and ownership semantics via convention, since the language has no direct support for the full range of such semantic expression. Thanks to everyone being indoctrinated into these (and other) conventions by the Readability Review process, as well as by other engineers who’ve been around for a while and are very used to them, the standard style guide works surprisingly well to address the management of these semantic complexities.
Also, the Google C++ style guide forbids the use of exceptions; having absorbed the Gospel of Herb Sutter’s Exceptional C++ series, and applying it with great success at Northrop Grumman Mission Systems, I was exceptionally sad to discover this fact when I joined Google in 2005.8 However, much of Google’s core infrastructure software was written shortly after 1998, when the first C++ standard was just established but not widely supported. As a result, as jwz famously pointed out in Peter Seibel’s Coders At Work, at the time, it was difficult for anyone to “ever agree on which ten percent of the language is safe to use”. Add to that the fact that the early Google engineers were comfortable with error-handling styles absent exceptions, as well as with programming styles that weren’t dependent on other C++ language features, and it’s easy to see why Google engineers made the decisions they did.
With exceptions in particular, you can’t just go back over the existing codebase and sprinkle them in in the same iterative fashion in which one might extract new classes from existing ones, so there is reasonable justification for their continued ban. Plus, just like in the case of memory access and management semantics expressed purely via coding convention, these other error-handling strategies are as natural to Google C++ programmers as are the keystroke commands of their chosen code editor.
The real punchline is this: These company-wide style guides/coding standards not only meant that individual developers and teams were more productive, but that individual developers could read, understand, extend, and fix code in the same language but which belonged to other products/systems. This meant that anyone in the company could not only trace through code belonging to other teams to help diagnose issues, but could even submit their own changes to this foreign code with a minimum of friction, meaning that improvements to common infrastructure were relatively easy to make and could be distributed to users of that infrastructure very quickly.9 Plus, when an engineer changes to a new team that works in the same language, there’s no overhead of relearning an entirely new set of idiomatic conventions (product domain-specific idioms and patterns notwithstanding).
Even if your company isn’t Google-sized, coding standards and code reviews are excellent habits to adopt, in addition to automated testing, if you feel your team is getting bogged down in unproductive debates or getting bitten repeatedly by classes of bugs that could be avoided by consistent use of language features and idioms.
Bug and Feature Tracking
Bug and feature tracking is necessary on some level for nearly all projects, though there’s definitely a point at which such tracking needs to be transferred from a single person’s head to some persistent, shareable format; and then from a basic tool to a scalable, distributed solution that cross-references issues with source control references, individual developers and teams, and other issue instances within the system. Suffice it to say, Google has such an internal system of the lattermost breed.
Programmers universally highlight the critical role good documentation plays in the development process. They’re also generally horrible at providing sufficient documentation for their own programs, tools, and systems. Any medium that can lower the bar for programmers to write their own documentation and keep it current—or to update the documentation of others as necessary—is a win for knowledge transfer and productivity.
Early on, Google set up a company-wide wiki—the same sort of technology from which Wikipedia gets its name—to encourage lightweight document authorship and maintenance. Wiki pages are written using a plaintext-based shorthand, rather than raw HTML, which the wiki server then translates into proper HTML for display in a web browser. This allows programmers to write more thorough documents with less effort, which benefits both the audience for the documents as well as the author him/herself, as centrally-accessible documentation scales far better than face-to-face or email-based questions and answers.10
Towards the end of my time at Google, Google Sites began to be preferred for internal documentation over the wiki, as part of the overall cultural emphasis on dogfooding. It employs word processor-like formatting rather than plaintext shorthand a la wiki, which perhaps some folks find more pleasant or natural. Personally, I lament the declining usage of the wiki, but so be it.
For many years, Google prescribed that “Design Docs” be written for significant systems or components in anticipation of their implementation, such that the community of engineers could become familiar with the intent of such a component/system and provide helpful feedback. Over time, this model proved somewhat unwieldy, and the Documentation Grouplet worked to establish new standards that encouraged lighter-weight design review processes. Still, in my experience, most design discussions made use of whiteboards and wiki/Sites pages rather than formal design documents. Regardless, the process of writing down and openly reviewing software designs is eternally valuable for clarifying thinking and working through potential issues, no matter how formal or informal the actual process.
Speaking of the Documentation Grouplet, they also worked to improve internal documentation via promotions such as the “Doc Fixit in a Box”, which is a documentation-focused, team-sized fixlet intended to help a team easily adopt good documentation practices and improve existing documentation very quickly. They also developed a technology known as “Codewalks”, a system whereby a snippet of code is associated with a block of documentation for illustrative purposes. There are actually several Codewalks available on the Go language documentation page.
Somewhat naturally, the Documentation Grouplet is comprised of a large number of technical writers. Trust me, these guys and gals are worth their weight in gold. Some of the largest, most complex internal systems and interfaces at Google would be nearly inscrutable without their expertise, and I personally put tech writers Jessica Tomechak and Andy Watson Orion on a pedestal for their critical role in the success of several large Fixits I organized.
Continuous Builds and Code Coverage
A significant, if not the most significant, portion of the value of automated tests comes from running them for every change to the code. The purpose of a continuous build (aka “continuous integration”) system is to make sure every change submitted to source control is tested, relieving the burden from individual engineers of running all the tests all the time, and to catch integration problems long before a Release Engineer later runs the test suite as a final check before pushing a new build to production.
At the same time, tests are most useful if they actually exercise the majority of the production code, each test at the right level of focus for the behavior it aims to validate. An always-passing test suite that only covers 20% of the code is not so helpful; neither is a test suite that covers 80% of the code, primarily through very high-level smoke tests that do not tickle the dark corners of the application, or gloss over or hide what would be legitimate failures.
Ideally, the most value is gained when only those tests affected by a particular code change are run, i.e. those tests which actually execute the changed code. This helps ensure that the OODA Loop remains as tight as theoretically possible from an automated testing standpoint. This is what Blaze eventually accomplished, and what the Blaze-based Test Automation Platform achieved on a company-wide scale—eventually supplanting the extremely popular Chris/Jay Continuous Build System.
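To make the idea of running only the affected tests concrete, here is a minimal sketch. It is not how Blaze or the Test Automation Platform actually work, since I'm not describing their internals here; it simply assumes a build graph in which every target lists its direct dependencies, and it approximates "tests which execute the changed code" as "tests which transitively depend on a changed file". All target and file names are hypothetical.

```python
from collections import defaultdict, deque


def affected_tests(deps, changed_files, tests):
    """Return the sorted tests that transitively depend on any changed file.

    deps maps each build target to the targets or files it depends on directly.
    """
    # Invert the graph: for each node, which targets depend on it directly?
    dependents = defaultdict(set)
    for target, direct_deps in deps.items():
        for dep in direct_deps:
            dependents[dep].add(target)

    # Breadth-first walk from the changed files up through their dependents.
    seen = set(changed_files)
    queue = deque(changed_files)
    while queue:
        node = queue.popleft()
        for parent in dependents[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)

    return sorted(t for t in tests if t in seen)


# Hypothetical build graph: three libraries, each with its own test.
DEPS = {
    "date_parser": ["date_parser.cc"],
    "indexer": ["indexer.cc", "date_parser"],
    "url_util": ["url_util.cc"],
    "date_parser_test": ["date_parser_test.cc", "date_parser"],
    "indexer_test": ["indexer_test.cc", "indexer"],
    "url_util_test": ["url_util_test.cc", "url_util"],
}
TESTS = {"date_parser_test", "indexer_test", "url_util_test"}
```

In this toy graph, a change to date_parser.cc selects date_parser_test and indexer_test but leaves url_util_test alone, which is the whole point: the feedback loop only pays for the tests that could possibly be affected.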
For large-scale production systems, one would be remiss to not have a separate environment for running new release candidate binaries for a time before promoting them to production, where they will serve live traffic or process data on which live traffic depends. Smaller systems, or systems which don’t have a lot of users who can easily take their business and advertising dollars elsewhere, may not need such an expensive, heavyweight middle-ground between development and production. But there are, admittedly, some issues that no amount of automated developer testing will easily or efficiently catch—large-scale performance and memory issues among them—and staging environments are the last line of defense from a development perspective for shaking bugs out of a system before users get a hold of it.
Such environments are also critical for performing large-scale migrations between internal systems or data formats, to not only look for bugs and performance impact, but also to analyze and explain any differences before committing such updates to production. I was once tasked with running a large set of documents through a deprecated component and its replacement, which Adam Sawyer and I worked on together, to identify and explain any differences between the two. There were a few, but a couple of them took some time to track down and explain. Once that was done, however, we decided all the differences were acceptable, and went forward with the deployment. But just pushing ahead with it absent a staging run could have potentially been suicidal from a business standpoint, should the resulting websearch index prove inconsistent, impacting search quality, and likely digging into advertising revenue.
Release Engineering deals with the management of production binary deployments. It involves closely tracking what versions of a binary are released, what features and bugs are present in those binaries, verifying new binaries for upcoming releases by running all automated tests and ensuring that they all pass, and actually performing the “push” of new binaries to the production data centers—and rolling them back should problems arise.
Google has a team of dedicated Release Engineers, but has far, far more projects that require Release Engineering services than there are Release Engineers to service them. Consequently, Google Rel Eng has adopted a largely “self-service” model, whereby they develop and maintain standard internal tools and documentation, which individual projects can then use to manage their own releases. Of course, the Rel Eng team is available to respond to questions and issues any team may have. Effectively, in most cases, the developers of a given system are often their own Release Engineers. This model scales and works surprisingly well. And for me, personally, it was always such a rush to have a hand in pushing new updates to production Google websearch services—taking responsibility for shepherding a new release was actually kinda fun!
Monitoring and Alerts
Google is famous for its extreme reliance on massive quantities of data when making business decisions. That same reliance extends, naturally, to data center operations; really, it’s possible that the reliance on operational data preceded the reliance on business data, though I don’t know for sure. The point being, to maintain a healthy fleet of servers, one must take steps to write hooks in those servers to publish data revealing critical aspects of their behavior: amount of CPU, memory, disk space, and network bandwidth used; the number of user requests handled per second, and the average latency of those requests; the number of errors encountered, classified by each type of error; etc.
This data is then used for two critical applications: monitoring and alerts. Monitoring involves processing this data and producing graphs, charts, and tables that are human-friendly and reveal trends in overall system behavior, preferably with some degree of history. Alerts are just what they sound like: Triggers that fire under certain circumstances that bring critical issues to the attention of whoever is on call to respond to them. Monitoring can be used to observe the behavior of a system in a staging area, or in production shortly after a new release, to identify potential issues even before alerts fire. When alerts do fire, monitoring dashboards are critical to gaining a bigger-picture insight into the state of the overall system, and to determining whether or not an alert-firing issue has been satisfactorily resolved.
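As a rough illustration of the split between exported data, monitoring, and alerts, here is a minimal sketch. It does not reflect Google's internal monitoring systems; the class, the field names, the alert names, and the thresholds are all made up for the example.

```python
from collections import Counter


class ServerStats:
    """Counters a server might export for monitoring (all names hypothetical)."""

    def __init__(self):
        self.requests = 0
        self.total_latency_ms = 0.0
        self.errors = Counter()  # error type -> count

    def record_request(self, latency_ms, error=None):
        """Record one handled request, plus its error type if it failed."""
        self.requests += 1
        self.total_latency_ms += latency_ms
        if error is not None:
            self.errors[error] += 1

    def snapshot(self):
        """Return the derived values a dashboard would graph over time."""
        n = self.requests
        return {
            "requests": n,
            "avg_latency_ms": self.total_latency_ms / n if n else 0.0,
            "error_rate": sum(self.errors.values()) / n if n else 0.0,
            "errors_by_type": dict(self.errors),
        }


def firing_alerts(snapshot, max_error_rate=0.01, max_avg_latency_ms=200.0):
    """Return the names of alerts whose (assumed) thresholds are exceeded."""
    alerts = []
    if snapshot["error_rate"] > max_error_rate:
        alerts.append("HighErrorRate")
    if snapshot["avg_latency_ms"] > max_avg_latency_ms:
        alerts.append("HighLatency")
    return alerts
```

The snapshot is what feeds the graphs; the alert function is a separate, much smaller consumer of the same data. Keeping the two apart is what lets a dashboard show trouble brewing before any threshold is crossed.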
Observe, Orient, Decide, Act. Pretty hard to decide what to do about a production problem without useful data to help diagnose the issue—and identify when it has been satisfactorily resolved.
Monitoring and alerts may seem like overkill for smaller services, and probably are for early demos and prototypes. I’m sure the first prototype of the Ford Model T lacked a speedometer and oil pressure light, as well as fuel, RPM, and engine temperature gauges. Leaving them out at that stage of development was probably the right thing to do. Leaving them out in perpetuity, not so much.
Site Reliability Engineering
Site Reliability Engineering (SRE) is the team that performs the system administration work necessary to ensure that production services continue to operate in a smooth and efficient manner. They’re the ones that carry the pagers and are on-call at all hours to respond to production emergencies. They are also amongst the most gregarious and fun-loving teams in the entire company.
There are more SREs than Release Engineers, though still not enough to go around to every team. To ease the load—and potentially recruit new members—the SRE team instituted a program whereby existing software engineers within Google could take a tour-of-duty with SRE, with a modest boost in incentives, to learn the ins-and-outs of the production environment, the software issues that lead to production problems, and the software practices that ensure smooth and scalable operation in production. Many engineers who took the tour actually converted to full-time SREs.
On websearch, we were fortunate enough to have dedicated SREs, and the dynamic between development and SRE was healthy and enlightening; we wanted to push new features, and they wanted to make sure such new features didn’t explode in production, and that existing issues would eventually be addressed to prevent production issues—and those unpleasant 3 a.m. pager alerts. They pressed us to collect the data necessary to monitor system health, and helped us to develop informative graphical dashboards and meaningful, stable alerts based on that data. They were always present in our weekly planning meetings, to discuss any issues that arose week-to-week, and to plan tasks to address those issues and other long-term system health concerns. They were, as ’twere, the mirror up to nature that we needed to make sure our development—and the users’ satisfaction, and the company’s revenue—didn’t stall out due to endless production fires.
Of course, we were fortunate; not every team has SREs assigned. Those folks have to carry their own pagers.11
Sometimes, despite automated testing, thorough code reviews, high code coverage, continuous integration, ample documentation, staging releases, a standard release process, and SRE support, bugs still happen. That doesn’t necessarily mean the other processes failed, though improvement is always possible. It usually just means that software is hard, and you have to be prepared to handle the problems that slip through despite preparation and diligence.
The upside is, because of all those other mechanisms and precautions and procedures, relatively few show-stopper bugs filter through to production, and fixing those bugs involves relatively little risk, fear, and complication, and is usually quite quick. Usually. Let me offer a pair of bugs from my personal experience on websearch—one user-visible, and one not—to illustrate how such bugs can filter through, and how we deal with them.
One Rough Beast
And what rough beast, its hour come round at last,
Slouches towards Bethlehem to be born?
—William Butler Yeats, The Second Coming
First, the user-invisible bug. Occasionally a document shows up on the web that the Google parser can’t handle. This is an ultra-microscopically small percentage of the documents on the web, but when one is encountered, it is immediately blacklisted to prevent production services from processing it and the system is immediately updated to handle all documents exhibiting the offending quality from that point forward, with tests to prevent regressions. Still, it happens sometimes.
In this case, there was a document that caused whatever server unlucky enough to process it to crash. Alerts fired. Once that document was identified, it was temporarily blacklisted, and I brought up my own server in our staging data center to process it. Before sending the document through, I attached a debugger (gdb) to the remote process so that I could observe the crash right where it happened (as opposed to milling through a core dump with dozens, if not hundreds, of threads after-the-fact). I sent the document through, and boom! The debugger stopped with an excessively large series of stack frames pointing to the offending code: A date-parsing function about as old as the company itself that calls itself recursively.
What happened was, this particular document contained a very long string of identical characters which triggered the recursion. The length of this string was just over the threshold such that the thread trying to parse a date out of the page blew its stack, causing the whole server to crash. In the entire history of the World Wide Web, nothing like this had ever been seen before, apparently. I wrote a very small unit test to reproduce this behavior, and got it to pass with a very small, easy change, effectively removing the length of such strings as a factor when attempting to parse dates out of a document.
All the existing tests passed, the change was approved and submitted, nobody reported any build breakages, a new release was pushed with the fix, and the document was taken off the blacklist. Unit testing in advance didn’t anticipate this, but it took on the order of 13 or 14 years for this bug to manifest! Once it did, however, finding, isolating, diagnosing, and fixing the bug was relatively straightforward and quick given all the other processes and tools at our disposal, and a unit test could easily reproduce the bug and prevent its regression in the future.
Now, no matter whether there are hundreds or billions of such identical characters, this particular bug will no longer appear. It’s dead, dead, dead.
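To make the failure mode concrete, here is a minimal Python sketch of the same shape of bug and fix. It is not the actual production code: the function names, the character-skipping logic, and the test input are all assumptions made up for illustration. The recursive helper uses one stack frame per repeated character, so its depth depends on the input; the iterative one does the same work in a loop, so the length of the run stops mattering.

```python
def skip_run_recursive(text: str, ch: str, pos: int = 0) -> int:
    """Return the index of the first character at or after pos that is not ch.

    Recurses once per matching character, so stack depth grows with the
    length of the run. Hypothetical stand-in for the old parsing helper.
    """
    if pos < len(text) and text[pos] == ch:
        return skip_run_recursive(text, ch, pos + 1)
    return pos


def skip_run_iterative(text: str, ch: str, pos: int = 0) -> int:
    """Same result as skip_run_recursive, but with constant stack depth."""
    while pos < len(text) and text[pos] == ch:
        pos += 1
    return pos


def test_long_run_of_identical_characters() -> None:
    """Regression test: a very long run must not exhaust the stack."""
    document = "0" * 100_000 + "x"  # far deeper than Python's default recursion limit
    assert skip_run_iterative(document, "0") == 100_000


if __name__ == "__main__":
    test_long_run_of_identical_characters()
```

In Python the recursive version fails with a RecursionError rather than taking down the whole process, so the sketch mirrors the shape of the bug, not its severity.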
A House Divided
And now, the most exciting bug I personally encountered on websearch: the user-visible bug related to the use of the <link rel="canonical"> tag whereby the “canonical” URL specified in the tag points to the same URL as the page itself. Let me be clear up-front: Google policy is that this is not only a valid use, but an encouraged use of this feature. However, internally, there was a very slight breakdown in communication between the team that maintains the link-rel-canonical feature and our indexing team, which integrates the code implementing this feature into our system.
After some time, the link-rel-canonical-to-self URL began to be passed back from the canonical parsing code, whereas it previously wasn’t. The precise semantics behind this feature are rather subtle, and it’s not my place to discuss them here, but what happened was that my team’s system, the system responsible for indexing all the newest content discovered on the web on a continuous basis, as well as for indexing popular “hub” sites that frequently link to new web pages, would refuse to process documents whose link-rel-canonical URL matched their own. Our code explicitly guarded against this condition, which we believed at the time could “never happen”, because the ultimate effects of having the link-rel-canonical URL match the original URL were unknown from our point of view; much of our system integrated indexing code from many other teams with which we were not and could not be completely familiar. So, when we pushed a new release with this new behavior, we suddenly stopped indexing many blogs and other popular sites, such as CNN.
Nobody actually noticed this until Matt McGee’s post on the popular websearch watchdog blog Search Engine Land on Friday, October 22, 2010: Is Google Broken? Sites Big & Small Seeing Indexing Problems. Though our team, like most Google teams, avoids pushing new releases on Fridays like the plague for the sake of our own sanity and the goodwill of our SREs, we were getting called on this error in a public way, and people were becoming wary of using a Google feature in a fashion we actually encouraged. We had to fix it right then.
So I actually wrote a two-line “fix” to drop the identical link-rel-canonical URL in our system, to match previous expectations and behavior. We pushed a new release with that fix late Friday/early Saturday, and things went back to normal. Eventually, we experimented with removing the restriction in our system altogether, after conferring with the link-rel-canonical folks and running a new release in our staging area to confirm no obvious ill effects. We eventually pushed this update to production, and all has been happily-ever-after.
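For readers who want to picture that stopgap, here is a minimal Python sketch of the idea. It is an illustration only, not the real change: the function name, the use of None for "no canonical URL", and the plain string comparison are all assumptions. Real URL comparison would likely involve some normalization that is left out here.

```python
from typing import Optional


def effective_canonical(page_url: str, canonical_url: Optional[str]) -> Optional[str]:
    """Return the canonical URL to pass downstream, or None if there is none.

    A canonical URL identical to the page's own URL is dropped, so code
    further down the pipeline sees the same input it saw before
    self-referencing canonicals started being passed back.
    """
    if canonical_url is None or canonical_url == page_url:
        return None
    return canonical_url
```

With a filter like this in front of it, a downstream guard that rejects self-referencing canonicals never fires, which matches the "back to previous behavior" intent of the quick fix. Removing the guard itself is the later, more careful change.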
The point? No process is infallible. Bugs will get through. But how big? How many? And for how long? Sufficient tools and processes greatly restrict the number of defects that filter through to production, and under those circumstances, when bugs and emergencies do arise, people have the time and freedom to quickly and accurately diagnose and fix such issues, without worrying about fifty other issues at once, or worrying about introducing five hundred more with the new “fix”. You have a nice, tight OODA Loop whereby the problem gets fixed quickly, thoroughly, and with minimal negative impact, as opposed to an agonizing death spiral whereby disaster after disaster continues to compound.
Post Mortems and Retrospectives
When bugs trickle through that do have a significant negative impact, particularly user-visible ones, it’s common in the software industry to produce what’s called a “post-mortem”: a root-cause-analysis report that identifies the timeline of the introduction of a defect and its eventual resolution, as well as significant factors that produced both the defect and its resolution. The goal is to identify concrete actions to avoid a repeat of such defects, be that in the form of processes or the development and/or application of appropriate tools. With so many processes and systems in place at Google to closely account for changes to code and systems in a highly-visible, highly-accessible fashion, Google has a really good track record of quickly diagnosing and responding to mistakes and not repeating them. Even if an otherwise-tight OODA Loop failed to prevent a negative outcome in one instance, one can still take feedback from the experience to better orient everyone in the future, leading to informed decisions and actions that ensure the lessons of the past are well-applied.
In the Agile tradition, the term “post-mortem” is replaced by “retrospective”, and retrospectives are held on a regular basis to identify the root causes of both successes and failures. Google is not 100% agile in this regard—some teams are, but far from all—but Google engineering has a very strong culture of open (internal) communication and knowledge-sharing that performs the same function to a large extent. On my particular websearch team, regular review and discussion of production issues, both positive and negative, no matter how major or minor, with members of the SRE team present, took place at the beginning of every weekly meeting. Consequently, full-blown post-mortems were rarely necessary, and we effectively operated in the spirit of regular Agile retrospectives, if not following Agile procedures to the letter; our OODA Loop felt just as tight.
Outside of the normal process of day-to-day development and production management, Google also has a long-standing tradition of Fixits: Days—or, in some cases, weeks—set aside for all engineers on a team, in an office, in a department, or in the entire company to address issues that are deemed “important, but not urgent”. Themes include: fixing/writing unit tests; rolling out new and improved internal development tools; improving documentation; working through a backlog of “user-happiness” issues; working through a backlog of “wish list” issues for a particular team. Participation is usually voluntary, and the guidance and resources needed to run a successful fixit are provided by the Fixit Grouplet.
Fixits have multiple benefits beyond that of just taking care of neglected business or migrating everyone to new tools. Among them:
- A Fixit event helps to punctuate an ongoing effort with concrete objectives, deadlines, and milestones, generating energy and lifting morale in the process.
- A Fixit often focuses a critical mass of attention on an issue, rapidly building expertise and disseminating knowledge across a team or a company, and provides the opportunity to make broad assistance available to anyone who needs it.
- The network effect of fixit organization, anticipation, and participation enables big changes to happen quickly, producing quantum leaps in productivity by enabling the team or the company to turn on a dime towards a new and exciting direction.
- Large-scale fixits help maintain, distribute, and evolve the overall Google engineering culture, as they provide one of the few opportunities for all engineers in all offices to focus on a single, well-defined goal for a period of time.
- Despite all the hard work and nervous exhaustion on behalf of the Fixit organizers, fixits are just plain, good ol’ fashioned fun. The sense of being all-in-it-together was infectious, and the sense of play across teams and across the company was beautiful to experience.
- There’s also a great deal of pride one can take in effecting a big, concrete change in a company that has changed the world—and continues to do so—without waiting for permission or an executive order before taking bold, decisive action that leaves the entire company better off.
After my next and final “whaling” post, I’ll get back to storytelling, relating the tale of the multiple fixits I organized or in which I was otherwise involved to some degree. In fact, this whole “whaling” series was originally a digression from the Fixit story, so that certain objectives and concepts that I will discuss have some background I can point to for more context, without bloating the story itself with excessive details.
I know the pieces fit cuz I watched them tumble down
No fault, none to blame it doesn’t mean I don’t desire to
Point the finger, blame the other, watch the temple topple over.
To bring the pieces back together, rediscover communication
Beyond all other processes and tools and systems for providing feedback, fostering open communication around concrete, shared goals is critical, and a successful team or company needs tools to support that, too. The right tools really depend on an individual team’s dynamic, but I’ve experienced success in the past with a set of “project spreadsheets” with specific tasks and proposed deadlines, reviewed in weekly meetings.12 You could also try the information radiator (e.g. whiteboard with sticky notes) + daily standup route, which is more in the agile vein. You might also benefit from team-wide “braindumps” where team members take turns briefing the rest of the team on specific features they work on, or systems they’re expert in using, or development tools and techniques they’ve found helpful.13
A lack of buy-in on shared goals and a lack of open, healthy communication across the team is the number one killer of productivity in my view, no matter what other high-quality processes are in place. That often leads to cramped deadlines and difficult-to-understand-and-or-test components, in turn leading to production issues and a death march situation. If communication is an issue for your team, I’d advise solving that problem absolutely first before worrying about any tooling-related issues…
…unless, of course, research suggests that a lack of adequate tooling is part of the root cause of any ineffective communication. That’s what happened at Google, when the Testing Grouplet pushed Testing on the Toilet and Test Certified into the world and learned that folks felt that they “didn’t have time to test”. Rolling out new build and test infrastructure, with improvements in code editors and unit testing frameworks, made it easier to do the right thing, lowering the resistance to adoption of automated testing practices—which in turn contributed to Google’s unprecedented degree of code quality and system reliability in the face of an ever-growing engineering organization and an ever-expanding portfolio of complex, highly-integrated products and services.
And now, finally, we’re ready for the last “whaling” post, where I talk about a few of these miracle-tools, these shiny brass (if not silver) bullets that really did take Google’s automated testing discipline and productivity up a quantum notch.
1Hat tip to Sean Bergeron for getting me to think about this again via his recent blog posts. I’d independently stumbled upon this while stationed as a Test Mercenary in New York in 2008, a few months before retiring from the testing scene and joining websearch.
2Sounds kinda “agile”, doesn’t it? Boyd reportedly developed the original concept during the Korean War in the early 1950s.
3ThoughtWorker Paul Hammant, a fellow Test Mercenary, thought Google was batshit crazy for developing all of its products from the trunk of a single, company-wide source repository. I always thought it was awesome that Google could make it work, though I could see both the pros and cons. In the pro column, in addition to the advantages mentioned above, this model makes it easier for everybody in the company to adhere to the same coding idioms, find and fix bugs, and easily transfer between projects without having to learn new coding standards and conventions all over again (unless switching to a project written in a different language). In the con column, it’s easy to introduce bugs company-wide that break hundreds of tests and continuous integration builds or, even worse, that get built into dozens of releases before a rollback or fix is applied; this is especially a problem for core libraries which provide common infrastructure for massive quantities of Google services. Still, I think the pros outweigh the cons, and Google developed the tools needed to allow this model to scale to thousands of developers working on massive amounts of code across dozens of engineering offices throughout the world—though they have also, in the past couple of years, developed a means of mitigating the risk of frequent changes to core infrastructure, layered on top of the single-source repository model. That, and a very small handful of projects have been split into separate source repositories.
4I was in the audience for Linus Torvalds’s presentation on git at Google. On the one hand, I thought the ideas were brilliant. On the other, I thought it was somewhat naïve that he considered the Linux kernel a “large” project relative to what Google had going; in and of itself, it is large, but the git repository model just isn’t practical for the amount of code Google has in a single repository, nor for the amount of changes Google produces within that repository. On the other other hand, I thought his great ideas didn’t necessitate him being such an asshole when talking about them. Fortunately they were good enough that his personality didn’t completely obstruct them, but it was an unnecessary hindrance.
5As an illustrative example, one time, I overheard Andrew Cunningham ask Matt Russotto a question about specifying covariant return types in C++ while I was having a completely separate conversation with Shibiao Lin. Upon finishing with Shibiao, I asked Andrew, who had primarily been a Java programmer before joining websearch, to show me the code that was causing a compile error. I pointed out that he was missing a const modifier, and suddenly, his covariant return type was acceptable to the compiler. I’d been bitten by that years earlier while working at Northrop Grumman Mission Systems.
6Huh? Where’s Java?
7Yes, there is friction with this model. The Readability Grouplet is usually short on volunteers, despite the cultural value of the Readability tradition and the increasing number of Nooglers and interns joining the company, so scheduling a readability review and seeing it through to completion often takes a longer-than-comfortable amount of time. But I don’t see this as a problem with the model, but as a problem with Google’s company-wide incentive structure, whereby volunteer Grouplet contributions are undervalued in the context of recognition and promotions, undermining the motivation for volunteers to participate in the Readability Grouplet.
8Despite the ban on C++ exceptions, I still found it very valuable to write code with exception safety in mind, since so much of it is about guaranteeing the integrity of data when operations fail prior to their completion. It made code easier to reason about in general, making it easier to test, and fellow Test Mercenary Jeffrey Yasskin remarked to me once that he felt it made multithreading issues easier to reason about, too.
9One of my proudest minor achievements was tracking down a bug in Bigtable whereby taking out a row lock would not prevent modifications to individual columns. I didn’t submit a fix, but I wrote a test to demonstrate the bug concisely, clearly, and consistently, and it was fixed soon after that.
11For more on the tools and practices involved in monitoring, alerts, and large-scale web operations in general—not from a Google-specific perspective, though most certainly a comparable one—see O’Reilly’s Web Operations book.
12I stole this procedure from Mathieu Gagne, my first manager on websearch, and applied it successfully to Fixit organization as well.
13Another idea stolen from Mathieu Gagne.