Test Mercenary (Slight Return)

I've published a new article on AutoTest Central about the Testing Grouplet's Small, Medium, and Large test size schema, and more on the Apple bug.

04 Mar 2014 - Boston
Tags: Apple, AutoTest Central, Google, Test Mercenaries, goto fail, programming, technical

I’ve published a new article on AutoTest Central about the Testing Grouplet’s Small, Medium, and Large test sizes. This is the second and last article I already had in the can from last fall, and I’m happy to have it see the light of day. Yes, most of it is lifted from the Small, Medium, Large post on this blog—as I’ve said, I’m a big advocate of repeating myself, and saying the same thing more than once—but is much abridged and even contains a hand-crafted illustration mimicing the slick, polished one I no longer have access to from the Testing Grouplet/EngEDU Noogler unit testing lecture.

But speaking of posting on AutoTest Central, in my obsessive Googling for articles related to the Apple SSL security bug this past week—because I do care about my own article’s search ranking, just a pinch, I admit—I still haven’t found any other articles that suggest, as mine did, that the same block of code cut-and-pasted six times in the same file was a problem, that it was ripe for factoring out, and that the resulting function was straightforward to test in isolation. That’s curious to me; it’s like people got stuck on the one stupid goto fail; line and started drawing conclusions without looking at the surrounding code, seeing the same algorithm copied immediately above it, and suspecting, as I did, that there was a classic code duplication problem which fixing would’ve likely avoided the bug to begin with, test or no.

What follows is an exploration of my bewilderment, documented in excruciating detail. I don’t mean to take up all your sweet time, but I’ll give it right back to you one of these days.

UPDATE: I almost never update posts after-the-fact, but I couldn’t leave well enough alone, and I produced a set of slides based on this blog entry titled Finding More than One of the Same Worm in the Apple. I’m hoping to present them at an upcoming Automated Testing Boston Meetup event, but if anyone gets the itch to pass ’em around, please do feel free!

What’s That Smell?
Goto Fixation Considered Harmful
Is It Me?
Quick Aside About GnuTLS
Le Sigh
Footnotes

What’s That Smell?

In An Extraordinary Kind of Stupid, David Auerbach¹ does a fantastic job breaking down the code such that non-coders (I imagine) can clearly see and understand the source of the error, and its severity. He even argues, which I totally buy, that the bug was likely due to an automated merge gone awry, based on this diff. But then he goes on to say in his conclusion (emphasis mine):

Preventing bugs like these is one of the biggest challenges of software engineering, and this incident should make it pretty damn clear why. A single extra line of code compromised the security of millions and millions, and no one caught it for more than a year.

Even in a rigorous environment of code reviews, automated testing, and high-quality development, bugs like this do slip through. But the sheer complexity of today’s code makes it very difficult to catch everything. Some people have bashed on the code, saying that it’s too sloppy and careless. While I’m not thrilled with all the stylistic and structural choices, I think the code is reasonably good by today’s standards. Apple wouldn’t have released the code as open source if it weren’t good, and even if they had, there would have been quite an outcry from the open-source community if they’d looked it over and found it to be garbage. The people who wrote it knew what they were doing. The bug is indeed the result of negligence, but the sad and scary thing is that, for software developers, this is not a “What were they thinking?” moment. It’s an “Oh my God, that could have been me” moment.

Of course I agree with the bulk of the above, except the “the code is reasonably good” bit. To quote from SSLVerifySignedServerKeyExchange() in the buggy version of sslKeyExchange.c:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
if(isRsa) {
  /* skip this if signing with DSA */
  dataToSign = hashes;
  dataToSignLen = SSL_SHA1_DIGEST_LEN + SSL_MD5_DIGEST_LEN;
  hashOut.data = hashes;
  hashOut.length = SSL_MD5_DIGEST_LEN;

  if ((err = ReadyHash(&SSLHashMD5, &hashCtx)) != 0)
    goto fail;
  if ((err = SSLHashMD5.update(&hashCtx, &clientRandom)) != 0)
    goto fail;
  if ((err = SSLHashMD5.update(&hashCtx, &serverRandom)) != 0)
    goto fail;
  if ((err = SSLHashMD5.update(&hashCtx, &signedParams)) != 0)
    goto fail;
  if ((err = SSLHashMD5.final(&hashCtx, &hashOut)) != 0)
    goto fail;
}
else {
  /* DSA, ECDSA - just use the SHA1 hash */
  dataToSign = &hashes[SSL_MD5_DIGEST_LEN];
  dataToSignLen = SSL_SHA1_DIGEST_LEN;
}

hashOut.data = hashes + SSL_MD5_DIGEST_LEN;
  hashOut.length = SSL_SHA1_DIGEST_LEN;
  if ((err = SSLFreeBuffer(&hashCtx)) != 0)
      goto fail;

  if ((err = ReadyHash(&SSLHashSHA1, &hashCtx)) != 0)
      goto fail;
  if ((err = SSLHashSHA1.update(&hashCtx, &clientRandom)) != 0)
      goto fail;
  if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)
      goto fail;
  if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
      goto fail;
      goto fail;
  if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
      goto fail;

It’s not that I think it’s hands down the worst I’ve ever seen, but duplicate code (i.e. lines 8-17 compared to lines 30-40 in the notorious example above) is one of the more obvious and easily-remedied code smells, and chances are if that handshake algorithm had been factored out as I demonstrated, the bug likely never would’ve happened, as there would’ve been less “code surface”, if you will, susceptible to merge errors; and had the code been isolated and tested with a tiny unit test as I suggested, the test would’ve caught the error before the merge was committed to source control. This example suggests to me a lack of automated testing discipline and the code-level design knowledge that accompanies it, not the fallability of the existing discipline which Auerbach imagines that Apple already applies. (Though he is right, testing can prove only the existence of bugs, not their absence)

So the resulting code after removing the duplication looks like this²:

if(isRsa) {
    /* skip this if signing with DSA */
    dataToSign = hashes;
    dataToSignLen = SSL_SHA1_DIGEST_LEN + SSL_MD5_DIGEST_LEN;
    hashOut.data = hashes;
    hashOut.length = SSL_MD5_DIGEST_LEN;

    if ((err = HashHandshake(&SSLHashMD5, &clientRandom, &serverRandom,
         &signedParams, &hashOut)) != 0) {
      goto fail;
    }
  }
  else {
    /* DSA, ECDSA - just use the SHA1 hash */
    dataToSign = &hashes[SSL_MD5_DIGEST_LEN];
    dataToSignLen = SSL_SHA1_DIGEST_LEN;
  }

  hashOut.data = hashes + SSL_MD5_DIGEST_LEN;
  hashOut.length = SSL_SHA1_DIGEST_LEN;
  if ((err = HashHandshake(&SSLHashSHA1, &clientRandom, &serverRandom,
       &signedParams, &hashOut)) != 0) {
    goto fail;
  }

and the extracted, almost trivially-testable HashHandshake() function looks like this:

OSStatus HashHandshake(const HashReference* hashRef,
    SSLBuffer *clientRandom, SSLBuffer *serverRandom,
    SSLBuffer *exchangeParams, SSLBuffer *hashOut) {
  SSLBuffer hashCtx;
  OSStatus err = 0;
  hashCtx.data = 0;
  if ((err = ReadyHash(hashRef, &hashCtx)) != 0)
      goto fail;
  if ((err = hashRef->update(&hashCtx, clientRandom)) != 0)
      goto fail;
  if ((err = hashRef->update(&hashCtx, serverRandom)) != 0)
      goto fail;
  if ((err = hashRef->update(&hashCtx, exchangeParams)) != 0)
      goto fail;
  err = hashRef->final(&hashCtx, hashOut);

fail:
  SSLFreeBuffer(&hashCtx);
  return err;
}

In my example unit test as published in the AutoTest Central article, unlike Landon Fuller’s proof-of-concept unit test, I didn’t think to use XCode’s built-in testing framework—reason being that I’m diving into this code after being out of the field for some time, most of that time spent working in Linux, and I leaned on my command-line-UNIX-fu to get something working the fastest way I knew how. But my example makes the point that even without the Xcode framework, testing this piece of code was eminently doable—as was refactoring it.

There is also the question, raised by the aforementioned diff, whether there change itself being committed was too large—a definite “process smell” or “practice smell”, if not specifically a “code smell”. Again, good testing practice—actually, just good development practice in general—would dictate that development updates would be committed as a large series of small changes, not a small series of large changes. But, I’m not up to speed on the context of this diff and whether it was an automated merge operation or something else, and will refrain from casting any stones at this particular issue; and it’s feasible that it could’ve been the result of merging two independent development branches, which often do produce large, messy, one-off merges. But that is a big reason to write tests around the code in the first place—to make sure big, risky branch merges don’t break anything!

Goto Fixation Considered Harmful

That said, Auerbach is right that people have bashed on the code, and I agree that pretty much all of those such articles have missed the point, often by focusing on the presence of a goto statement as a sign of unmitigated e-ville, thanks to a shallow interpretation of Dijkstra’s A Case against the GO TO Statement (more popularly known as Goto Considered Harmful). For example, Patrick Chkoreff suggests refactoring the code to avoid the use of goto, and Christian Huitema suggests either mandating the use of curly braces with if statements or resorting to nested if statements to avoid the goto.³

The point of Dijkstra’s letter was to argue for more higher-level programming language support for conditional and repeat structures (i.e. if/then, for, while, until, etc.) that rely on the value of “indices” which are “outside the programmer’s control”; and against “unbridled use of the go to statement”. Bear in mind he wrote the letter in 1968, before such structures were common—before even the advent of the C programming language! Dijkstra goes on to say (emphasis mine):

I do not claim that the clauses mentioned are exhaustive in the sense that they will satisfy all needs; but whatever clauses are suggested (e.g. abortion clauses) they should satisfy the requirement that a programmer independent coordinate system can be maintained to describe the process in a helpful and manageable way.

goto itself isn’t the real culprit here, despite the fact that a repeated, spurious goto line produced the bug, and all the “goto is evil and the reason for the bug” commentary that followed. In truth, a premature return statement would’ve had the same effect, or poorly-placed curly braces wrapping the SSLHashSHA1.final() condition inside the failure case, like so (leaving in the repeated goto fail; as a visual anchor):

  if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0) {
      goto fail;
      goto fail;
  if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
      goto fail;
  }

No, the original purpose of goto fail; is perfectly healthy and good for you, as it’s a common idiom in C for ensuring resources are released even in the event of a fatal error—i.e. the “abortion clauses” with which Dijkstra qualified his statement. The reason goto appears in this way in C code is because other languages have built-in support for the idiom, i.e. “indices…outside the programmer’s control”. For example:

C++: Destructors provide for automatic cleanup in the event of normal returns, error/early returns, and exceptions (when code is written with exception safety in mind); this is known as the RAII (Resource Acquisition Is Initialization) idiom.
Java: Provides try/catch/finally blocks—though scope-bound destructors would be nice, in place of repetitive finally blocks and nondeterministic finalizers!
Go: Provides defer, panic, and recover.
Python: Provides try/catch/finally and the with statement, which is used to implement RAII⁴

And, it must be said, whether the error had’ve been an early return statement or bad curly braces instead of the actual repeated goto, a proper, small unit test would’ve caught all three. Other tools and processes might’ve as well, but a unit test would’ve been the first—and arguably cheapest—line of defense, if for no other reason than the programmer would’ve been looking for ways to reduce duplication and isolate the logic as a testable unit before even touching the keyboard.

Is It Me?

So I guess I’m feeling a little bit like Cassandra in that I believe I see a huge clue to the origin of one of the biggest security bugs in recent memory staring everyone right in the face, in the immediate vicinity of the error—but I seem to be alone in assessing the duplicate code as a critical issue. In contrast, David Auerbach isn’t alone in neglecting to notice the duplication issue. Security guru Bruce Schneier mentions in his article Was the iOS SSL Flaw Deliberate? (again, emphasis mine):

The flaw is subtle, and hard to spot while scanning the code. It’s easy to imagine how this could have happened by error. And it would have been trivially easy for one person to add the vulnerability.

Was this done on purpose? I have no idea. But if I wanted to do something like this on purpose, this is exactly how I would do it.

While I understand his focus is security, not code quality, I’m still mildly surprised that someone with an obvious proficiency at coding could look at the context of the erroneous line and still not lean heavily towards this being a mistake that just wasn’t caught by the adequate application of design knowledge, tools, and procedures. Schneier links to Steven M. Bellovin’s Goto Fail article, who does, to his credit, assert that code coverage should’ve provided a clue, but immediately before that states (emphasis mine):

The real question, though, is why they didn’t catch the bug in the first place. It’s a truism in the business that testing can only show the presence of bugs, not the absence. No matter how much you test, you can’t possibly test for all possible combinations of inputs that can result to try to find a failure; it’s combinatorially impossible. I’m sure that Apple tested for this class of failure—talking to the wrong host—but they couldn’t and didn’t test for this in every possible variant of how the encryption takes place. The TLS protocol is exceedingly complex, with many different possibilities for how the encryption is set up; as noted, this particular flaw is only triggered if PFS is in use. There are many other possible paths through the code. Should this particular one have been tested more thoroughly? I suspect so, because it’s a different way that a connection failure could occur; not having been on the inside of Apple’s test team, though, I can’t say for sure how they decided.

While what Bellovin states is true of system testing, as I explain in both of my Small, Medium, and Large testing articles—see, that rug really does tie the room together!—the infeasibility of testing every code path at the system-level is no reason to conclude that this bug couldn’t have been tested for at a much lower level—on the contrary, it is exactly why you need a lot of smaller unit tests tickling all those corner cases! In other words, I don’t necessarily think it should’ve been a question for Apple’s test team; the developer should’ve been thinking of testing this apparently critical algorithm at the same time he/she was writing it—or copying it for the second, third, fourth, fifth, or sixth time.

Even an article as thorough as Arie van Deursen’s Learning from Apple’s #gotofail Security Bug calls out formatting, indentation standards/tools, code reviews, compiler warnings, static analysis tools, and code coverage as potential mechanisms that could’ve prevented the bug—but quotes Adam’s Langley’s Apple’s SSL/TLS Bug post in concluding that unit testing the code in question would’ve been difficult. On top of that, my fellow Test Mercenary Keith Ray, in his comments on that very article by van Deursen and his own TDD and Signed SSLVerifySignedServerKeyExchange article, advocates strongly for Test-Driven Development, and quotes Landon Fuller’s proof-of-concept unit test, but doesn’t point out the duplicate code smell and opportunity for refactoring present in the code as-is—and code like that is exactly why the Test Mercenaries were brought into existence in the first place, to point out these kind of opportunities for refactoring and improved testability in existing codebases!

What’s more, if memory serves, Keith even wrote the Testing on the Toilet article that advocated for breaking larger chunks of logic into smaller, more testable functions to avoid having a combinatorial explosion of test inputs—the very concern that Bellovin had mentioned as rendering exhaustive system-level testing infeasible.⁵

Quick Aside About GnuTLS

This morning I noticed my Mac and my Raspberry Pi both receiving updates to the GnuTLS library. It turns out that GnuTLS had a similar SSL certification verification bug. However, Chris Duckett is incorrect in asserting that in “both cases, incorrect goto calls have been the root cause of the security issues.”. A look at the CVE-2014-0092 bug report and the patch fixing it suggest a more subtle issue with reusing result variables improperly and possibly using uninitialized variables as return values. This was not the accidental inclusion of an extra goto.

I’ve not the time to dive into this, but again, where were the unit tests?

Le Sigh

Well, I’m well-acquainted with the fact that, sometimes, new insights take some time to sink in. At least, from all the reading I’ve done so far, it seems that most people are shocked that there weren’t tests, rather than claiming it would’ve been too hard and time-consuming to write them. Maybe my little seed of an idea about avoiding code duplication and refactoring it away as core aspects of the production of testable designs will one day germinate into some kind of visible insight adding substance to the “unit tests coulda-shoulda-woulda caught this” argument. I’m just a little surprised no one else has picked up on it but me, from what I can tell—or has thought it as important to discuss as myriad other aspects of the process.

Perhaps I’ll just have to keep repeating myself about this code duplication bug until somebody listens… ’Cause Lord knows I’m a voodoo child…

(And I’ll be damned, won’t you take a look at today’s featured Wikipedia article!)

Footnotes

Any relation to Dan Auerbach of The Black Keys? ↩
The following two code snippets first appeared in my AutoTest Central blog post, Finding the Worm Before the Apple Has Shipped, published under a CC BY 4.0 license ↩
Well-written and presented articles both, and I don’t intend to single them out for ridicule, though I disagree with their conclusions; they’re both a huge cut above the passing “goto is e-ville” anonymous blog comments. ↩
You’re welcome. ;-) ↩
Keith, if it wasn’t you, sorry! But I remember for a fact that such an article was written, and I’m pretty sure it was by a Test Mercenary. ↩