June 01, 2008

Case Study 2: OpenSSL's patched-out randomness

In the aftermath of the OpenSSL failure caused by a vendor patch (which bit the vendor badly...), there has been more analysis. Clearly, the early attempts to describe this were flawed, and mine was no exception, as the precise failure was not well understood at the time.

I did miss one important thing, pointed out by Philipp Gühring: when building high-security apps, it is necessary to mix in several different sources of randomness, because we should assume that the lower layers will fail. But, while necessary, that is not sufficient. We still need to show that our mixed randoms are actually reaching the right place, which calls for some form of testing strategy. I suggest:

  1. all sources be set to some fixed X like zero, and then run many times, to show that the result is the same each time,
  2. each source be singly varied, and then show that the result varies each time,
  3. test each result for randomness,
  4. read the code. (So, use simple code for this area.)
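Steps 1 and 2 above can be sketched as a small test harness. This is a minimal illustration, not OpenSSL's actual pool: the `mix()` function here is a hypothetical stand-in that simply hashes all sources together with SHA-256.

```python
import hashlib

def mix(*sources: bytes) -> bytes:
    """Toy mixer: hash all entropy sources together.

    A hypothetical stand-in for the real randomness pool; the point of
    the harness is to show that every source influences the output.
    """
    h = hashlib.sha256()
    for s in sources:
        h.update(s)
    return h.digest()

# Step 1: fix every source to some constant X (here, zeros) and run many
# times -- the result must be identical on every run.
runs = [mix(b"\x00" * 32, b"\x00" * 32) for _ in range(100)]
assert all(r == runs[0] for r in runs), "mixer is not deterministic"

# Step 2: vary each source singly -- the result must change, proving
# that each source is actually reaching the pool.
base = mix(b"\x00" * 32, b"\x00" * 32)
for i in range(2):
    srcs = [b"\x00" * 32, b"\x00" * 32]
    srcs[i] = b"\x01" * 32
    assert mix(*srcs) != base, f"source {i} is not reaching the pool"
```

Step 3 (testing each result for randomness) would then run the outputs through a statistical suite such as the NIST SP 800-22 tests, and step 4 is exactly why the mixing code itself should stay simple enough to read.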

In addition to others' comments, I found this observation from David Brodbeck, posted in the comments on EC, extremely useful:

In aviation it's common to talk about an "accident chain," the series of small mistakes that lead up to an incident. Breaking any one link in the chain would have stopped the accident from happening. That's kind of what happened here.

I suggest we incorporate the accident chain into the security lingo. We can guess that each mistake by itself was probably innocuous, otherwise we would have fixed it. Instead of the current 'best practice' of fingerpointing, it is then much better to document the chain of mistakes that led to the accident, and to think about how you are going to deal with them. Ideally, be able to show that any one component could fail completely, and disaster would not then follow.

Finally, as I mentioned, it's better to own the problem than to avoid it. I was heartened then to see that Eric Young wrote:

I just re-checked, this code was from SSLeay, so it pre-dates OpenSSL taking over from me (about 10 years ago, after I was assimilated by RSA Security).

So in some ways I'm the one at fault for not being clear enough about why 'purify complains' and why it was not relevant. Purify also incorrectly complained about a construct used in the digest gathering code which functioned correctly, but purify was also correct (a byte in a read word was uninitialised, but it was later overwritten by a shifted byte).

One of the more insidious things about Purify is that once its complaints are investigated, and deemed irrelevant (but left in the library), anyone who subsequently runs purify on an application linking in the library will get the same purify warning. This leads to rather distressed application developers. Especially if their company has a policy of 'no purify warnings'.

One really needs to ship the 'warning ignore' file for purify (does valgrind have one?).
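Valgrind does have an equivalent: suppression files, passed with `--suppressions=file.supp`, and `--gen-suppressions=all` will print ready-to-paste entries for each warning. A shipped entry might look something like this sketch (the entry name and function names here are hypothetical stand-ins, not OpenSSL's actual symbols):

```
{
   rand_pool_deliberate_uninit_read
   Memcheck:Cond
   fun:my_rand_add
   fun:my_rand_seed
}
```

Shipping such a file alongside the library would spare downstream application developers from re-triaging warnings the library authors have already investigated and deemed harmless.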

I personally do wonder why, given that the original author had left purify-related comments (meaning he was aware of the issues but had still left the code in place), the reviewer did not consider that the code did something important enough to justify ignoring purify's complaints.

So what we saw was a timebomb that had been ticking for around 10 years. As Ben Laurie mentioned, an awful lot has changed in our knowledge and available tools since then, and as mentioned above, the original code author has long since left the scene. Definitely a case for avionics accident investigation. I wonder if they write books on that sort of stuff?

Posted by iang at June 1, 2008 05:11 PM

I have recently read a fascinating book about aviation accidents. Unfortunately (for you) it was in Hungarian. And it was one of those rare cases of having been originally written in Hungarian (by a veteran Hungarian airlines pilot), not a translation.

Posted by: Daniel Nagy at June 3, 2008 07:19 PM