July 04, 2005

Learning from Failure

When we build complex systems, we mean we build systems that are too complex for any one person to understand. One person can understand a module completely, or hold an overview of all the components, but cannot understand the way the whole thing works.

An inevitable result of this is that complex systems fail in strange ways. And it is as perversely inevitable that we often only really advance our understanding of complex systems in the examination of failure. It's our chance to learn how to be more complex.

Here are some notes I've picked up in the last couple of months (and an old entry):

1. Wired reports on how the people stuck in the twin towers of the WTC ignored standard safety rules and also ignored what they were told. They used the elevators and stairs and scarpered.

The key lesson here is that the people on the scene often have more information than the experts, in both qualitative and quantitative terms. Wired takes this and says "Disobey Authorities" but that's going too far (Cubicle comments). What is more pertinent is that when the information is clearly better on the ground, you should encourage your people to see that and work with it. Let them drop the rules if they see the need, knowing that if they get that judgement call wrong they will face the music later on.

2. Over in the land of military madness called the Pentagon, they have just that problem. The solution - train the corporal to fight the insurgent on his own terms - seems to be an old one as it was considered learnt at the end of the Vietnam war by the US Army. At least, that was the wisdom I recall from countless military books and articles. Read this article for why it has been forgotten.

I'm not sure what the lesson is here, and indeed, the late John Boyd had a name for the syndrome. He called it "stuck in one's own OODA loop" and said there was no solution.

3. In another episode of safety engineering (seen on TV), the design and maintenance of the cockpit window in a jetliner came under question. At cruising altitude, it popped out, sucking the captain out and trapping him half in and half out. Rather uncomfortable at 10,000 metres.

Why this happened was a series of 13 identified failures, any one of which, had it not happened, would have prevented the accident. Mostly, the TV program focused on the hapless maintenance engineer who openly and honestly described how he had followed 'local' procedures and ended up being an unwitting installer of the wrong bolts.

There are three lessons in this story.

Firstly, the easiest lesson is to make your designs fail safely. These days aircraft windows are designed to be fitted from the inside so they can't pop out under cabin pressure. That's a fail-safe design.

Secondly, and more subtly, design your safety features to fail obviously! 13 different failures - yet they all kept going until the last one failed? Why wasn't one of these failures noticed earlier?

Finally, the subtle lesson here is that local conditions change - you can write whatever you like in the rule book, and you can set up whatever you like in the procedures, but if they are things that can be changed, ignored, bypassed, or twisted, then they will be. People optimise, and one of the things they love to optimise away is the rule book.

4. In TV documentaries and films, we've all no doubt seen the story of the O-ring engineer who was brow-beaten into silence before the shuttle went up. The safety system was overridden from on-high, because of commercial interests. We saw this same pressure a few weekends back in the farcical United States Grand Prix (Formula 1) race that dropped 14 cars because the tires were declared unsafe. All the bemoaning of damage to the sport and lack of compromise misses the key point - the safety checks are there to stop a wider Challenger-style disaster.

So money matters, and it often overrides simple and obvious safety issues, because when it doesn't, all monetary hell breaks loose. We see this all the time in security and in financial cryptography where basic flaws are left in place because it costs too much to fix them, and nobody can show the ROI.

The lesson then is to calculate the damage and make sure you aren't endangering the system with these flaws. When I design FC systems I try and calculate how much would be at risk if a worst-possible but unlikely crypto break happens, such as a server key compromise. If I can keep that cost under 1% of the entire system, by way of hypothetical example, I declare that "ok". If I can't, I start thinking of an additional defence-in-depth that will trigger and save the day.

It's by no means perfect, and some would criticise that as a declaration of defeat. But commercial pressures often rule the day, and a system that isn't fielded is one that delivers no security.

Risk analysis is the only way. But it's also expensive to do, far too expensive for most purposes, so we simplify this with metrics like "total system failure thresholds." For an FC system 1% of the float could be a trigger for that, as most businesses can absorb it. Or, if you can't absorb that, then maybe you have other problems.
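
By way of illustration only, the threshold check described above might be sketched like this. The names (Scenario, review and so on) and all the figures are hypothetical, invented for the example rather than taken from any real FC system; the 1% threshold is the one from the text.

```python
from dataclasses import dataclass

# Hypothetical threshold: 1% of the float, as in the example in the text.
FAILURE_THRESHOLD = 0.01


@dataclass
class Scenario:
    """A worst-possible-but-unlikely break, e.g. a server key compromise."""
    name: str
    exposure: float  # value at risk if the break happens (made-up units)


def exposure_ratio(scenario, float_total):
    """Fraction of the whole float that a single scenario puts at risk."""
    if float_total <= 0:
        raise ValueError("float_total must be positive")
    return scenario.exposure / float_total


def needs_more_defence(scenario, float_total, threshold=FAILURE_THRESHOLD):
    """True when the scenario exceeds the threshold and wants defence-in-depth."""
    return exposure_ratio(scenario, float_total) > threshold


def review(scenarios, float_total):
    """Label each scenario 'ok' or 'add defence' against the threshold."""
    return {
        s.name: "add defence" if needs_more_defence(s, float_total) else "ok"
        for s in scenarios
    }


if __name__ == "__main__":
    # Invented numbers: a float of 1,000,000 units and two scenarios.
    scenarios = [
        Scenario("server key compromise", 5_000.0),
        Scenario("issuer key compromise", 20_000.0),
    ]
    for name, verdict in review(scenarios, 1_000_000.0).items():
        print(f"{name}: {verdict}")
```

The hard part is not the arithmetic but estimating the exposure for each scenario; the sketch simply assumes those estimates exist.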

5. One of the big lessons of failure is redundancy. All things fail, so all things need alternates. I can't say it better than this closing discussion in the engineering failure of the WTC:

Professor Eagar: I think the terrorist danger will be other things. A terrorist is not going to attack the things you expect him to attack. The real problem is pipelines, electrical transmission, dams, nuclear plants, railroads. A terrorist's job is to scare people. He or she doesn't have to harm very many people. Anthrax is a perfect example. If someone could wipe out one electrical transmission line and cause a brownout in all of New York City or Los Angeles, there would be hysteria, if people realized it was a terrorist that did it.

Fortunately, we have enough redundancy -- the same type of redundancy we talk about structurally in the World Trade Center -- in our electrical distribution. We have that redundancy built in. I shouldn't say this, but this was how Enron was able to build up a business, because they could transfer their energy from wherever they were producing it into California, which was having problems, and make a fortune -- for a short period of time.

NOVA: Gas pipelines don't have redundancy built in, though.

Eagar: No, but one advantage of a gas pipeline is the damage you can do to it is relatively limited. You might be able to destroy several hundred yards of it, but that's not wiping out a whole city. The bigger problem with taking out a gas pipeline is if you do it in the middle of winter, and that gas pipeline is heating 20 percent of the homes in the Northeast. Then all of a sudden you have 20 percent less fuel, and everybody's going to have to turn the thermostat down, and you're going to terrorize 30 million people.

The lesson we have to learn about this kind of terrorism is we have to design flexible and redundant systems, so that we're not completely dependent on any one thing, whether it's a single gas pipeline bringing heat to a particular area or whatever.

Remember the energy crisis in 1973? That terrorized people. People were sitting in long lines at gas pumps. It takes five or 10 years for society to readjust to a problem like that. What happened in the energy crisis in 1973 was we had essentially all our eggs in one basket -- the oil basket. But by 1983, electric generating plants could flip a switch and change from oil to coal or gas, so no one could hold a gun to our head like they did before.

(Snippet taken from some site that tries and fails to make a conspiracy case. 1, 2 2nd page has snippet.)

Good stuff. Now try and design a system of money issuance that doesn't have a single point of failure - that's good FC engineering.

Posted by iang at July 4, 2005 09:52 PM

Spare capacity, of course, costs money and is the first thing to go when management strives for "efficiency". It also absorbs failures before they can spread. That can be good or bad.

The rule of making failures obvious underpins the "lean production" idea. A Japanese car plant deliberately lives without extra inventory so that failures call immediate attention to themselves. Run this way for a while, commit yourself to *fixing* the causes of failures, and you get a system which is both reliable and economical.

A perfect everyday example of making failures obvious, without cutting safety margins, is natural gas. That nasty smell you get from a gas leak? It's a chemical, from the same family as skunk spray, that the gas company adds to the methane before they pump it into miles of pipe.

In a perfect world, insurance companies would audit computer systems like Lloyd's used to inspect ships and would set premiums accordingly. That would fix the ROI problem.


Posted by: Beryllium Sphere LLC at July 5, 2005 03:54 AM

ACM had an interview with Bruce Lindsey on related topics. The interview is focused on designing for failure. They cover error reporting, recovery, Heisenbugs, graceful degradation, lack of language support for detection, failed members ("If there are five of us and four of us think that you’re dead, the next thing to do is make sure you’re dead by putting three more bullets in you.") and so on. It is primarily focused on engineering for failure for database systems, but there are a number of parallels to other disciplines.


Posted by: Gunnar at July 5, 2005 08:45 AM

Insurance. Bruce Schneier and many others think it will solve the information security problem, in conjunction with establishing the provider's responsibility.
I have the following reservations about insurance companies:
1. Their profits do depend on knowing the risks better than their customers. As long as we are as clueless about the risks in information security as we usually are, it's much cheaper to scare us into paying huge premiums than to assess our risks accurately. Insurance business is riddled with asymmetric information.
2. Insurance fraud. In information security, it is significantly harder to detect or prove than elsewhere.
3. see the more general remark below:

On a more general level. The industrial-age thinking of achieving economies of scale by concentrating capital and amassing human effort that worked so well in the past three centuries is increasingly often breaking down in the XXIst century. Small pockets of insurgents successfully take on the most sophisticated (?) mechanized military of the world. Blogs take on major news agencies. Filesharing networks take on the recording industry. (and we are hoping to take on banking, aren't we? :-) )

Big organizations fail in small ways all the time and don't even have a chance to notice due to their sheer size, while sometimes they fail big with disastrous consequences for large numbers of people. The little guy has only so many chances of little failures before getting out of the game, while he simply can't fail in a spectacular way with disastrous consequences, precisely because he is small.

I think that learning from failure is something at which big organizations are particularly bad, because small failures don't hurt enough, and natural selection (which can be thought of as a way of learning from failure) doesn't get a chance.

Posted by: Daniel A. Nagy at July 5, 2005 04:10 PM

One problem is that our society, faced with these complex systems, sets up meta-systems to 'prove' safety, robustness, etc. (Sarbanes-Oxley is an example, I suppose).

These systems involve, usually, collecting data about the real systems - ie putting ticks in boxes. It's a good idea because it's easy to handle this sort of data: you can produce instant statistics about how often you check things, how successful the checks were, how departments compare with each other, etc. It's better than not checking anything.

But it's also very easy to get into the mentality that a tick in a box means that the system really is safer, more robust, less likely to fail, etc. You're right that we need to 'learn to be more complex' - in the face of the human (and corporate financial) need to make things simpler and easier to do.

Posted by: David Upton at July 6, 2005 12:31 PM