Comments: Learning from Failure

Spare capacity, of course, costs money and is the first thing to go when management strives for "efficiency". It also absorbs failures before they can spread. That can be good or bad.

The rule of making failures obvious underpins the "lean production" idea. A Japanese car plant deliberately lives without extra inventory so that failures call immediate attention to themselves. Run this way for a while, commit yourself to *fixing* the causes of failures, and you get a system which is both reliable and economical.

A perfect everyday example of making failures obvious, without cutting safety margins, is natural gas. That nasty smell you get from a gas leak? It's a chemical, from the same family as skunk spray, that the gas company adds to the methane before they pump it into miles of pipe.

In a perfect world, insurance companies would audit computer systems the way Lloyd's used to inspect ships, and would set premiums accordingly. That would fix the ROI problem.

--Fred

Posted by Beryllium Sphere LLC at July 5, 2005 03:54 AM

ACM had an interview with Bruce Lindsey on related topics. The interview is focused on designing for failure. They cover error reporting, recovery, Heisenbugs, graceful degradation, lack of language support for detection, failed members ("If there are five of us and four of us think that you’re dead, the next thing to do is make sure you’re dead by putting three more bullets in you.") and so on. It is primarily focused on engineering for failure in database systems, but there are a number of parallels to other disciplines.

http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=233&page=1

Posted by Gunnar at July 5, 2005 08:45 AM

Insurance. Bruce Schneier and many others think it will solve the information security problem, in conjunction with establishing the provider's responsibility.
I have the following reservations about insurance companies:
1. Their profits depend on knowing the risks better than their customers do. As long as we are as clueless about the risks in information security as we usually are, it's much cheaper to scare us into paying huge premiums than to assess our risks accurately. The insurance business is riddled with asymmetric information.
2. Insurance fraud. In information security, it is significantly harder to detect or prove than elsewhere.
3. See the more general remark below.

On a more general level: the industrial-age thinking of achieving economies of scale by concentrating capital and amassing human effort, which worked so well in the past three centuries, is breaking down more and more often in the 21st century. Small pockets of insurgents successfully take on the most sophisticated (?) mechanized military of the world. Blogs take on major news agencies. Filesharing networks take on the recording industry. (And we are hoping to take on banking, aren't we? :-) )

Big organizations fail in small ways all the time and don't even have a chance to notice due to their sheer size, while sometimes they fail big, with disastrous consequences for large numbers of people. The little guy gets only so many little failures before he is out of the game, but he simply can't fail in a spectacular way with disastrous consequences, precisely because he is small.

I think that learning from failure is something at which big organizations are particularly bad, because small failures don't hurt enough, and natural selection (which can be thought of as a way of learning from failure) doesn't get a chance.

Posted by Daniel A. Nagy at July 5, 2005 04:10 PM

One problem is that our society, faced with these complex systems, sets up meta-systems to 'prove' safety, robustness, etc. (Sarbanes-Oxley is an example, I suppose).

These systems usually involve collecting data about the real systems, i.e. putting ticks in boxes. It's a good idea because it's easy to handle this sort of data: you can produce instant statistics about how often you check things, how successful the checks were, how departments compare with each other, etc. It's better than not checking anything.

But it's also very easy to get into the mentality that a tick in a box means that the system really is safer, more robust, less likely to fail, etc. You're right that we need to 'learn to be more complex' in the face of the human (and corporate financial) need to make things simpler and easier to do.

Posted by David Upton at July 6, 2005 12:31 PM