February 14, 2004

Crash-only Software

Here is a recent paper on the notion of programs that only ever recover, and recover fast. That is, they are never shut down, only killed. They never start up "normally"; they only recover.

Ricardo has always done this, and in fact the accounts engine, LaMassana, makes a big play of this design principle as its secret weapon for achieving high reliability and high performance in relatively few lines of code. The other components follow the same principle but, not being stateful, are less interesting.
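To make the principle concrete, here is a minimal sketch of a stateful component whose only start-up path is recovery. It is not LaMassana's actual design. The class name CrashOnlyLedger, the JSON-lines journal format and the apply method are all assumptions made up for illustration.

```python
import json
import os


class CrashOnlyLedger:
    """Hypothetical crash-only store: opening it always means recovering."""

    def __init__(self, path):
        self.path = path
        self.balances = {}
        self._recover()  # the only way up: replay the journal
        self._log = open(path, "a", encoding="utf-8")

    def _recover(self):
        if not os.path.exists(self.path):
            return  # an empty journal recovers to an empty state
        good_bytes = 0
        with open(self.path, "rb") as f:
            for raw in f:
                if not raw.endswith(b"\n"):
                    break  # torn final write left by a crash
                try:
                    entry = json.loads(raw.decode("utf-8"))
                    account, amount = entry["account"], entry["amount"]
                except (ValueError, KeyError, TypeError):
                    break  # unreadable record: stop replaying here
                self.balances[account] = self.balances.get(account, 0) + amount
                good_bytes += len(raw)
        # Cut off anything after the last complete record so that new
        # records are never appended behind a half-written one.
        with open(self.path, "r+b") as f:
            f.truncate(good_bytes)

    def apply(self, account, amount):
        # Journal first and force it to disk, then update memory.
        line = json.dumps({"account": account, "amount": amount}) + "\n"
        self._log.write(line)
        self._log.flush()
        os.fsync(self._log.fileno())
        self.balances[account] = self.balances.get(account, 0) + amount

    # Deliberately no close() or shutdown(): the only way down is a kill.
```

Because there is no clean-shutdown path, the recovery code runs on every start and so gets exercised constantly, which is a large part of the appeal.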

There is one area where Ricardo deviates from the paper: it does not do pre-emptive or algorithmic crash-rebooting. As we are doing transactions, we want to know the cause of every crash, and either fix it or, at the very least, not mask it.

Referenced here in the FC-KB

Abstract: "Crash-only programs crash safely and recover quickly. There is only one way to stop such software - by crashing it - and only one way to bring it up - by initiating recovery. Crash-only systems are built from crash-only components, and the use of transparent component-level retries hides intra-system component crashes from end users. In this paper we advocate a crash-only design for Internet systems, showing that it can lead to more reliable, predictable code and faster, more effective recovery. We present ideas on how to build such crash-only Internet services, taking successful techniques to their logical extreme."
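The "transparent component-level retries" in the abstract could look something like the sketch below. This is only an illustration: ComponentCrashed, call_with_retry and the restart callback are hypothetical names, and the paper's own mechanism may differ. It assumes requests are idempotent, so that re-issuing one is safe. In keeping with the point above about not masking crashes, every failure is recorded rather than silently swallowed.

```python
import time


class ComponentCrashed(Exception):
    """Hypothetical signal that a component died while handling a request."""


def call_with_retry(request, restart, attempts=3, delay=0.0, crash_log=None):
    """Run request(); on a crash, record it, recover the component, retry.

    request and restart are caller-supplied callables (assumptions of this
    sketch). request must be idempotent.
    """
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except ComponentCrashed as exc:
            if crash_log is not None:
                crash_log.append((attempt, str(exc)))  # keep the cause
            if attempt == attempts:
                raise  # out of retries: let the caller see the crash
            restart()  # bringing the component up means recovering it
            time.sleep(delay)
```

The end user only sees a failure once the retries are exhausted, while crash_log still holds the cause of every crash for later diagnosis.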



Addendum 2004-07-20 - Zooko alerted me to this blog entry on the paper.

Posted by iang at February 14, 2004 05:53 PM