RSS
 

Back

24 Apr

I’m back from my mini-vacation and it sounds like things are crazy at the office, but I think even so that I should have time to start writing again and I look forward to doing so.

My first order of business tonight is to start reading Systemantics: How Systems Work and Especially How They Fail by John Gall which was recommended by Raymond Chen and therefore automatically must be worth reading. Because I’m sure many of you are too lazy to actually follow that link, here’s the relevant RChen review:

Systemantics is very much like The Mythical Man-Month, but with a lot more attitude. The most important lessons I learned are a reinterpretation of Le Chatelier’s Principle for complex systems (“Every complex system resists its proper functioning”) and the Fundamental Failure-Mode Theorem (“Every complex system is operating in an error mode”).

You’ve all experienced the Fundamental Failure-Mode Theorem: You’re investigating a problem and along the way you find some function that never worked. A cache has a bug that results in cache misses when there should be hits. A request for an object that should be there somehow always fails. And yet the system still worked in spite of these errors. Eventually you trace the problem to a recent change that exposed all of the other bugs. Those bugs were always there, but the system kept on working because there was enough redundancy that one component was able to compensate for the failure of another component. Sometimes this chain of errors and compensation continues for several cycles, until finally the last protective layer fails and the underlying errors are exposed.

That’s why I’m skeptical of people who look at some catastrophic failure of a complex system and say, “Wow, the odds of this happening are astronomical. Five different safety systems had to fail simultaneously!” What they don’t realize is that one or two of those systems are failing all the time, and it’s up to the other three systems to prevent the failure from turning into a disaster. You never see a news story that says “A gas refinery did not explode today because simultaneous failures in the first, second, fourth, and fifth safety systems did not lead to a disaster thanks to a correctly-functioning third system.” The role of the failure and the savior may change over time, until eventually all of the systems choose to have a bad day all on the same day, and something goes boom.

Time to hit the books!

 

Comments are closed.