No system is fault-proof, but high-volume, high-velocity, high-throughput (or whichever flavor of HPC you practice today) systems are often expected to be. Every time we state otherwise we lie to either our clients, our colleagues, or, worse – ourselves.
That said, some measures can be taken to mitigate the adverse effects of such failures, giving us an overall more stable system. Put into practice, we are interested in maximizing MTBF (mean time between failures) and minimizing MTTR (mean time to recovery). This is very important and does not need much explaining. What is worth you wasting your time reading and me wasting my time writing about is the taxonomy of the common ideas, concepts and tools in our fight to idiot-proof the universe.
The general premise here is that we now have a seemingly well-working system but, as always, we have no guarantees for how long this “time of peace” will last.
The two reasons we have faults (and how they are really the same reason)
First, I’ll start off with the reasons:
- Other systems we depend on will fail
- Programmers we depend on will fail
This is what we in the industry call (and excuse the terminology) obvious. These two constraints are universal: every piece of code relies on other code to work and, naturally, was written by someone (yes, even computer-generated code was once written by someone, in some manner).
My claim that these two ideas are basically the same does not simply imply that programmers are systems on which our code depends (this is also verging on the obvious), but that every error, problem or exception in a different system that causes our own to fail is the fault of no one but our very own selves. As dumb, futile or impotent as they are, other programmers just can’t take responsibility for your code failing. When handling a production system, Pokémon exception handling (“gotta catch ’em all”) is no longer the misdemeanor it was in school, but a necessity.
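As a minimal sketch of that catch-all at a service boundary (the `process` function and its request shape are invented for illustration):

```python
import logging

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger("service")

def process(request):
    # Stand-in for real business logic; blows up on bad input.
    return 100 / request["value"]

def handle_request(request):
    # The broad except is deliberate: a failure in code we depend on
    # is still ours to contain at the boundary.
    try:
        return process(request)
    except Exception:
        log.exception("request failed: %r", request)
        return None  # degrade gracefully instead of crashing the service
```

A `{"value": 0}` request now gets logged with a full traceback and answered with a fallback, instead of taking the whole process down.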
The three things we can do to handle problems better (and how they are derived naturally from one another)
- better testing
- better logging
- better compartmentalization of our compute process
The first is a known problem – tests should be written but never are. The foremost importance of tests is not to make sure everything is working, but to make sure we didn’t break anything lately. This means writing the tests per user feature, and running them automatically whenever we can.
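A sketch of what a per-user-feature regression test can look like in Python (the discount feature and its numbers are made up), written so a CI job can run it automatically on every push:

```python
# Run automatically (e.g. with pytest in CI) so a change that breaks
# the feature fails loudly before it reaches production.

def apply_discount(price, percent):
    """Hypothetical user feature: apply a percentage discount."""
    return round(price * (1 - percent / 100), 2)

def test_basic_discount():
    assert apply_discount(100.0, 10) == 90.0

def test_zero_discount_is_identity():
    assert apply_discount(50.0, 0) == 50.0
```

Each test names a user-visible behavior, so when one fails you know which feature you just broke.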
The second is a simple one – log the hell out of your system. Every section of code should be able to report any change of state, and every problem it might encounter, to a centralized location where all logs can be inspected. This last part is important: if you can’t inspect the logs, why bother? What exactly to log is a question as difficult as “what exactly to comment”, but unlike comments, the testing process will make it painfully (painfully) clear what logs you should have.
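A minimal sketch with Python’s standard `logging` module (the logger names and messages are invented); the point is one configured sink that a central collector can watch:

```python
import logging

# One format, one sink: every subsystem's logger funnels into the
# root handler, which a central collector (file, syslog, log
# aggregator) can then inspect in one place.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# Each subsystem gets a named logger, so the central view still
# tells you where a message came from.
order_log = logging.getLogger("orders")
payment_log = logging.getLogger("payments")

order_log.info("state change: order 42 -> SHIPPED")
payment_log.warning("retrying charge for order 42")
```

In a real deployment you would point the handler at a file or a log-shipping agent rather than stderr, but the shape stays the same: named loggers everywhere, one place to look.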
The third is more of a concept. If earlier I said that the “catch-all” concept is a necessity, I will now elaborate: you would think the tests would catch every bug, but remember that second reason – the programmer (namely us) probably missed something. For that we use a catch-all trap. These should be placed at strategic points, at the root of every major computational subsystem you have. Segmenting your big system into “services” is a good way to look at it – every such service is a self-contained part of the overall process and can be retried safely if it fails. That is the first half of the picture – the second is to create a test that recreates the bug, and then solve it, and for that, my dear, you need some serious logs in place.
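Putting the pieces together, here is a sketch of a catch-all trap at the root of one such “service”, with a bounded, safe retry (the flaky task is invented for illustration):

```python
import logging
import time

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger("worker")

def run_service(task, attempts=3, delay=0.0):
    # Catch-all at the root of a self-contained service: log the
    # failure with a full traceback, then retry a bounded number
    # of times -- safe only because the task is self-contained.
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            log.exception("attempt %d/%d failed", attempt, attempts)
            time.sleep(delay)  # room for backoff between retries
    raise RuntimeError("service failed after %d attempts" % attempts)

# Illustrative flaky task: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("transient failure")
    return "ok"

print(run_service(flaky))  # prints "ok" on the third attempt
```

The traceback the trap logs is exactly the raw material for the second half: a test that recreates the bug.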