Relief vs. Root Cause Analysis
A response to a discussion at JeremyK's excellent MS weblog.
This is a question that rarely comes up for your typical firefighter: relief (put out the fire!) goes before root cause analysis (it was a pan of grease on the stove). Much easier prioritization decision!
It's a balance, just like the balance you strike when deciding whether to take all your servers down right now to apply the latest series of security patches. Is the risk of an exploit lower than the cost of a self-inflicted denial of service, that is, the sales lost while I reboot? If so, you might want to consider patching over the weekend. Is the risk of an exploit and associated losses higher than the risk of possible lost sales? Then bring 'em down, cowboy.
The balance in this case is based on, amongst other things:
A- how much the analysis time is costing in lost productivity or business opportunity while the problem is ongoing, vs. how quickly I could apply relief and reduce that cost
B- how much longer it will take to perform further analysis, before I can determine root cause
C- how soon the problem will reappear if I just apply relief. If I reboot tonight and the problem goes away for two years, I'll be much more willing to select "relief" than in a scenario where I'll be called back in an hour.
D- the level of confidence I have in my answers to the above questions, and the level of confidence I have that the relief I apply will actually work.
E- the value of the data lost when I apply relief. If logs are critical to resolving the root cause, and I will lose them completely by applying relief, I'm less willing to do so. If I can apply relief and still work on RCA, that's more palatable.
F- SLAs. If my company gets financial penalties on downtime per incident, my incentive to search for root cause may be diminished. If I get penalties on cumulative downtime, I may want to resolve the problem for good, in which case I want root cause.
G- closely related to the above: how loudly is the client shouting in my ear?
H- how long I've been working on the problem with no forward movement. I'm more willing to provide relief and throw in the towel on RCA if I've been trying to fix the problem for 48 hours and I don't feel I've made progress. If I feel the solution is "just around the corner" I'm more willing to continue analysis.
I- how much sleep I've had in the past 48 hours, and whether the coffee in the breakroom is any good.
The last one is not completely in jest. Root cause analysis often requires a sharp, focused, alert! mind that is in tune with the environment and can detect minute anomalies or variations from the norm. Rebooting takes one binary brain cell and an index finger.
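For the factors that can be given a number (A through D), the trade-off can be roughed out as an expected-cost comparison. The following is only a minimal sketch under my own assumptions, not a formula from the discussion: the `Incident` fields, the idea of counting expected recurrences over some planning horizon, and the rule that a failed relief attempt falls back to full analysis are all hypothetical simplifications. Factors E through I are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical inputs; every field name here is an illustrative assumption."""
    outage_cost_per_hour: float        # factor A: cost while the problem is ongoing
    hours_to_apply_relief: float       # factor A: how quickly relief can be applied
    hours_to_root_cause: float         # factor B: remaining analysis time
    expected_recurrences: float        # factor C: times the problem returns if only relieved
    relief_success_probability: float  # factor D: confidence that relief works


def expected_cost_relief(incident: Incident) -> float:
    """Cost of applying relief now, and again on every expected recurrence."""
    per_attempt = incident.outage_cost_per_hour * incident.hours_to_apply_relief
    attempts = 1 + incident.expected_recurrences
    # Assumption: if relief fails, we pay for the full analysis anyway.
    fallback = (
        (1 - incident.relief_success_probability)
        * incident.outage_cost_per_hour
        * incident.hours_to_root_cause
    )
    return per_attempt * attempts + fallback


def expected_cost_analysis(incident: Incident) -> float:
    """Cost of keeping the problem live until root cause is found."""
    return incident.outage_cost_per_hour * incident.hours_to_root_cause


def prefer_relief(incident: Incident) -> bool:
    """True when relief is the cheaper option under this simplified model."""
    return expected_cost_relief(incident) < expected_cost_analysis(incident)


# Reboot tonight, problem stays away: relief wins.
quiet = Incident(1000.0, 0.25, 8.0, 0.0, 0.9)
# Called back every hour: recurrences pile up and analysis wins.
noisy = Incident(1000.0, 0.25, 8.0, 40.0, 0.9)
print(prefer_relief(quiet), prefer_relief(noisy))
```

The two sample incidents differ only in how often the problem is expected to come back, which is enough to flip the decision. The numbers are made up for illustration.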
All of the factors above are balanced in the equation:
$$X = \lim_{H \to F} \left( \frac{A}{I^2} \left( B - \frac{G}{D} \right) + C \right)$$
Plus a constant, of course. As X trends to 1, I'll be more willing to just go to Starbucks rather than drink any more of that overwarmed pot sludge.