The Outage that Never Happened
AUTHORS: Mark Knight and Bernard Gaudreault
Resilience is an interesting topic. It is something that everyone intrinsically understands yet there are many different definitions and yet more interpretations of those definitions. To be, or not to be, that is the question: whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune, or to take arms against a sea of troubles, and by opposing end them: to die, to sleep. That, as you probably know, is from William Shakespeare's play Hamlet, Act 3, Scene 1. When it comes to resilience the debate is often whether it is better to resist the failure of an asset system at all costs or to suffer the impacts of the failure but to recover quickly. There are good arguments to be made for each.
Putting measures in place for either approach requires investment. For heavily capitalized businesses this may require investments in the range of millions to even billions of dollars. This is where it starts to get interesting. Unless you happen to have a billion dollars lying around unused you need to borrow the money, raise capital for the investment, or get approval to redirect existing investment plans. Inevitably you will have to address the question of what are you delivering for that money? In other words, how do you measure the benefits?
If you invest in strengthening the system in order to resist the factors that may cause the system to fail, how do you measure the benefits? Whether we focus on avoiding an outage or whether we focus on recovering quickly what we are trying to do is to reduce the magnitude and the duration of an outage. When an outage occurs we can measure its impact but in the limit we are aiming for zero impact for zero time so if an outage never happens how do you measure the avoided impact without getting into modeling and arguments over approaches?
We would argue that the concept of resilience applies to a system, not to individual assets. We can measure the performance of each asset including its availability but measuring the resilience of a system is more difficult. Much more difficult. There are several distinct elements to resilience. We have already discussed two: avoiding disturbances and recovering from them but there is also the time where the impact has been felt but the event is still unfolding while we are still responding to it. This happens after our attempted avoidance and before we can proceed with recovery. There are arguably more stages during the life of a resilience event but even with just these three stages one thing is apparent: their characteristics are very different.
This is important when it comes to objectively quantifying resilience. Since the characteristics of each stage are different, the ways we need to measure each of them will be different. What if there are other systems dependent on your system and losing your system impacts other systems?
According to Shrek, ogres are like onions. So is Resilience.
Download the article
5. The Outage that Never Happened.pdf