“A pint of sweat will save a gallon of blood.” (Gen. George S. Patton)
“Mostly Harmless” (Douglas Adams) is a novel that opens with a scene reminiscent of the world of technology. A spaceship’s computer becomes inoperative after a meteor strike, and an emergency routine is engaged to replace the computer with its backup. Ironically, the backup falls through the very same meteor hole that claimed the primary computer, because the emergency routine couldn’t detect the hole. Welcome to the world of cascading failures, and how to mitigate them!
Failure happens, especially when systems are complex. As a technology executive, you must plan for failure, so that its impact is minimized in scope, duration and cost. Your company’s brand is at stake. You’ve got to think about rainy days, stormy days and hurricane days: literally as well as figuratively.
In today’s complex technology ecosystems, protecting components (data/storage, network, compute capacity/performance) is only half the battle. Every time a complex system changes, there is risk. New interfaces, data sources, data models, traffic patterns, utilization patterns, hardware and software can all throw a metaphorical spanner in the works. Regression tests should accompany such changes as a matter of course, but you should also stay on top of your inventory (using ITIL processes) and maintain a Configuration Management Database (CMDB).
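To make that concrete, here is a minimal sketch (in Python, with purely hypothetical names such as ChangeRecord and CI-042) of tying a change record to the configuration items it touches; real ITIL tooling would back this with a full CMDB, an approval workflow and audit history.

    from dataclasses import dataclass, field
    from datetime import date

    @dataclass
    class ConfigurationItem:
        ci_id: str      # identifier in the CMDB
        name: str       # e.g. "billing-db-primary"
        owner: str      # accountable team

    @dataclass
    class ChangeRecord:
        change_id: str
        description: str
        planned_date: date
        affected_cis: list = field(default_factory=list)  # inventory items this change touches
        regression_tests_passed: bool = False              # gate: no deploy without a green run

        def ready_to_deploy(self) -> bool:
            # a change with no declared CIs is as risky as one with failing tests
            return self.regression_tests_passed and bool(self.affected_cis)

    billing_db = ConfigurationItem("CI-042", "billing-db-primary", "dba-team")
    change = ChangeRecord("CHG-1001", "Add new payment-provider feed",
                          date(2025, 6, 1), [billing_db])
    print(change.ready_to_deploy())  # False until the regression suite has passed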
Now, imagine how prepared you and your team are for any of the following:
1) Loss of a data center.
2) Prolonged impairment of an internal network or Internet connectivity.
3) Loss of a primary database.
4) A major failure which will involve a switch to another data center/cloud, and require restoration processes. (How will command and control work? How will communication work? Who makes strategic and tactical decisions?).
5) A major failure which will require use of a valid backup. (Are your backups verified, validated and viable? How long will it take to restore? In which locations are backups kept?).
6) A major failure which requires a switch to a manual process. (Who makes the call? How will a backlog be dealt with?).
7) A significant security breach, or loss of customer data to unauthorized third parties.
8) Significant technical drift between the production environment and the DR environment (in terms of capacity, functionality, configuration).
9) A loss of email during a disaster or outage. (Do you realize how significant email has become as a “vital” system?).
10) Extended outages due to database failures; corruption that extends to backups; malicious or unintentional Denial of Service (DoS) issues; network broadcast storms; DNS/DHCP issues; loss of Internet connectivity; HVAC failure; false or legitimate activation of automated fire suppression; inability to refuel generators; virus/malware/hacking; middleware issues; and/or employee malice.
So, after contemplating the above scenarios and your preparedness for them, I hope you can see that the world is a scary place and that there are all sorts of failures you have to anticipate.
Obviously, your goal is to prevent (or mitigate) these issues, and you must do your due diligence. But while an ounce of prevention is worth a pound of cure, you had better know what the cure is, how long it will take, and how much it will cost.
One of the best things you can do for preparedness sounds weird: practice! That is, you must run effective tabletop exercises and play through likely and unlikely failure scenarios. This is important because it tests your operational model, and when you test your model, you find the gaps in it. Even if your team is well versed in the technical recovery procedures for when a data center goes down, do they have a good grasp of how to communicate with affected users and executive management? What happens if email and cell phone communications are down at the same time? Who drives decisions in sub-optimal conditions, when you can’t get hold of the CEO, COO or CTO? Who decides which systems and applications go back up, and in what order? Knowing what the gaps are helps you build a robust Disaster Recovery and Business Continuity Plan, and helps you figure out whether you need to invest in resources to help you in the event of a failure.
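On that last question, the bring-up order usually falls out of the dependency map you should already have in your CMDB. The sketch below, with purely illustrative system names, shows one way to derive a restore order from declared dependencies; it is an aid for the tabletop exercise, not a substitute for the judgment calls a real recovery demands.

    from graphlib import TopologicalSorter  # Python 3.9+

    # Each system maps to the systems it depends on (which must come up first).
    depends_on = {
        "customer-portal": {"auth-service", "orders-db"},
        "auth-service":    {"directory"},
        "orders-db":       {"storage-array"},
        "directory":       {"storage-array"},
        "storage-array":   set(),
    }

    restore_order = list(TopologicalSorter(depends_on).static_order())
    print(restore_order)
    # e.g. ['storage-array', 'directory', 'orders-db', 'auth-service', 'customer-portal']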
Preparedness is also a state of mind. It can only be developed by actual experience of failure and/or simulated experience of failure. Being prepared means that you have an inherent understanding of the components of your system, and of how each component affects the others. So, when you are faced with the absence of a component, you can predict the likely sequence of failures that would arise.
Furthermore, preparedness can be a shortcut. If you have decided your priorities in advance (through regular business impact analysis) and have various backup communication channels and business continuity procedures in place (and even certain types of technical hardware/software), then it will be easier to get through an unpredictable emergency, because you don’t have to think about your priorities, your backups, or how in goodness’ name you are going to get in touch with each member of the command and control team in a specific order. Instead, you will know what resources you have and will spend your time deciding what to utilize, not how to utilize it.
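As a minimal sketch of that “specific order”, here is what a pre-agreed call tree with fallback channels could look like in Python; the roles, channels and the notify() stub are assumptions, and in practice this lives in your BCP documentation and an out-of-band alerting tool rather than in a script.

    CALL_TREE = [
        # (role, primary channel, fallback channel), in contact order
        ("Incident Commander", "mobile", "satellite phone"),
        ("CTO",                "mobile", "home landline"),
        ("Ops Lead",           "email",  "SMS"),
        ("Comms Lead",         "email",  "mobile"),
    ]

    def notify(role: str, channel: str) -> bool:
        """Stub: return True if the person acknowledged on this channel."""
        print(f"Trying {role} via {channel}...")
        return False  # pessimistic default, useful when running this as a drill

    def run_call_tree() -> None:
        for role, primary, fallback in CALL_TREE:
            if not notify(role, primary):
                notify(role, fallback)  # the fallback was agreed in advance, no mid-crisis debate

    run_call_tree()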
Some preparedness is common sense. For example: do you have valid backups (and do you test them regularly)? Have you placed your data center locations appropriately? Have you practiced and timed restoration? Some aspects of preparedness require investment (because the operations are so critical, there is no scope for error or downtime). You also have to figure out your residual risk posture and act accordingly.
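To make “practiced and timed restoration” measurable, a restore drill can be as simple as the sketch below: check the backup against its recorded checksum, restore it to a scratch area, and time the whole thing. The paths, the manifest digest and the tarball format are assumptions; adapt it to whatever your backup tooling actually produces.

    import hashlib
    import tarfile
    import time
    from pathlib import Path

    def sha256(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def restore_drill(backup: Path, expected_sha256: str, scratch: Path) -> float:
        assert sha256(backup) == expected_sha256, "backup is corrupt or incomplete"
        start = time.monotonic()
        with tarfile.open(backup) as tar:
            tar.extractall(scratch)          # restore to a scratch area, never to production
        return time.monotonic() - start      # a measured restore time, not a guess

    # Usage (hypothetical paths and digest):
    # elapsed = restore_drill(Path("/backups/orders-2025-06-01.tar.gz"),
    #                         "<digest from the backup manifest>",
    #                         Path("/tmp/restore-drill"))
    # print(f"Restore completed in {elapsed:.0f} seconds")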
Disaster recovery is not only about having a good backup, but also about being able to get an operation going again quickly with new/hired/replacement/spare/cloud technical resources, and with people who have the appropriate processes and the ability to improvise based on pre-established priorities.
There’s a lot to talk about. Please join me in the conversation.
Mike Ross <TechOpsExec@gmail.com>.
#techopsexec
