Landmine #2: You are only as good as your last outage…

Previously, I discussed the need to do no harm to your brand. This applies to internal as well as to external customers.

For your peers or customers, chronic service issues create fear, uncertainty and doubt. This is unhelpful when they depend on you for critical services. Some service issues not only damage the brand, but also have direct financial consequences (penalties, fines, lost revenue, lost accounts, etc.). Executives who fail to take care of such issues can, of course, prepare three envelopes.

Excellent fire fighting skills are essential to Technology Operations organizations. However, you don’t want fire fighters to manage everyone, or else you will only be putting out fires.

I’ve been to all of these movies. I know the scary parts. If you recognize any of the following behaviors, then you will be spending a disproportionate amount of time with fire hoses:

– Hero culture. Rewarding firefighters as heroes has an unintended consequence: often, these heroes start fires. Or the operations staff feel disempowered to suggest improvements, or their suggestions have fallen on deaf management ears for so long that they have stopped making them. Of course, you always have some thrill seekers on your staff who relish the challenge and excitement of the mystery or puzzle of an outage. See that the puzzle solvers thrive, and that fire starters do not. (See City on Fire, Backdraft, Kill Bill).

– Timebombs. Unfixed sins of the past introduce failure points and poor problem isolation capabilities (excessive complexity), and leave you with no forensic data. (See Apollo 13, Gremlins, Snakes on a Plane).

– Landmines. Unreformed or disused practices (such as architecture, design, development, testing, maintenance) allow the timebombs to get worse and more frequent: you lose a lot of limbs when you launch a product, etc. (Ibid).

– Chaotic, overwhelming business demands.  Your processes for thoroughly vetting and implementing new services, technology, products, applications are skipped or bypassed frequently. Your development teams ignore or defer operational and security requirements. Or worse, there are no requirements. (See Speed).

– There are no warning (Yellow) flags. If your team and monitoring tools are fixated on Red flags, or on maintaining Green flags, you are unlikely to understand the root causes of issues and their warning signs. A Green flag focus indicates that you are measuring the wrong things, or individual things, without understanding the system from a customer perspective. Red flag monitoring indicates a reactive mindset. Monitoring and understanding Yellow flags gives the teams much greater insight and ability to predict the causes of failure and when such failure will occur. (See The China Syndrome).

– Complexity. You have hundreds or thousands of composite services and applications in a SOA framework. The mathematics of the situation is, literally, chaotic when a small change occurs. Eventually, one such change will bring down something mission critical. There are techniques for mitigating such failures that would otherwise occur at an awkward time (during a sale or promotion). (See Too Big to Fail, Greed, The Butterfly Effect).

– Siloed cultures. Your technology operations lack the business knowledge to understand how business events may impact your normal operations. Or conversely, the business has not been informed of the transactional capacity ceiling that the website, database, storage, network and technology operations can handle. The importance of joint technology and business partnerships cannot be overstated. (See Comedy of Errors, Laurel and Hardy, The Keystone Kops and that famous line from Cool Hand Luke “…what we have here, is a failure to communicate.”).

– No joined-up thinking and control. Businesses need a solid incident management practice to collect, correlate and communicate data, understand it, and take appropriate isolation, service restoration or repair actions, while always keeping the customer experience foremost in mind. (See The Longest Day, A Bridge Too Far).

– Too many “Get Well Plans” / “Band Aids.” There is no point trying to solve the same set of problems with one approach that you repeat every year. This is the same logic as going over the top in the First World War, doing the same thing in the hope the enemy doesn’t expect the same thing. It is Einstein’s definition of insanity. (See All Quiet On The Western Front).
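To make the Yellow flag idea above a little more concrete, here is a minimal sketch of a three-state check with a crude trend estimate. Everything in it is an assumption for illustration only: the names `Thresholds`, `classify` and `samples_until_red`, the threshold values, and the straight-line extrapolation are hypothetical, and a real monitoring program would use whatever measures matter from your customer's perspective.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical warning and failure levels for one metric (e.g. % capacity used)."""
    yellow: float  # warning level: worth a look before anything breaks
    red: float     # failure level: the customer is already feeling it


def classify(value: float, t: Thresholds) -> str:
    """Return 'green', 'yellow' or 'red' for a single reading."""
    if value >= t.red:
        return "red"
    if value >= t.yellow:
        return "yellow"
    return "green"


def samples_until_red(history: Sequence[float], t: Thresholds) -> Optional[float]:
    """Estimate how many more samples until the metric reaches red.

    Uses a straight line through the first and last readings, which is a
    deliberately crude assumption. Returns None when there is too little
    data or the metric is flat or improving.
    """
    if len(history) < 2:
        return None
    if history[-1] >= t.red:
        return 0.0
    slope = (history[-1] - history[0]) / (len(history) - 1)
    if slope <= 0:
        return None
    return (t.red - history[-1]) / slope


if __name__ == "__main__":
    limits = Thresholds(yellow=70.0, red=90.0)
    readings = [60.0, 65.0, 70.0, 75.0]  # made-up sample data
    print(classify(readings[-1], limits))
    print(samples_until_red(readings, limits))
```

The point of the sketch is the middle state: a reading that is neither Green nor Red, plus a rough sense of how long you have before it turns Red, gives the team something to act on before the fire hoses come out.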

As a technology executive, you have a choice. You can wait for the clock to count down to your next major service outage – and play the role of a victim of circumstances. Or you can act.

The first step is obvious – acknowledge that there are problems, and that there are accepted methodologies to solve these problems. Acknowledge also that change may not require a large investment in capital (although some investment money will help) but a change in the way you align with the CEO and partner with your peers, and a change in the way you manage your organization and culture. And then get stuck into Root Causes, Action Plans (short, medium, or long term) and changes to (wait for it…) people, process, technology, measurement, accountability, incentives, and budget.

If you are on top of your Technology Operations game, then you have already identified the major areas of risk in your organization. And you have a solid measurement program, a crackerjack problem management team, and have made inroads partnering with the product management, marketing, development and engineering organizations. And you are using ITIL effectively as a framework for service design, service transition and service operations improvements.

If you’d like to know more, please join me in this conversation.

Mike Ross <TechOpsExec@gmail.com>.

#techopsexec
