It surprises me how few Operations organizations emphasize and measure automation progress and achievements. Operations is a target-rich environment of repetitive manual tasks across a vast quantity of servers, service requests, tickets, alarms, test cases, etc. A growing company adds more “stuff” annually and should be reluctant to add staff or operational costs. Automation is an essential tactic, or even a strategy, to manage more with the same or fewer staff. At the end of the day, it is imperative to transform your labor force from task doers to value creators, because those jobs stay while task doers get outsourced, off-shored or downsized. Basic metrics can tell the story of your progress: managed servers / FTE, TB managed / FTE, tickets / FTE, % actionable alarms auto-ticketed, etc.
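As a rough illustration of how those ratios could be tracked per reporting period, here is a minimal Python sketch. The `OpsSnapshot` fields and the `automation_metrics` function are names I made up for the example; where the raw counts come from (CMDB, ticketing system, monitoring) is up to you.

```python
from dataclasses import dataclass


@dataclass
class OpsSnapshot:
    """Raw counts for one reporting period (field names are illustrative)."""
    fte: float
    managed_servers: int
    tb_managed: float
    tickets: int
    actionable_alarms: int
    auto_ticketed_alarms: int


def automation_metrics(s: OpsSnapshot) -> dict:
    """Compute the per-FTE ratios and the auto-ticketing percentage."""
    if s.fte <= 0:
        raise ValueError("fte must be positive")
    # Avoid dividing by zero when no actionable alarms were raised.
    if s.actionable_alarms:
        pct_auto = 100.0 * s.auto_ticketed_alarms / s.actionable_alarms
    else:
        pct_auto = 0.0
    return {
        "servers_per_fte": s.managed_servers / s.fte,
        "tb_per_fte": s.tb_managed / s.fte,
        "tickets_per_fte": s.tickets / s.fte,
        "pct_alarms_auto_ticketed": pct_auto,
    }
```

Comparing these numbers quarter over quarter is what tells the story; a single snapshot says little on its own.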
I’m going to spend just this one paragraph on basic areas of automation opportunity and then move on to the more interesting and challenging topic of Managing Automation. The opportunities include IaaS (server, storage and network provisioning), alarm to auto-ticketing to self-healing, end-user account provisioning and password resets, Service Desk requests/fulfillment, backups, configuration management, software distribution (desktop and server), patching, syslog scanning, SLA Management and Operations reporting, and the mother lode: software regression testing. These are just the basic sets of activities that must be done, are repetitive and are high volume.
Now that we have quickly covered what can and should be automated (there are many more areas, and we haven’t even touched on business processes), some strategic thought should go into the operational components and requirements that apply to a single script as well as to an entire automation program:
- Labor Force: Employees / Suppliers
- Workflow / Tools
- Controls / Scheduling / Error handling
- Maintenance and Maintainability
Labor Force:
Whether your labor force is in-house, off-shored or outsourced, the incentive to automate (carrot preferably, but the stick may have to be used) must be a performance-based objective for teams or for the individual contributor. Even if individual contributors don’t have the skill set to automate themselves, they understand the process they follow (and the exceptions; more on that later) and can help identify opportunities to become more productive. For example, to perform their tasks they may have to log in to multiple systems and cut and paste information among them. The savvy employee will automate those tasks or steps with whatever tool or scripting language they already know. As a matter of fact, the ability to script and automate should be a mandatory condition for advancement. If you outsourced your operations, your vendor (hopefully partner) is literally banking on using automation to drive down their costs to run your operation. Make sure your contract allows you to share in those per-unit cost savings over time.
Workflow:
More sophisticated or large-scale automation efforts require some sort of workflow engine and task management system. There are many vendors, such as BMC (Remedy, Bladelogic), HP (CSA and Operations Orchestration), ServiceNow and IPsoft, that provide frameworks, services or complete SaaS options to help automate. If you go with a framework, you are also committing to training some staff on how to automate within the tools. All of these frameworks support the concept of taking an order/request and decomposing it into tasks that need to be performed (some of which can be automated and others that have to be assigned to a team queue or individual). I’ve seen very large enterprises and service providers that actually have all of the above in addition to the basic grass-roots use of scripting languages.
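To make the order-to-tasks idea concrete, here is a toy Python sketch. It is not how any of the products named above work internally; `Task`, `dispatch` and the `"ops-team"` queue name are all hypothetical and only show the split between automated steps and steps routed to a human queue.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Task:
    """One step of a decomposed order/request."""
    name: str
    handler: Optional[Callable[[], None]] = None  # None means a human has to do it
    queue: str = "ops-team"                       # hypothetical team queue name


def dispatch(tasks: list) -> dict:
    """Run the automated tasks now and list the ones routed to a team queue."""
    result = {"automated": [], "queued": []}
    for task in tasks:
        if task.handler is not None:
            task.handler()  # automated step runs immediately
            result["automated"].append(task.name)
        else:
            # A real engine would open a work item here; this sketch only records it.
            result["queued"].append(f"{task.queue}:{task.name}")
    return result
```

A real workflow engine adds ordering, dependencies, approvals and retries on top of this; the sketch leaves all of that out.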
Controls / Scheduling / Error Handling / Operational Reporting
Automation is so great that it performs tasks much more quickly than a human. Automation is so dangerous that it can screw up one thousand servers in an hour. Those screw-ups happen because of a typo, the wrong set of targets, or circumstances nobody anticipated. So a few cautionary notes and questions…
- Make sure you test on one, then several, before you unleash it on a very large set of targets, where a problem can have dire consequences for your business. (BTW, can you undo it? Probably with automation written on the fly in the middle of the night. Uh oh!!!!)
- How do you know the automation ran and worked successfully? How do you know it did all the tasks, all the servers, all the alarms, all the tickets that particular Thursday evening? How do you know if it failed halfway through the target?
- What happens when it hits a task or server that doesn’t respond correctly? Does it log/alert and move on? Does it stop entirely? Who knows and when will they know? When should they know? Or will you call the person who wrote it in the middle of the night, only to be told it can wait until morning?
- One often missed gotcha is that the automation never kicked off in the first place. NOCs are pretty good at catching a ‘bad’ event. They are typically horrific at catching good things that didn’t happen when they were supposed to. How do you know that everything that was supposed to run did indeed run on the targets it was supposed to?
- How is the automation kicked off? Manually? Automated? What is the business impact if the automation fails? Are your controls commensurate with the business impact?
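The questions above can be partly answered in the automation itself. Below is a minimal Python sketch, under my own assumptions, of a staged rollout that tries one target, then a few, then the rest, stops when a batch has too many failures, and always reports what succeeded, what failed and what was never attempted. The names `staged_rollout` and `missing_runs`, the stage sizes and the failure threshold are illustrative, not a standard.

```python
from typing import Callable, Iterable


def staged_rollout(targets: Iterable, action: Callable, stages=(1, 5), max_failures=0) -> dict:
    """Apply action to growing batches of targets; halt after a bad batch."""
    targets = list(targets)
    succeeded, failed = [], []
    start = 0
    # After the explicit stages, one final stage covers whatever is left.
    for size in list(stages) + [len(targets)]:
        batch = targets[start:start + size]
        if not batch:
            break
        batch_failures = 0
        for target in batch:
            try:
                action(target)
                succeeded.append(target)
            except Exception as exc:
                # Log and move on within the batch; the halt decision comes after it.
                failed.append((target, str(exc)))
                batch_failures += 1
        start += len(batch)
        if batch_failures > max_failures:
            break  # do not widen the blast radius
    return {
        "succeeded": succeeded,
        "failed": failed,
        "not_attempted": targets[start:],
    }


def missing_runs(expected_jobs: Iterable, reported_jobs: Iterable) -> set:
    """Jobs that should have reported in but did not (the 'never kicked off' case)."""
    return set(expected_jobs) - set(reported_jobs)
```

The `not_attempted` list answers "did it fail halfway through?", and `missing_runs` only works if every job reports completion somewhere that a scheduler or the NOC can compare against the list of what was expected.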
Maintenance and Maintainability
So you have been running automation successfully for a few weeks, months or even years. Then it stops working, maybe due to an O/S, database or application upgrade. Do you know who wrote the script and where to find it? Is it in a language many people are proficient in? Is the script documented / commented? Where is it documented that the script exists, what it does, who wrote it, when it was last changed and by whom, when it last ran successfully and what version it is? Scripts are code too, and you might want to think about how much SDLC rigor you need to apply.
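One lightweight way to answer those questions is to keep an inventory record per script. The Python sketch below is only an assumption about what such a record might hold; `ScriptRecord`, `needs_attention` and the seven-day threshold are made up for illustration, and version control plus a scheduler's run history may already give you most of these fields.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ScriptRecord:
    """Inventory entry for one automation script (fields are illustrative)."""
    name: str
    purpose: str
    author: str
    location: str            # e.g. repository path
    language: str
    version: str
    last_changed_by: str
    last_changed_on: datetime
    last_successful_run: Optional[datetime] = None


def needs_attention(rec: ScriptRecord, now: datetime,
                    max_age: timedelta = timedelta(days=7)) -> bool:
    """Flag scripts that never ran successfully or have not done so recently."""
    if rec.last_successful_run is None:
        return True
    return now - rec.last_successful_run > max_age
```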
Automate wisely. Or die painfully due to not trying or doing it poorly 😉