Landmine #6: Where is all my stuff? Get to know your inventory, utilize it, and optimize it.

“For want of a nail the shoe was lost.

For want of a shoe the horse was lost.

For want of a horse the rider was lost.

For want of a rider the message was lost.

For want of a message the battle was lost.

For want of a battle the kingdom was lost.

And all for the want of a horseshoe nail.”

(Proverb)

“The perversity of the Universe tends towards a maximum.”

(Finagle’s Law)

No one likes surprises, least of all CxOs. As a technology executive, you need to be armed with facts, data, and understanding so that you can reason about systems, predict their behavior, and control it. Otherwise you may end up the victim of a cascade of inadvertent failures, each one innocuous on its own, but all adding up to chaos. There’s nothing like a systems failure or a data breach to hurt financial and brand performance, and to destroy your personal credibility.


One of the key mitigations for a complex system is to understand all of its components and to maintain an inventory of those components and the relationships (or dependencies) between them. With that inventory, you can model how the system will behave under a given set of circumstances.

The heart of the matter is that you need to know what technology resources you have: each box’s failure points (e.g. scale limits, certificate expiry dates, patch cycles); which software each box is running; who owns each box; how the boxes depend on one another; and the licenses and economics associated with each box.

You especially don’t want to be in the position where your customer knows about a systems outage and/or subpar performance before you do (through “shadow IT” monitoring and an empirical, tactical understanding of the system).

The ITIL framework is built around understanding a system as an inventory, with hardware, software, licenses, certificates, I/O interfaces, cron jobs or scripts, processes, and people as components of that inventory. This lets you model the effect of various types of component failure on the system as a whole, and helps you identify where changes can be made to make the system more robust.
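
To make that concrete, here is a minimal sketch (in Python, with made-up hosts, teams, and fields rather than any particular CMDB product) of the two ideas that matter: every component becomes a record with an owner, its software, and its expiry dates, and the dependency links let you ask “if this box fails, what else is at risk?”

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical, minimal configuration items. A real CMDB holds far more detail.
@dataclass
class ConfigItem:
    name: str
    owner: str                       # accountable team or person
    software: list[str] = field(default_factory=list)
    cert_expiry: date | None = None  # certificate expiry, if any
    depends_on: list[str] = field(default_factory=list)

# Toy inventory: a web tier depending on an app server, which depends on a database.
inventory = {
    "web-01": ConfigItem("web-01", "ecommerce-team", ["nginx"],
                         cert_expiry=date(2025, 11, 1), depends_on=["app-01"]),
    "app-01": ConfigItem("app-01", "ecommerce-team", ["order-service"],
                         depends_on=["db-01"]),
    "db-01":  ConfigItem("db-01", "dba-team", ["postgres"]),
}

def impacted_by(failed: str) -> set[str]:
    """Everything that (transitively) depends on the failed component is at risk."""
    at_risk, frontier = set(), {failed}
    while frontier:
        nxt = {name for name, ci in inventory.items()
               if set(ci.depends_on) & frontier and name not in at_risk}
        at_risk |= nxt
        frontier = nxt
    return at_risk

print(impacted_by("db-01"))  # {'app-01', 'web-01'}: a DB failure puts both tiers at risk
```

Even a toy like this is enough to start answering the questions below; a real inventory adds licenses, interfaces, cron jobs, and people on top of it.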

Having an inventory is a prerequisite to predicting system behavior (and understanding Yellow Flags). It means you can be prepared for load and for market events (Black Friday, Cyber Monday, etc.). It means you can prepare for Disaster Recovery, Business Continuity, and High Availability, depending on your residual risk posture.

With this knowledge, you can build out (and test) failover strategies, contingency plans, and failure isolation plans. You can also improve the efficiency of your assets, knowing which ones need critical care, and retire or replace the non-performing ones. You can determine whether you are spending too much or too little, and figure out where a different process, staffing model, alarm strategy, or reserve inventory strategy can help you the most, for the least amount of cost.

Most important of all, knowing your inventory and understanding the system in terms of every component improves your credibility and worth. You will be practiced and prepared when an unexpected problem hits, because you can isolate it more easily and resolve it more quickly. You will, as a result, add value for your customers, internal and external!

Let me put this in practical terms. Not knowing what you have can put you in the following situations:

– You face unplanned software ELA true-up payments (a surprise the CFO will especially not appreciate).

– You will miss patching cycles on servers, potentially exposing you to security and performance risks.

– You will likely miss system alarms, because you don’t know which systems produce alarms.

– You will have unlabelled cables connected to a server (do you want to play “kerplunk” and see what happens when you disconnect one, or is it a backdoor for hackers?).

– Worse, someone might be compromising your systems through an uninventoried server running a trojan, rootkit, or some nasty bot (sifting through your data and transactions and sending them off to China, India, Russia, Afghanistan, etc.) because it evades your scans.

– License keys and certificates have a nasty habit of expiring at the worst possible moment, causing unplanned outages (a simple expiry sweep is sketched just after this list).

– You will have gaps in maintenance coverage, which really won’t help you when a server fails.

– Data center migration becomes challenging when you don’t know all of the servers that are part of your production environment.

– You are likely not using all of the available compute and/or storage capacity, and thus your ROIC is not maximized, contributing negatively to financial performance.

– You may not even know the types, sizes, or frequency of the packets on your network without the tools to monitor them. This is an important, and often overlooked, part of your inventory.

– You will miss the point at which a non-production server crosses the line and becomes part of “production”, and you will only find out when that server fails, because normal production rules were never applied to it.

– You will have issues with compliance (SOX, PCI, PII, HIPAA, etc.).
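
To pick one example from the list above, take the certificate bullet: once expiry dates live in your inventory, a sweep along the lines of the sketch below (the hostnames and the 30-day threshold are made up) turns “the cert expired at 2 a.m.” into a ticket filed a month in advance.

```python
import socket
import ssl
import time

# Hypothetical host list; in practice this comes straight out of the inventory.
HOSTS = ["www.example.com", "api.example.com"]
WARN_DAYS = 30  # flag anything expiring within 30 days

def cert_days_left(host: str, port: int = 443) -> int:
    """Return the number of days until the host's TLS certificate expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    return int((ssl.cert_time_to_seconds(not_after) - time.time()) // 86400)

for host in HOSTS:
    try:
        days = cert_days_left(host)
        status = "WARN" if days <= WARN_DAYS else "ok"
        print(f"{status:4} {host}: certificate expires in {days} days")
    except OSError as exc:
        print(f"FAIL {host}: could not check certificate ({exc})")
```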

Such issues are commonplace in technology shops that haven’t learnt the benefits of ITIL and inventory. These days, when systems administrators and application teams increasingly lack a basic knowledge of network fundamentals, it is even more important to model the behavior of your operations environment:

– Planned or unplanned downtime of one system impacts other systems, because computing resources in data centers are now much more likely to be shared.

– Similarly, increased demand on one application can impact other applications (especially in an SOA-based architecture, where you need strong performance and capacity oversight of your composite and fine-grained service calls and DB requests).

– Each new development release introduces interfaces that may not be documented, causing problems for monitoring and application support teams.

– Disaster recovery plans fail because, even though you’ve funded the core operational servers, those servers rely on a server that no one other than the developers has heard of (and which was therefore never included in the DR plan)! A simple coverage check for exactly this gap is sketched below.
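
That last bullet is exactly the kind of gap an inventory-driven check can catch before the disaster. Here is a minimal, hypothetical sketch: the server names and the dependency map are invented, but the idea carries over directly: flag any dependency of a DR-covered server that the plan does not itself cover.

```python
# Hypothetical data: who depends on whom, and which servers the DR plan funds.
# In practice both come out of the inventory, not a hand-written dict.
depends_on = {
    "order-app":   ["order-db", "license-srv"],  # license-srv: the box only the developers know about
    "order-db":    [],
    "license-srv": [],
    "web-front":   ["order-app"],
}
dr_plan = {"web-front", "order-app", "order-db"}

def dr_gaps(depends_on: dict[str, list[str]], dr_plan: set[str]) -> set[str]:
    """Return every (transitive) dependency of a DR-covered server
    that is not itself covered by the DR plan."""
    gaps, stack = set(), list(dr_plan)
    seen = set(stack)
    while stack:
        for dep in depends_on.get(stack.pop(), []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
            if dep not in dr_plan:
                gaps.add(dep)
    return gaps

print(dr_gaps(depends_on, dr_plan))  # {'license-srv'}: the forgotten server
```

Run against the real inventory, this is the check that would surface the developers’ mystery server before the DR test fails, not after.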

This is a true story. When I started at a big company, I asked if “everything” was being monitored. My team said “Yes, of course.” After six outages in a three-week period, I asked the team the same question. It turned out that the real answer was “Yes, we monitor all the servers that we know about.” We had a lot to work on after that!

This is a large topic. Please email me if you have any questions.

Mike Ross <TechOpsExec@gmail.com>.

#techopsexec
