“Perception is reality.” (Lee Atwater)
“The first casualty when war comes is truth.” (Hiram W. Johnson)
“Reality is merely an illusion, albeit a very persistent one.” (Albert Einstein)
“You can’t control what you can’t measure.” (Tom DeMarco)
Much heat and light are generated by the storm of technical failure. Business leaders become passionate about the proximate causes of cataclysmic events, and they are usually very good at finding anecdotes or individual data points to “prove” their cases as well as to “point fingers.” Off-hand or direct comments are made such as “the systems are always down,” “performance is always slow,” or “the systems issues are impacting my P&L.”
There is nothing wrong with this, but it is only part of the story. Discovering the actual truth of a failure means doing so dispassionately, using logic, data and facts. (Of course, the nature of reality is always in dispute, but I shall be practical here and not indulge in ontological philosophy.)
When failure occurs, it is prudent to acknowledge it. You must determine the facts around it and communicate with those affected. You must act on a commitment to address the issue in a manner that is appropriate and proportionate to the severity of the failure. Ideally, you will have already established the trust of your business partner and have (a) well-established channels and modes of communication and (b) tools to monitor system performance, so that you can detect an outage or issue before your customer notices.
It is also important to develop your empathy and understand the impact of a failure in business terms, using business metrics. True impact should be understood in terms of lost revenue, idle staff, overtime, payment deferrals, lost customers and brand damage. Do not downplay a bad situation. However, be prepared to distinguish a raindrop from a cloudburst: business operations folks tend to think “the sky is falling” even when the incident is isolated.
If you lack a sense of proportion or empathy, and do not have facts to support your response, you will have little or no credibility. And everyone else’s perception becomes reality. No matter what you communicate, your peers will think the worst about you and your organization based on poor performance or, tragically for your career, the perception of poor performance.
Measurement is the key to your credibility as a technology executive. There is nothing like data and facts to defuse unnecessary emotion and to get people to focus on the issues or constraints. The trick is to measure the right things and to work to continually improve performance (operational, functional and financial).
Note also that, despite what many skeptical staff may tell you, it really is possible to measure quite a lot of seemingly intangible things, such as customer experience. If you have any doubts, please read Hubbard’s “How to Measure Anything: Finding the Value of ‘Intangibles’ in Business,” or look up estimation heuristics.
Imagine the scenario. Your operations team is focused on keeping the operations environment “up” and “green.” Any time a server goes down, the team is on top of the issue and replacing the hardware and/or troubleshooting the configuration of software. Unfortunately, even though all of the servers are “green” most of the time, your customer thinks the system performance is poor. Why is there a disconnect?
The answer is that the customer and the operations team are watching different parts of the same system, not the system as a whole. Systems are complicated beasts with many components and inter-dependencies, especially when server resources are shared. Without observing how the customer experiences the system, it is next to impossible for the operations team to do its job properly.
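One way to close that gap is to measure from the outside in: time the transaction the customer actually performs rather than asking each server whether it is “green.” Here is a minimal sketch in Python; the endpoint URL and the three-second “slow” threshold are illustrative assumptions, not a standard.

```python
# A minimal "outside-in" probe: time one end-to-end request the way a customer
# would experience it, instead of checking each server's individual health.
# The URL and the 3-second threshold are illustrative assumptions.
import time
import urllib.request

PROBE_URL = "https://example.com/store/checkout/health"  # hypothetical endpoint
SLOW_THRESHOLD_SECONDS = 3.0

def probe_customer_experience(url: str = PROBE_URL) -> dict:
    """Time a single end-to-end request and judge it as a customer would."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            ok = 200 <= response.status < 300
    except Exception:
        ok = False
    elapsed = time.monotonic() - start
    return {
        "success": ok,
        "seconds": round(elapsed, 3),
        "customer_would_call_this_slow": elapsed > SLOW_THRESHOLD_SECONDS,
    }

if __name__ == "__main__":
    print(probe_customer_experience())
```

A probe like this can report “slow” or “failed” even while every individual server dashboard stays green, which is exactly the disconnect described above.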
So, in practical terms, the opportunity comes down to this:
1. Get on top of your key technology operations metrics. Use standard ones, like those provided by the Technology Business Council, e.g., application availability. (A back-of-the-envelope availability calculation appears in the first sketch after this list.)
2. Divide your application portfolio into tiers. The top tiers include CSEs (Critical Shared Elements) and CBAs (Critical Business Applications) that have direct and substantial business impact on customers and/or internal users. Negotiate SLAs for each tier with your business partner or your executive team that are achievable given your constraints (cost, time, staff, technology). One tiering example is below, but realize that one size does not fit all. (A sketch of capturing tiers and SLA targets as data also follows the list.)
- Tier 0: Critical shared elements or services that impact multiple applications or services and have immediate negative impact on customers or critical business transactions (e.g. DNS, customer databases, product catalog, customer or employee authentication, data centers, internet connectivity). Such elements need a high level of infrastructure and design investment to ensure Disaster Recovery, High Availability and Business Continuity.
- Tier 1: Critical business applications directly impacting revenue and brand (e.g. POS, eCommerce, billing, provisioning/activation, call center technologies, order taking and fulfillment). Such applications are likely to need a high level of infrastructure for Disaster Recovery.
- Tier 2: Applications whose unplanned interruption may be impactful, but which do not warrant the highest level of infrastructure investment.
- Tier 3: Every other application (e.g. reporting systems, training systems, non-production environments).
3. Get to know your key services and system transactions. Measure volume, latency and throughput (e.g., web page hits, logins, orders, calls to Customer Care, fulfillment times, activations). Figure out when resources are most taxed, and observe and model behavior when system elements become constrained. (A sketch of summarizing transaction timings follows the list.)
4. Model and measure system performance from a customer perspective, and investigate causes of slowdown or sub-optimal performance (before such issues turn into outages!).
5. Make sure that the ops team and the customer measure what is important to each other, and are rewarded on combined success.
6. Use techniques such as Operational Science to attack chronic issues.
7. Understand your system as a series of models, predict behavior against scenarios, and compare the predictions with actual performance. This matters because you must understand performance better than your customer does; customers tend to have a very empirical view of what it takes for a system to function, and therefore cannot do very precise diagnostics or prescribe proper fixes. (A simple queueing model sketch follows the list.)
8. Communicate with your business partners. Listen to the customer and observe behavior. Educate yourself about their business, and educate your partner about yours. Ensure that you measure performance in the same way, and reward behavior in the business and technology teams that promotes a common understanding, alignment and business purpose.
9. I would be remiss not to mention that the best time to develop and implement these key performance indicators is before a single line of code is written or a single server is deployed. Work with your business partners to understand which business processes are being supported by the technology and which aspects they care about most in real time, and then partner with the development, vendor or engineering organizations to instrument the code to leave “bread crumbs” when transactions go left instead of right or, worse, don’t go at all. (A bread-crumb logging sketch is the last example below.)
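To make point 1 concrete, here is a minimal sketch of the classic availability calculation. The outage durations below are made-up sample numbers, not a benchmark.

```python
# A back-of-the-envelope sketch of the standard availability metric:
# availability = (scheduled minutes - unplanned outage minutes) / scheduled minutes.
# The outage list below is invented sample data.

def availability_pct(scheduled_minutes: float, outage_minutes: list[float]) -> float:
    """Return application availability as a percentage of scheduled time."""
    downtime = sum(outage_minutes)
    return 100.0 * (scheduled_minutes - downtime) / scheduled_minutes

# One 30-day month of 24x7 service with two unplanned outages (illustrative numbers).
minutes_in_month = 30 * 24 * 60           # 43,200 scheduled minutes
outages = [42.0, 13.5]                     # minutes of unplanned downtime
print(f"Availability: {availability_pct(minutes_in_month, outages):.3f}%")
# -> Availability: 99.872%
```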
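For point 2, one way to keep tier definitions honest is to capture them as data that can drive alerting and reporting rather than leaving them in a slide deck. This is a sketch only; the availability targets and restore times are illustrative assumptions, not recommendations.

```python
# A sketch of tier definitions and negotiated SLA targets captured as data.
# The specific targets are illustrative assumptions, not a standard.
from dataclasses import dataclass

@dataclass
class TierSLA:
    name: str
    description: str
    availability_target_pct: float    # negotiated with the business partner
    max_time_to_restore_hours: float

TIERS = {
    0: TierSLA("Tier 0", "Critical shared elements (DNS, auth, data centers)", 99.99, 1.0),
    1: TierSLA("Tier 1", "Critical business applications (POS, eCommerce, billing)", 99.9, 2.0),
    2: TierSLA("Tier 2", "Impactful, but not worth the highest infrastructure spend", 99.5, 8.0),
    3: TierSLA("Tier 3", "Everything else (reporting, training, non-production)", 99.0, 24.0),
}

def meets_sla(tier: int, measured_availability_pct: float) -> bool:
    """Compare a measured availability figure against the tier's negotiated target."""
    return measured_availability_pct >= TIERS[tier].availability_target_pct

print(meets_sla(1, 99.872))   # -> False: a Tier 1 app at 99.872% misses a 99.9% target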
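For point 3, here is a sketch of turning raw transaction timings into the handful of numbers worth watching: volume, throughput and latency percentiles. The sample latencies are invented.

```python
# A sketch of summarizing one measurement window of transaction latencies.
# The sample data is made up; real numbers would come from logs or an APM tool.
import statistics

def transaction_summary(latencies_ms: list[float], window_seconds: float) -> dict:
    """Summarize volume, throughput and latency for one measurement window."""
    ordered = sorted(latencies_ms)
    p95_index = max(0, int(round(0.95 * len(ordered))) - 1)
    return {
        "volume": len(ordered),
        "throughput_per_sec": len(ordered) / window_seconds,
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],    # the slow tail is what customers remember
        "max_ms": ordered[-1],
    }

# Five minutes of order-submission timings (illustrative values).
sample = [120, 135, 128, 900, 142, 130, 125, 1800, 138, 127]
print(transaction_summary(sample, window_seconds=300))
```

Note how an average would hide the two slow orders entirely; the p95 and maximum are what your customer will remember.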
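For point 7, even a crude queueing model goes a long way toward predicting behavior under load. This sketch uses a single-server M/M/1 approximation with an assumed 200 ms service time; your own systems will need their own measured inputs.

```python
# A sketch of a simple capacity model: M/M/1 average response time versus load.
# The arrival rates and the 200 ms service time are illustrative assumptions.

def mm1_response_time_ms(arrival_rate_per_sec: float, service_time_ms: float) -> float:
    """Average response time for a single-server queue; blows up as utilization -> 1."""
    service_rate_per_sec = 1000.0 / service_time_ms
    utilization = arrival_rate_per_sec / service_rate_per_sec
    if utilization >= 1.0:
        return float("inf")   # the system cannot keep up; the queue grows without bound
    return service_time_ms / (1.0 - utilization)

for load in (1, 2, 3, 4, 4.5, 4.9):          # requests per second
    print(f"{load:>4} req/s -> {mm1_response_time_ms(load, 200):7.0f} ms")
# Response time roughly doubles at 50% utilization and explodes near saturation,
# which is why "all the servers are green" and "the system is slow" can both be true.
```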
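And for point 9, a sketch of the “bread crumbs” idea: wrap each business-meaningful step so a failed or slow transaction leaves a trail that operations can follow. The step names and the order flow here are hypothetical.

```python
# A sketch of instrumenting code to leave "bread crumbs" for each transaction step.
# The step names and the order flow are hypothetical placeholders.
import logging
import time
import uuid
from functools import wraps

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("breadcrumbs")

def breadcrumb(step_name):
    """Log start, completion time, and failure for one step of a transaction."""
    def decorator(func):
        @wraps(func)
        def wrapper(txn_id, *args, **kwargs):
            start = time.monotonic()
            log.info("txn=%s step=%s status=started", txn_id, step_name)
            try:
                result = func(txn_id, *args, **kwargs)
                log.info("txn=%s step=%s status=ok elapsed_ms=%.0f",
                         txn_id, step_name, 1000 * (time.monotonic() - start))
                return result
            except Exception:
                log.exception("txn=%s step=%s status=failed", txn_id, step_name)
                raise
        return wrapper
    return decorator

@breadcrumb("reserve_inventory")
def reserve_inventory(txn_id, sku, qty):
    return True   # placeholder for the real inventory call

@breadcrumb("charge_card")
def charge_card(txn_id, amount):
    return True   # placeholder for the real payment call

txn = uuid.uuid4().hex[:8]
reserve_inventory(txn, sku="ABC-123", qty=1)
charge_card(txn, amount=49.99)
```

With crumbs like these in place from day one, the question “where did the order go?” becomes a log search instead of a war room.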
I’ll be covering many of the relationship and technical problems in future posts. In the meantime, let me know if you have any questions.
Mike Ross <TechOpsExec@gmail.com>.
#techopsexec
