Operational Failure Forensics (OFF): NASDAQ’s Thursday “Glitch”

Another prominent company has had a very public outage, and it may have directly or indirectly impacted your IRA, 401(k), and investment accounts by removing almost half a day’s worth of trading. Day traders must have been climbing the walls!

From the outside looking in, here is a compilation of facts, stories, and questions about the outage:

Brand Damage:

NASDAQ in Fresh Market Failure (WSJ) http://online.wsj.com/article/SB10001424127887324619504579028873794227410.html

(USA Today) http://www.usatoday.com/story/money/markets/2013/08/22/nasdaq-trading-freeze-reputation/2686883/

(NBC News) http://www.nbcnews.com/business/flash-freeze-halts-nasdaq-stock-trading-3-hours-6C10974922

Business Impact:

3,000 stocks couldn’t be traded for up to 3 hours, including Apple, Facebook, and Microsoft

Fines and Penalties

Loss of competitive position

Political capital spent: President Obama was informed of the outage. Financial institutions and politicians are questioning your capability and competence.

Incident Management:

Communications:

Some trading partners were not notified for two hours

Command and Control:

Were the right people informed, consulted, and engaged in a timely manner? What decisions needed to be made, by whom, and when? Were the right options available? Which information and data were accurate, which were red herrings, and what was missing? What was needed but wasn’t available?

Restore / Repair:

Reportedly the glitch itself was fixed within 30 minutes, but orderly recovery and coordination took 2½ hours. Was the communication and restoration plan already written? Is 2½ hours acceptable? What could be done to dramatically reduce the time to restore full trading capability? Is there a restoration checklist? Was the checklist followed?

Technical and Operational Design:

Redundancy vs. Resiliency and Self-Healing

Compartmentalization and Isolation: Why would this one failure stop all trading? (A minimal isolation sketch follows this list.)

Scenario Planning: Was this incident ever considered possible?

If the connectivity is with a third party, how dependent are we on that third party’s stability? What is missing, in terms of the operational relationship, capabilities, agreements, contracts, penalties, communications, and processes with the third party, that is proportionate to the business impact it has on NASDAQ?
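
One common software answer to the compartmentalization question is a circuit breaker: isolate the failing connection or feed so the rest of the system keeps working instead of halting everything. Here is a minimal sketch (Python, standard library only; the thresholds and the wrapped call are illustrative assumptions, not NASDAQ’s actual design):

    import time
    from typing import Any, Callable

    class CircuitBreaker:
        """Isolate a flaky dependency: after repeated failures, fail fast
        instead of letting one bad connection stall everything behind it."""

        def __init__(self, max_failures: int = 5, reset_after_s: float = 30.0) -> None:
            self.max_failures = max_failures    # consecutive failures before opening
            self.reset_after_s = reset_after_s  # cool-down before retrying the dependency
            self.failures = 0
            self.opened_at: float | None = None

        def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
            # While the breaker is open, reject immediately until the cool-down expires.
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_after_s:
                    raise RuntimeError("circuit open: dependency isolated, failing fast")
                self.opened_at = None  # cool-down over: let the next call test the dependency
            try:
                result = func(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()  # open: stop hammering the dependency
                raise
            self.failures = 0  # a success closes the breaker again
            return result

Wrapping each partner connection in its own breaker (for example, a hypothetical breaker.call(send_quote, quote) per partner) means one partner’s failure trips only that breaker, rather than stopping trading across the board.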

Monitoring:

Was there any degradation or early warning of an impending problem? Were there any alarms at all, or did things go from green to red in a heartbeat? Are we measuring just connectivity (link up/down), or transaction flow volume, success/failure, and latency?
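
To make that last question concrete, here is a minimal sketch of a synthetic-transaction probe that tracks success/failure and latency rather than just link state (Python, standard library only; the endpoint, thresholds, and alert actions are assumptions for illustration, since real exchange feeds are proprietary):

    import statistics
    import time
    import urllib.request

    # Hypothetical health-check endpoint for a synthetic "heartbeat" transaction.
    PROBE_URL = "http://feed-gateway.example.internal/healthcheck"
    LATENCY_BUDGET_MS = 250      # assumed service-level threshold
    FAILURES_BEFORE_ALARM = 3    # consecutive failures before alarming

    def probe_once(timeout_s: float = 2.0) -> float | None:
        """Run one synthetic transaction; return latency in ms, or None on failure."""
        start = time.monotonic()
        try:
            with urllib.request.urlopen(PROBE_URL, timeout=timeout_s):
                pass
        except OSError:
            return None
        return (time.monotonic() - start) * 1000.0

    def monitor(interval_s: float = 5.0) -> None:
        """Watch transaction success and latency, not just link up/down."""
        latencies: list[float] = []
        consecutive_failures = 0
        while True:
            latency_ms = probe_once()
            if latency_ms is None:
                consecutive_failures += 1
                if consecutive_failures >= FAILURES_BEFORE_ALARM:
                    print("ALERT: synthetic transactions failing; feed may be down")
            else:
                consecutive_failures = 0
                latencies = (latencies + [latency_ms])[-100:]  # sliding window
                p95 = (statistics.quantiles(latencies, n=20)[-1]
                       if len(latencies) >= 20 else max(latencies))
                if p95 > LATENCY_BUDGET_MS:
                    print(f"WARNING: p95 latency {p95:.0f} ms exceeds budget")
            time.sleep(interval_s)

    if __name__ == "__main__":
        monitor()

A probe like this can surface degradation (rising latency, intermittent failures) before a hard outage, which is exactly the early warning the green-to-red question is about.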

Change Management:

What changed? Who changed it? NASDAQ or a partner?

Was it tested?

Root Cause:

Oh yeah, what the hell happened? Was it a people, process, or technology failure? In almost all cases, it is all of the above.

A twist on the last blog post: how many of you are reviewing the roadkill of this outage and saying, “That could have been us!”? What are you doing about it? Or are you preparing three envelopes?

One thought on “Operational Failure Forensics (OFF): NASDAQ’s Thursday “Glitch””

  1. Update from Bloomberg on the technical aspects of the failure, though not necessarily the root cause:
    http://www.bloomberg.com/news/2013-08-26/nasdaq-three-hour-halt-highlights-vulnerability-in-market.html

    If it was indeed a network box failure, it probably was not due to a lack of redundancy but a lack of resiliency. Some large network router/switch vendors (to remain nameless) have challenges with software failures that prevent traffic from flowing while the box still appears healthy to basic monitoring tools. Therefore, there is no trigger to fail over automatically to the redundant box, which must be done manually.
    Mike
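
A minimal sketch of the approach Mike describes: trigger failover from an end-to-end traffic check rather than from the device’s own health reporting. The probe and switch-over actions below are placeholders, not any vendor’s actual API:

    import time
    from typing import Callable

    def run_with_failover(
        primary_traffic_flows: Callable[[], bool],   # e.g. a synthetic end-to-end transaction
        standby_traffic_flows: Callable[[], bool],
        switch_to_standby: Callable[[], None],       # e.g. move a VIP or update routing
        max_failures: int = 3,
        interval_s: float = 5.0,
    ) -> None:
        """Fail over when traffic stops flowing end to end, even if the
        primary box still reports itself as healthy."""
        consecutive_failures = 0
        while True:
            if primary_traffic_flows():
                consecutive_failures = 0
            else:
                consecutive_failures += 1
                if consecutive_failures >= max_failures:
                    if standby_traffic_flows():
                        switch_to_standby()
                        print("Failed over to standby path")
                        return
                    print("ALERT: primary path failing and standby is not ready")
            time.sleep(interval_s)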
