Another prominent company had a very public outage, and it may have directly or indirectly impacted your IRA, 401(k), and investment accounts by removing almost half a day's worth of trading. Day traders must have been climbing the walls!
From the outside looking in, here is a compilation of facts, stories, and questions about the outage:
Brand Damage:
NASDAQ in Fresh Market Failure (WSJ) http://online.wsj.com/article/SB10001424127887324619504579028873794227410.html
http://www.usatoday.com/story/money/markets/2013/08/22/nasdaq-trading-freeze-reputation/2686883/
http://www.nbcnews.com/business/flash-freeze-halts-nasdaq-stock-trading-3-hours-6C10974922
Business Impact:
3,000 stocks couldn't be traded for up to 3 hours, including Apple, Facebook, and Microsoft
Fines and Penalties
Loss of competitive position
Political capital spent: President Obama was informed of the outage, and financial institutions and politicians are questioning your capability and competence.
Incident Management:
Communications:
Some trading partners were not notified for two hours
Command and Control:
Were the right people informed, consulted, and engaged in a timely manner? What decisions needed to be made, by whom, and when? Were the right options available? What information was accurate, what was a red herring, and what was missing? What was needed but wasn't available?
Restore / Repair:
Reportedly the glitch itself was fixed in 30 minutes, but orderly recovery and coordination took another 2½ hours. Was the communication and restoration plan already written? Is 2½ hours acceptable? What could be done to dramatically reduce the time to restore full trading capability? Is there a restoration checklist? Was the checklist followed?
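To make the restoration-checklist question concrete, here is a minimal sketch of a checklist kept as executable data, so every step is timed and the post-mortem can show exactly what was done and how long the recovery took. The step names and the 30-minute target below are my own illustrative assumptions, not NASDAQ's actual runbook.

```python
# Hypothetical restoration-checklist sketch: each step is executed through a
# callback and timed, so "was the checklist followed, and how long did it take?"
# has an auditable answer. Step names and the target are assumptions.
import time

RESTORATION_STEPS = [
    "confirm fault is isolated and the fix is applied",
    "verify quote feed integrity with partners",
    "notify trading partners and regulators",
    "re-open a pilot set of symbols",
    "monitor the pilot set for stability",
    "re-open all symbols and confirm normal volume",
]
TARGET_MINUTES = 30  # assumed goal; the actual recovery reportedly took ~2.5 hours

def run_checklist(execute_step):
    """execute_step(step) does the real work; this wrapper only times and logs it."""
    started = time.monotonic()
    log = []
    for step in RESTORATION_STEPS:
        t0 = time.monotonic()
        execute_step(step)
        log.append((step, time.monotonic() - t0))
    total_min = (time.monotonic() - started) / 60.0
    print(f"restoration took {total_min:.1f} min (target {TARGET_MINUTES} min)")
    return log

if __name__ == "__main__":
    # Dry run with a no-op executor, just to exercise the timing and logging.
    run_checklist(lambda step: print("step:", step))
```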
Technical and Operational Design:
Redundancy vs. Resiliency and Self-Healing
Compartmentalization and Isolation: Why would this failure stop trading? (A rough sketch of this idea follows below.)
Scenario Planning: was this incident ever viewed as possible?
If the connectivity is with a third party, how dependent are we on that third party's stability? What is missing in the operational relationship (capabilities, agreements, contracts, penalties, communications, processes) that would be proportionate to the business impact the third party has on NASDAQ?
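As one illustration of the compartmentalization point above, here is a rough sketch of a per-partner circuit breaker, so that a single misbehaving connection degrades only its own slice of trading rather than halting everything. The partner names, failure threshold, and cool-off period are assumptions for the sketch, not how NASDAQ's systems are actually built.

```python
# Hypothetical isolation sketch: one circuit breaker per partner connection,
# so tripping one partner does not halt trading everywhere else.
# Partner names, thresholds, and cool-off values are illustrative assumptions.
import time

class CircuitBreaker:
    def __init__(self, name, max_failures=5, cooloff_s=30.0):
        self.name = name
        self.max_failures = max_failures
        self.cooloff_s = cooloff_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def allow(self):
        """Route traffic through this partner only while its circuit is closed."""
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at > self.cooloff_s:
            self.opened_at, self.failures = None, 0  # half-open: try again
            return True
        return False

    def record(self, success):
        """Count consecutive failures and trip only this partner's breaker."""
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

# One breaker per partner; tripping "partner-x" leaves the others trading.
breakers = {p: CircuitBreaker(p) for p in ("partner-x", "partner-y", "partner-z")}
```

The questions about third-party agreements and penalties then have a technical counterpart: what the system does, automatically, when one of those connections misbehaves.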
Monitoring:
Was there any degradation or early warning of an impending problem? Were there any alarms at all, or did things go from green to red in a heartbeat? Are we measuring just connectivity (link up/down), or also transaction flow volume, success/failure, and latency?
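To illustrate that last question, here is a minimal monitoring sketch that checks both the green light (can a connection be opened?) and what actually matters (do synthetic transactions round-trip, at what success rate, and within what latency budget?). The endpoint name, latency budget, and success-rate floor are assumptions for the sketch, not real NASDAQ thresholds.

```python
# Hypothetical monitoring sketch: link up/down alone is not enough.
# The endpoint, latency budget, and success-rate floor are illustrative assumptions.
import socket
import time

QUOTE_HOST, QUOTE_PORT = "quote-feed.example.net", 9001  # assumed endpoint
LATENCY_BUDGET_MS = 50     # assumed per-transaction budget
MIN_SUCCESS_RATE = 0.99    # assumed floor before alarming

def link_is_up(host, port, timeout=2.0):
    """Basic connectivity check: the green light that can lie."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def probe_transaction(host, port, payload=b"PING\n", timeout=2.0):
    """Synthetic end-to-end transaction: send, wait for a reply, time it."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(payload)
            s.recv(64)  # expect some acknowledgement back
        return (time.monotonic() - start) * 1000.0  # latency in ms
    except OSError:
        return None  # failed transaction

def health_snapshot(samples=20):
    """Measure flow, not just the link: success rate and worst-case latency."""
    latencies = [probe_transaction(QUOTE_HOST, QUOTE_PORT) for _ in range(samples)]
    ok = [l for l in latencies if l is not None]
    success_rate = len(ok) / samples
    worst = max(ok) if ok else float("inf")
    degraded = success_rate < MIN_SUCCESS_RATE or worst > LATENCY_BUDGET_MS
    return {
        "link_up": link_is_up(QUOTE_HOST, QUOTE_PORT),
        "success_rate": success_rate,
        "worst_latency_ms": worst,
        "degraded": degraded,
    }
```

The point of the sketch: a feed can report link_up as true while degraded is also true, and only the second signal gives any early warning.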
Change Management:
What changed? Who changed? NASDAQ or Partner?
Was it tested?
Root Cause
Oh yeah, what the hell happened? People, Process or Technology failure? In almost all cases, it is all of the above.
A twist on the last blog: how many of you are reviewing the roadkill of this outage and saying, "That could have been us!"? What are you doing about it? Or are you preparing 3 envelopes?
Update from Bloomberg on the technical aspects of the failure, though not necessarily the root cause:
http://www.bloomberg.com/news/2013-08-26/nasdaq-three-hour-halt-highlights-vulnerability-in-market.html
If it was indeed a network box failure, it probably was not due to a lack of redundancy but to a lack of resiliency. Some large network router / switch vendors (who shall remain nameless) have challenges with software failures that stop traffic from flowing while the box still appears healthy to basic monitoring tools. As a result, there is no trigger for automatic failover to the redundant box, and the failover must be done manually.
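To make that concrete, here is a rough sketch of failover logic keyed to whether traffic is actually flowing end to end, rather than to the device's own status. The device names and the two placeholder checks are assumptions for illustration, not a real vendor API, and the vendor behavior described above is reported rather than confirmed.

```python
# Hypothetical resiliency sketch: fail over on end-to-end symptoms, not on
# whether the primary device claims to be healthy. Device names and the
# placeholder checks below are illustrative assumptions, not a vendor API.
import time

PRIMARY, STANDBY = "router-a", "router-b"

def device_reports_healthy(device):
    """Stand-in for an SNMP/CLI status poll: the check that can say green while traffic is stuck."""
    return True  # placeholder value for the sketch

def traffic_is_flowing(device):
    """Stand-in for a transaction-level probe through this path."""
    return False  # placeholder value: pretend the path is silently wedged

def choose_active_path():
    """Prefer the primary only if traffic actually flows through it."""
    if traffic_is_flowing(PRIMARY):
        return PRIMARY
    if device_reports_healthy(PRIMARY):
        print("primary reports healthy but passes no traffic; failing over")
    return STANDBY

if __name__ == "__main__":
    active = choose_active_path()
    print(f"{time.ctime()}: routing via {active}")
```

The design point is simply that the failover trigger should come from the same transaction-level measurements discussed under Monitoring above, not from the box's self-assessment.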
Mike