The troubled IT launch of the National Health Care enrollment systems at the federal and state levels is a recent reminder that planning, design, architecture, and stress / performance testing are all critical to staying off the front page.
Analysis: IT experts question architecture of Obamacare website
Problems at main Obamacare website are being fixed: White House …
Saturday Night Live even joked that it is the equivalent of 1-800 Flowers crashing on Valentine’s Day.
For video operators, their Super Bowl is, well, the Super Bowl. For phone / wireless companies, it is Mother’s Day. Brick-and-mortar retailers have Black Friday plus the five to six weeks until New Year’s. Online retailers have Cyber Monday, health care providers have Open Enrollment, and colleges and universities have registration week. For Wall Street exchanges, every day is a Super Bowl (especially for any high-profile IPO launch), and no one is happy when regulatory agencies and Congress get involved because things went down. For the Top Technology Operations Executive (aka the BIG TOE), these are the moments where praise is lacking when everything goes well. When things don’t go well, criticism and scrutiny abound and executive searches begin.
Capacity Planning / Capacity Management is not new and is not rocket science. One looks foolish when things tip over during a planned event. Even if you set expectations, based on business forecasts, about the ceiling you built and tested to, you might be safe but are not necessarily so.
Here are the 4 stupid questions for which the Big TOE should know the answers or have a process for getting the answers quickly:
- What applications / systems / databases / networks / services will be stressed due to the event or period of time?
- What is their theoretical, tested or historical production peak ceiling capacity?
- What is the forecasted load expected for the event?
- Are you OK, or is there a gap? If there is a gap, is there a funded plan to address it?
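The four questions reduce to a simple comparison per system: known ceiling versus forecasted load. Below is a minimal sketch of that check, assuming capacity and forecast can both be expressed in transactions per second. The system names and numbers are hypothetical and purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class SystemCapacity:
    name: str
    ceiling_tps: float   # theoretical, tested, or historical peak (assumed unit: tx/sec)
    forecast_tps: float  # forecasted peak load for the event


def headroom(system: SystemCapacity) -> float:
    """Spare capacity as a fraction of the forecast; negative means a gap."""
    return (system.ceiling_tps - system.forecast_tps) / system.forecast_tps


# Hypothetical inventory of systems stressed by the event.
systems = [
    SystemCapacity("checkout-api", ceiling_tps=1200, forecast_tps=1500),
    SystemCapacity("orders-db", ceiling_tps=4000, forecast_tps=2500),
]

# Systems whose ceiling is below the forecast need a funded plan.
gaps = [s.name for s in systems if headroom(s) < 0]

for s in systems:
    status = "GAP" if headroom(s) < 0 else "OK"
    print(f"{s.name}: headroom {headroom(s):+.0%} -> {status}")
```

A real inventory would likely track more than one dimension per system (connections, storage, bandwidth), but the gap test stays the same for each.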
Capacity Management is a good microcosm of the many areas that the Big TOE must master with solid discipline. Below is just a sample of the areas that should be examined…
Assessment Process – What areas are:
- Red – high risk or unknown?
- Yellow – of concern – watch and have mitigation plan ready
- Green – performance and capacity tested or low risk area
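One way to make the Red / Yellow / Green assessment repeatable is to derive it from the same numbers every time. The sketch below is an assumption about how the colors might be assigned, not a standard: an unknown or insufficient ceiling is Red, a ceiling with less than a chosen safety margin over the forecast is Yellow, and anything else is Green. The 25% default margin is a hypothetical value.

```python
from typing import Optional


def rag_status(ceiling: Optional[float], forecast: float, margin: float = 0.25) -> str:
    """Classify one system; `ceiling` is None when capacity was never tested."""
    # Red: unknown capacity, or a ceiling already below the forecast.
    if ceiling is None or ceiling < forecast:
        return "Red"
    # Yellow: capacity covers the forecast but with less than the safety margin.
    if ceiling < forecast * (1 + margin):
        return "Yellow"
    # Green: tested capacity comfortably above the forecast.
    return "Green"


print(rag_status(None, 1000))   # untested system
print(rag_status(1100, 1000))   # thin headroom
print(rag_status(2000, 1000))   # comfortable headroom
```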
History
- Last year’s volume and transaction rates and latency times
- What failed / disappointed last time?
- What transaction paths, interfaces, services, and applications have been introduced since the last peak period and have not been performance tested to the expected volumes?
- Are there known design risk areas where staff raised scalability concerns that went unheeded due to pressure to get functionality into production?
Forecasts
- Sales Targets
- Product Mixes
- Product Launches
- Customer demand expectation
- What will the business be doing / asking that is new or different from last year?
Stress Points
- Pipes: Store network / data center WAN
- Log file space allocations, Database size and connection threads
- Middleware connections
- # of Servers, Caching systems
- Load Balancers and Firewalls
- People (Operations, Engineering, suppliers, vendors, command & control)
Performance / Capacity Testing Competency and Capability
- How confident are you with your testing?
- Environment: Replica or scaled down with extrapolation
- Did you test with expected load (and appropriate transaction mix)?
- Right tools with the right people in the right way?
Redundancy & Resiliency Assessment
- “What if” scenario planning
- Fail-over and/or overflow readiness
Control Points / Monitoring / Response process / plan
- What are your levers to throttle the demand or allocate more resources quickly?
- Do you know when the tipping point is approaching? How? Who knows and what processes are in place to inform and react?
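Knowing when the tipping point is approaching usually means comparing live load against the tested ceiling and alerting before the ceiling is reached. The following is a minimal sketch under stated assumptions: load is sampled periodically, a short rolling average smooths out spikes, and the 70% and 90% thresholds are hypothetical values that each team would set for itself. The class name `TippingPointMonitor` is illustrative, not from any monitoring product.

```python
from collections import deque


class TippingPointMonitor:
    """Tracks rolling average load against a tested ceiling."""

    def __init__(self, ceiling: float, warn_at: float = 0.7,
                 critical_at: float = 0.9, window: int = 5):
        self.ceiling = ceiling
        self.warn_at = warn_at          # assumed threshold: start watching closely
        self.critical_at = critical_at  # assumed threshold: throttle or add capacity
        self.samples = deque(maxlen=window)  # keeps only the most recent samples

    def record(self, load: float) -> str:
        """Add a load sample and return 'ok', 'warn', or 'critical'."""
        self.samples.append(load)
        utilization = (sum(self.samples) / len(self.samples)) / self.ceiling
        if utilization >= self.critical_at:
            return "critical"
        if utilization >= self.warn_at:
            return "warn"
        return "ok"


monitor = TippingPointMonitor(ceiling=1000, window=3)
levels = [monitor.record(load) for load in (500, 800, 900, 1100)]
print(levels)
```

The returned level is where the process questions come in: each level should map to a named owner and a rehearsed action, such as throttling demand or allocating more resources.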
Reporting – Executives sometimes ask during the event how things are going from a business perspective. Can you answer? [By the way, will your tablet and reporting infrastructure support all the execs checking the real-time sales numbers?]
- Channels – Retail Stores, Dealers, Internet/Web
- Store, State, Region, Division, National, International hourly reporting
- Sales Dashboard and Mobile Access (smartphones / tablets)
Change / Release Management:
- What has been introduced into production since you started / ended your performance testing, and at what risk?
If you have concerns about the ability to support Black Friday / Cyber Monday / Super Bowl / Mother’s Day / Valentine’s Day or just plain fantastic business demand, I hope you have enough calendar and budget runway to address them!