Is an Outage Just an Outage?
I hear it all the time from customers: if they had an unlimited budget like “the big guys”, they’d never have any issues or outages. Bzzzt! Thanks for playing – there will be some consolation prizes near the door for you.
First and foremost, even “the big guys” have budgets and other constraints. But that’s only part of the problem. Let me give you a great example. If you haven’t seen, United Airlines has had some, er, minor computer issues the past two days that have caused both passengers and the airline a lot of agita. We’re not talking about a mom-and-pop operation here; this is a major airline with multiple hubs that handles tens of thousands of passengers each day. The Flyertalk thread on this incident is a fun read. I just saw this:
Just a FYI for the anti-SHARES hysteria crowd. I just received word via an internal communication that the cause of the outage wasn’t SHARES, nor did it have anything to do with SHARES.
The cause was communication equipment in the data center that failed. This was a hardware problem, not software.
Now having said that, I am curious to know why there were no redundant systems in the data center or a quick (ie instant) method of re-routing network traffic via a different data center.
United has also said that it was a hardware problem. I loved this:
The problem was a piece of hardware in a data center that failed to communicate
properly with other computer equipment, said Megan McCarthy, a spokeswoman for
United Continental Holdings Inc. A backup system failed to take over for the
*cough* Wonder if they ever tested it? *cough* I’ll say it again for the 1,000,000,000,000,000th time – testing is important. Need I harp on this any more?
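To make the testing point concrete, here is a minimal sketch of a failover drill in Python. Everything in it is hypothetical: the `Node` class, the `route` function, and the `failover_drill` helper are illustrative stand-ins and say nothing about how United's systems actually work. The idea is simply to take the primary down on purpose and confirm the backup answers, rather than finding out during a real outage.

```python
class Node:
    """A stand-in for one system (hypothetical, for illustration only)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def route(request, nodes):
    """Try each node in order and return the first successful response."""
    for node in nodes:
        try:
            return node.handle(request)
        except ConnectionError:
            continue  # this node is down, so try the next one
    raise RuntimeError("total outage: no node could take the request")


def failover_drill(primary, backup):
    """Deliberately fail the primary and report whether the backup took over."""
    primary.healthy = False
    try:
        response = route("check-in", [primary, backup])
        return response.startswith(backup.name)
    except RuntimeError:
        return False  # the backup did not take over
    finally:
        primary.healthy = True  # restore the primary once the drill is done
```

Running `failover_drill(Node("primary"), Node("backup"))` on a schedule, and treating a `False` result as an incident in its own right, is the kind of habit that catches a dead backup before customers do.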
Hardware or software, the rest of that linked article is telling:
United has been struggling with technology problems since March, when it
switched to a passenger information computer system that was previously used by
Continental. United and Continental merged in 2010. That system, called
“Shares,” has needed extensive reworking since March to make it easier for
workers to use.
That’s more than just a hardware problem that happened to occur. It’s a people, process, and technology issue. When two companies merge, problems like this occur all the time. One company's software gets picked, and the other company has to conform to it. Good example: Company A uses Lotus Notes (and yes, I’ve worked with customers still using Notes). Company B uses Microsoft Exchange. Company A buys Company B. Guess what the new corporate e-mail system is? Company B’s employees not only need to be trained and have Notes rolled out to them (a daunting set of tasks all by their lonesome), but all of their mail needs to be migrated to Notes, and so on. That process is a disaster waiting to happen if done wrong. Unproductive and disgruntled workers are a side effect of bad decisions and bad process.
When CO and UA merged, the decision was evidently made to use CO’s “stuff”. That’s all well and good. But when the fallout hits not only your internal folks but also passengers, with delayed, cancelled, and otherwise affected flights and staff doing things like handwriting boarding passes, you’ve got serious, serious issues. I wonder if handwriting boarding passes was in their D/R plans?
I also talk about the cost of downtime, and clearly there was some here. On top of the normal stuff we all know about, an airline sometimes has to compensate passengers with things like hotel stays and meals, none of which is free to the airline. Add to that the disruption for folks who miss connections, with downstream effects of their own (example: missing the departure of a cruise), and it’s a nightmare all around.
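For a rough feel of how these costs stack up, here is a back-of-the-envelope sketch in Python. The function name, its parameters, and every number in the example are made up for illustration; none of them are United's actual figures, and a real model would have many more line items (rebooking, reputational damage, and so on).

```python
def downtime_cost(hours_down, revenue_per_hour, passengers_affected,
                  compensation_per_passenger, staff_overtime=0):
    """Rough total cost of an outage (hypothetical model, illustrative only)."""
    lost_revenue = hours_down * revenue_per_hour
    # Hotels, meals, and vouchers handed out to stranded passengers.
    compensation = passengers_affected * compensation_per_passenger
    return lost_revenue + compensation + staff_overtime


# Invented inputs: a 2-hour outage, 100,000 per hour in lost revenue,
# 5,000 passengers compensated at 150 each, plus 20,000 in overtime.
total = downtime_cost(2, 100_000, 5_000, 150, staff_overtime=20_000)
print(total)
```

Even with invented numbers, notice that the compensation line can dwarf the lost revenue, which is exactly the kind of cost that gets left out when people only count the time the system was down.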
One of the big guys, United in this case, has more money than the mom-and-pop business down the street, but they struggle just like everyone else. Let this be a cautionary tale about availability and disaster recovery: there’s more to it than just technology.