downtime Archives

December 29, 2022

A Lesson for IT – Don’t Be Southwest Airlines

Whether you live in the United States or not, by now you have probably heard about what is going on (or not, as the case may be) with Southwest Airlines (SWA). I was away for the holiday weekend visiting a friend and on the way back, even the employees of the other airline I was flying were talking about it and how the systems had basically melted down. As I say often, you want to be reading the news, not making the headlines and certainly not drawing the attention and ire of the US Department of Transportation (See their Twitter posts starting around December 26. Some examples: this thread, this thread, this Tweet, and this thread.).

What Happened?

From the outside looking in as someone who is a business continuity expert, this seems like it was a perfect storm of bad things converging at the same time. In the past week they have cancelled over 5,000 flights leaving passengers stranded, angry, and often in bad scenarios. Why are/were things so bad? Three quick points:

Mother Nature and the storms/weather that hit the US around the holiday and affected some of their busiest airports.
Not having a hub and spoke model for planes and people to be able to easily move the chess pieces around for things like weather events. This speaks to process and configuration as we think about it in IT.
Fragile, legacy IT systems still involved in day-to-day operations. In the case of SWA, there are systems that deals with flight and crew management. This problem is our good friend technical debt.

I feel bad for everyone involved – the customers affected, the employees who have to deal with the situation (especially the frontline ones who will feel the brunt of the customer wrath), and everyone in between.

Let me be clear: I don’t think SWA sought out to ruin people’s travel around a holiday nor was it their goal to draw the attention of the US Government. This is the reality of business and IT – things happen, often at inconvenient times. People are affected by said event. Hilarity does not ensue.

Let’s Talk Technical Debt

You’re never a hero proverbially saving $1 now when it will cost you $10 to deal with whatever that problem is later. Kicking the can down the road is a flawed, dangerous IT strategy. I’ve addressed tech debt and other related issues before (selected posts: “Technical Debt – The (Not So) Silent Crisis“, “Outages In An Increasingly Connected World“, “Security Is An Availability Problem“, and “Another Day, Another Outage“) so if you want to know the basics in more detail, read those.

SWA did upgrade some systems a few years back to give “the carrier more flexibility to improve the Customer Experience and enhance revenue performance.” Clearly the “Customer Experience” has been top notch over the past week. When you’re not flying and have to reimburse customers and figure things out, you LOSE money, not enhance revenue performance.

Availability goals should always be based in reality with real world data. How much does downtime cost the business – literally? What penalties – financial or otherwise – will be incurred? Does our solution mitigate those risks? It seems as if SWA either did not properly assess risk or worse, care. If it ain’t broke, don’t fix it, right? Wrong. According to this CNN report, SWA underinvested in its operations. Basic communication – including phone systems – were not working. Communication is crucial when the excrement hits the fan.

Andrew Watterson, SWA’s Chief Operating Officer, blamed the outdated scheduling software in a company call. The quotes from the call in the CNN article are telling.

I get that for large companies it’s hard to rip out existing systems, especially when you cannot tolerate much – if any – downtime. I spent the better part of the past 25+ years helping customers architect solutions (and will continue to do so at Pure) that perform well, are secure, and resilient/highly available. Choices have consequences and customers need to make the right ones especially when sunsetting older solutions that are very important. Always be looking forward.

How Do You Avoid Technical Debt?

I have worked with enough customers over the years to know that most people reading this blog have at least one legacy system hanging around. You know the one. It’s that system that if you look at it sideways, it acts up. That’s the one (or ones) you need a plan for sooner rather than later.

Being honest, tech debt is hard to avoid 100% of the time but you need to try. Be proactive, not reactive. Know when things like SQL Server, Windows Server, and other third party software are out of support. There are many nuances to dealing with technical debt which also includes ensuring that all staff has training and their skills are modernized. Technical debt is a people issue, too.

Know your core functionality and what you need to achieve. Getting lost in whiz-bang, fancy features and analytics does not mean a hill of beans if your company’s core goals are not met. In the case of SWA, they cannot move people from Point A to Point B. Let’s not even get into the potential hit to their reputation and bottom line that comes along with a failure of this magnitude.

Don’t become the next headline. Planning for obsolescence as soon as a system is brought online is really the exercise that needs to happen. If you do not bake obsolescence in as a feature from day one, you may be the next SWA or worse; events like this can take the business out permanently, too. Unemployment is not the goal.

What are your thoughts? Have you been in similar situations and if so, how did you get past the issue(s)?

February 8, 2019

Another Day, Another Outage

Disaster recovery is in the news this week for all the wrong reasons.

Wells Fargo

Stop me if you’ve heard this story before. A major company – in this case a financial institution – is having a technical outage for not only the first, but the second time in less than a week. Assuming you’re not working for Wells Fargo or one of their customers, chances are had a much better time these past few days. These are their recent tweets as of sometime in the afternoon Thursday, February 7. This post went live on Friday morning the 8th and Wells Fargo is still down – basically two days of downtime.

Figure 1. Wells Fargo acknowledging the problem

Ouch. The last tweet is not dissimilar to a British Airways outage back in 2017. This tweet from Wells Fargo claims it is not a cybersecurity attack. Right now that’s still not a lot of comfort to anyone affected.

Figure 2. It’s not a hack!

Their customers are not enjoying the outage, either. Three examples from Twitter:

Figure 3. Customer issue due to the outage

Figure 4. Day two …

Figure 5. More impact

Imagine the fallout: direct deposit from companies may not work, which means people do not get paid. People can’t access their money and do things like pay bills.For some, this could have a lifelong impact on things like credit ratings if you miss a loan, mortgage, or credit card payment. There is not just the business side of this incident.

There were of course, snarky replies about charging Wells Fargo for fees. I wanted to call out an honest-to-goodness impact to someone along with the possible longer term fallout to Wells Fargo themselves. I feel for the person who tweeted, but the reality is Chase or Citi could have an outage for some reason, too. I don’t think anyone is 100% infallible. I have no knowledge of what those other finanical institutions have in place to prevent an outage.

I’m doing what I do now largely because I lived through a series of outages similar to what Wells Fargo is most likely experiencing this week. Those outages happened over the course of about three months. They are painful for all involved. I worked many overnight shifts, a few 24+ hour days … well, you get the picture. I learned a lot during that timeframe, and one of the biggest lessons learned is that not only do you need to test your plans, but you have to be proactive by building disaster recovery into your solutions from day one. All companies, whether you are massive like Wells Fargo or a small shop has the same issues. The major difference is economy of scale.

Everyone has to answer this one question: how much does downtime cost your business – per minute, hour, day, week? That will guide your solution. Two days in a row of a bank being down is … costly.

There may be another impact, too – can Wells Fargo systems handle the load that will happen when the systems are online again? I bet it will be like a massive 9AM test. Stay tuned!

You don’t want to have a week like Wells Fargo. Take your local availablity as well as your disaster recovery strategy seriously. I’ve always found the following to be true: most do not want do put in place proper diaster recovey until their first major outage. Unfortunately at that point, it’s too late. Once the dust settles, they’ll suddenly buy into the religion of needing disaster recovery. Let me be clear – I do not know what the situation is over at Wells Fargo; zero inside information here. Did they have redundant data centers and the failover did not work? Were some systems redudnant but not others? Were there indications long before the downtime event that they missed? There are more questions than answers, and I’m sure we’ll find out in good time what happened. The truth always comes out.

A Different Kind of Mess – Quadriga

This week also saw a very different problem as it relates to finance: the death of Quadriga CEO Gerald Cotten. Why is his passing away impactful? When he died, apparently he was the only one who knew or had the password for the cryptocurrency vault. There are other alleged issues I won’t get into, but with his laptop encrypted (apparently his wife tried to have it cracked), literally no one whose money was in the vault can get it. The amount stored in there I’ve seen in various stories has been different, but it is well north of $100 million. The lesson learned here is that someone else always needs to know how to access systems and where keys are. Stuff happens – including death. There are real world impacts that can happen when systems cannot be accessed.

Watch Your Licenses

Since we are talking about outages, the Register published a story today about how a system would not come up after routine maintenance due to the software license expiring. I have been through this with a customer. We were in the middle of a data center (or centre, for you non-US folks) migration. We had to reboot a system and SQL Server would not come up. I looked in the SQL Server log. Lo and behold, someone had installed Evaluation Edition and never converted it to a real life. It also meant the system was never patched and never rebooted for a few years! Needless to say, there was no joy in mudville. That was a very different kind of outage.

The Bottom Line

If you do not want to be another disaster recovery statistic and prevent things like the above from happening, contact SQLHA today to figure out where you are and where you need to be.

What Happened?

Let’s Talk Technical Debt

How Do You Avoid Technical Debt?

Popular Posts

Categories