Business Continuity Archives

January 5, 2023

Clusters Do Not Replace a Proper Backup and Restore Strategy

I need to address a common misconception in this blog post. To paraphrase, this is what I have heard from many people over the past 20+ years:

If I have a Windows Server Failover Cluster/Availability Group/Failover Cluster Instance, I do not need backups.

The Rackspace Ransomware Incident

A few weeks ago, Joey D’Antoni (Blog | Twitter) wrote an article for Redmond Magazine about the recent Rackspace ransomware incident. What stuck out at me in the article, and the genesis of this blog post, was the following quote from Josh Prewitt, Rackspace’s Chief Product Officer:

“The way the environment is architected is it takes advantage of the native clustering that’s built into Exchange. We’ve got multiple copies of everything, and Exchange is going to naturally distribute that out to other servers within the cluster. And so everything would have had at least three copies, depending on the datacenter that was in.”

System Design for Availability and Resiliency

Redundancy is one of the keys to availability/resiliency from a technology standpoint. In that aspect of his statement, Prewitt is 100% correct. Having multiple copies of data is good. However, the whole solution must be properly architected end-to-end. The solution is built on quicksand if something is amiss underneath. At the very top layer, things may look and feel fine even when they are not. A good example is a set of virtual machines configured as an Always On Availability Group (AG). The AG may be functioning properly, but the person responsible for its administration may have no idea the infrastructure underneath is crap. I refer to this as “the illusion of availability”.

A good example of the illusion of availability is when all of your “redundant” systems (physical or virtual) are connected to the same storage array. That storage array is now a single point of failure. This exact scenario is something I’ve seen many times in my life as a consultant. I do not know how Exchange is architected at Rackspace, so I cannot comment on it, nor will I speculate.
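To make the shared storage array example concrete, here is a minimal Python sketch that flags any component every node depends on. The inventory, node names, and component names are all hypothetical; in practice this data would have to come from your own documentation or a CMDB, and nothing here reflects Rackspace's environment.

```python
from collections import defaultdict

# Hypothetical inventory: each cluster node mapped to the infrastructure it relies on.
NODE_DEPENDENCIES = {
    "sqlnode1": {"storage": "array-01", "hypervisor": "host-a", "switch": "sw-1"},
    "sqlnode2": {"storage": "array-01", "hypervisor": "host-b", "switch": "sw-2"},
    "sqlnode3": {"storage": "array-01", "hypervisor": "host-c", "switch": "sw-2"},
}


def find_single_points_of_failure(node_dependencies):
    """Return {(kind, component): [nodes]} for components that every node relies on."""
    users = defaultdict(set)
    for node, deps in node_dependencies.items():
        for kind, component in deps.items():
            users[(kind, component)].add(node)

    all_nodes = set(node_dependencies)
    # A component only counts as a shared single point of failure
    # when there is more than one node and all of them use it.
    return {
        key: sorted(nodes)
        for key, nodes in users.items()
        if len(all_nodes) > 1 and nodes == all_nodes
    }


if __name__ == "__main__":
    for (kind, component), nodes in find_single_points_of_failure(NODE_DEPENDENCIES).items():
        print(f"{kind} '{component}' is shared by all nodes: {', '.join(nodes)}")
```

With this sample inventory, the only component flagged is the storage array, even though the hypervisors and switches are spread out. The AG on top would look healthy right up until that array has a bad day.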

An aspect that people do not often link to availability is security. What happened to Rackspace is not unique. Ransomware has been infiltrating companies, governments (local or otherwise), and more with increasing frequency over the past few years. Ransomware is a topic for another time.

Not only does proper end-to-end system design matter, but understanding what you are implementing is also important.

Clustering Exchange

There are two forms of clustering in Windows Server: Network Load Balancing (NLB) and one for availability called a Windows Server Failover Cluster (WSFC). A WSFC is the underpinning of both of SQL Server’s Always On features, AGs and Failover Cluster Instances (FCIs), as well as Exchange’s database availability groups (DAGs). Do not confuse Exchange DAGs with SQL Server’s AGs or their variant, distributed AGs. Unfortunately, some call distributed AGs DAGs as well (please don’t do that).

Features work how they are designed. An AG, DAG, or FCI may not be able to fail over to another node depending on various factors. This means these features will not solve all your issues nor initiate world peace.

From the quote I used above, it seems like Prewitt may not have done a deep dive into what a DAG may or may not protect. I cannot tell you how many times I’ve had conversations with customers over the years where they made assumptions about how AGs and/or FCIs (and by association, WSFCs) worked and I had to set them straight. Were they happy? No. I cannot say this is 100% the case with the Rackspace incident, but it feels that way to me.

Always Plan For The Worst Case Scenario

Let me be clear: features like Exchange’s DAG or SQL Server’s AGs and FCIs are great. Implement one (or more) if it is the right fit for your availability/business continuity needs. Rackspace was not incorrect in deploying an Exchange DAG.

If the event is not catastrophic, an automatic or manual failover from a DAG, AG, or FCI will get you up and running quickly. If a catastrophic event occurs, all bets are off. At that point, the only thing that may save you is good ol’ backup and restore.

When using third party providers, check their terms of service (TOS). A TOS is a legal, not a technical, document. Make sure the TOS meets your company’s needs. For example, Rackspace’s Mail Hosting Terms documents the mail service’s service level agreement (SLA). They do make provisions if maintenance will take more than 20 minutes. However, the onus is on the customer to know the longer maintenance window is happening by checking Rackspace’s site.

Figure 1. Rackspace’s documented SLA for mail

The Bottom Line

First and foremost, tested backups and a plan to restore them are the fundamental building blocks of any availability (and ransomware) strategy. This is true no matter what cool feature you implemented in the platforms you use. Ensure the backups are stored elsewhere, too.

Be able to measure backup and restore success. Do you have documented recovery time objectives (RTOs) and recovery point objectives (RPOs)? If not, start the process to document them today. You may need different RTOs and RPOs depending on the type of downtime event. These goals need to balance things such as staffing, skills, the needs of the business, cost, and more.
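As a rough illustration of what measuring against RTOs and RPOs can look like, here is a small Python sketch that compares the results of a restore drill with documented objectives. The class names, event types, and every number are assumptions made up for the example, not recommendations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # maximum tolerable time to restore service
    rpo: timedelta  # maximum tolerable window of lost data


@dataclass
class RestoreTest:
    last_good_backup: datetime   # newest backup that restored cleanly
    simulated_failure: datetime  # moment the outage is assumed to begin
    restore_duration: timedelta  # measured time to bring the data back


def evaluate(test: RestoreTest, objectives: RecoveryObjectives) -> dict:
    """Compare one restore drill against one set of objectives."""
    data_loss = test.simulated_failure - test.last_good_backup
    return {
        "data_loss": data_loss,
        "meets_rpo": data_loss <= objectives.rpo,
        "meets_rto": test.restore_duration <= objectives.rto,
    }


# Hypothetical objectives, one set per type of downtime event.
OBJECTIVES = {
    "single_node_failure": RecoveryObjectives(rto=timedelta(minutes=15), rpo=timedelta(minutes=5)),
    "ransomware": RecoveryObjectives(rto=timedelta(hours=24), rpo=timedelta(hours=1)),
}

if __name__ == "__main__":
    drill = RestoreTest(
        last_good_backup=datetime(2023, 1, 5, 9, 30),
        simulated_failure=datetime(2023, 1, 5, 10, 0),
        restore_duration=timedelta(hours=6),
    )
    for event, objectives in OBJECTIVES.items():
        print(event, evaluate(drill, objectives))
```

The same drill can pass one set of objectives and fail another, which is why the objectives are keyed by event type. The restore duration should be a number you measured in a real test, not an estimate.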

Customers put trust in the companies whose services or products they consume. Rackspace hopefully learned from this incident and will have a better Plan B.

Finally, another aspect, which I discussed in my previous post covering the Southwest Airlines woes over the holidays in December 2022, is that Rackspace could potentially take hits to both its reputation and its bottom line. If customers had to migrate to other mail platforms because of this incident, will they come back?

Stay safe out there!

December 29, 2022

A Lesson for IT – Don’t Be Southwest Airlines

Whether you live in the United States or not, by now you have probably heard about what is going on (or not, as the case may be) with Southwest Airlines (SWA). I was away for the holiday weekend visiting a friend and on the way back, even the employees of the other airline I was flying were talking about it and how the systems had basically melted down. As I often say, you want to be reading the news, not making the headlines and certainly not drawing the attention and ire of the US Department of Transportation (see their Twitter posts starting around December 26; some examples: this thread, this thread, this Tweet, and this thread).

What Happened?

From the outside looking in as someone who is a business continuity expert, this seems like it was a perfect storm of bad things converging at the same time. In the past week they have cancelled over 5,000 flights, leaving passengers stranded, angry, and often in bad situations. Why are/were things so bad? Three quick points:

Mother Nature and the storms/weather that hit the US around the holiday and affected some of their busiest airports.
Not having a hub and spoke model for planes and people to be able to easily move the chess pieces around for things like weather events. This speaks to process and configuration as we think about it in IT.
Fragile, legacy IT systems still involved in day-to-day operations. In the case of SWA, there are systems that deal with flight and crew management. This problem is our good friend technical debt.

I feel bad for everyone involved – the customers affected, the employees who have to deal with the situation (especially the frontline ones who will feel the brunt of the customer wrath), and everyone in between.

Let me be clear: I don’t think SWA set out to ruin people’s travel around a holiday nor was it their goal to draw the attention of the US Government. This is the reality of business and IT – things happen, often at inconvenient times. People are affected by said event. Hilarity does not ensue.

Let’s Talk Technical Debt

You’re never a hero for proverbially saving $1 now when it will cost you $10 to deal with the problem later. Kicking the can down the road is a flawed, dangerous IT strategy. I’ve addressed tech debt and other related issues before (selected posts: “Technical Debt – The (Not So) Silent Crisis”, “Outages In An Increasingly Connected World”, “Security Is An Availability Problem”, and “Another Day, Another Outage”) so if you want to know the basics in more detail, read those.

SWA did upgrade some systems a few years back to give “the carrier more flexibility to improve the Customer Experience and enhance revenue performance.” Clearly the “Customer Experience” has been top notch over the past week. When you’re not flying and have to reimburse customers and figure things out, you LOSE money, not enhance revenue performance.

Availability goals should always be based in reality with real world data. How much does downtime cost the business – literally? What penalties – financial or otherwise – will be incurred? Does our solution mitigate those risks? It seems as if SWA either did not properly assess risk or, worse, did not care. If it ain’t broke, don’t fix it, right? Wrong. According to this CNN report, SWA underinvested in its operations. Basic communications – including phone systems – were not working. Communication is crucial when the excrement hits the fan.
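The downtime cost question can be turned into simple arithmetic. The Python sketch below is one hedged way to frame it: the function names, the cost model, and every figure are hypothetical, and it deliberately ignores harder-to-quantify losses such as reputation.

```python
def downtime_cost(hours_down, revenue_per_hour, labor_per_hour, penalties=0.0):
    """Direct cost of one outage: lost revenue plus recovery labor plus penalties."""
    return hours_down * (revenue_per_hour + labor_per_hour) + penalties


def mitigation_is_cheaper(outages_per_year, cost_per_outage, mitigation_per_year):
    """True when the expected yearly outage cost exceeds the yearly mitigation cost."""
    return outages_per_year * cost_per_outage > mitigation_per_year


if __name__ == "__main__":
    # Hypothetical figures: an 8-hour outage, $10,000/hour in lost revenue,
    # $1,500/hour in recovery labor, and $25,000 in contractual penalties.
    cost = downtime_cost(8, 10_000, 1_500, penalties=25_000)
    print(f"One outage costs roughly ${cost:,.0f}")

    # Assume one such outage every two years versus $40,000/year of mitigation.
    print("Mitigation pays for itself:", mitigation_is_cheaper(0.5, cost, 40_000))
```

Even a crude model like this forces the conversation onto real numbers instead of “if it ain’t broke, don’t fix it.”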

Andrew Watterson, SWA’s Chief Operating Officer, blamed the outdated scheduling software in a company call. The quotes from the call in the CNN article are telling.

I get that for large companies it’s hard to rip out existing systems, especially when you cannot tolerate much – if any – downtime. I spent the better part of the past 25+ years helping customers architect solutions (and will continue to do so at Pure) that perform well, are secure, and are resilient/highly available. Choices have consequences and customers need to make the right ones, especially when sunsetting older solutions that are very important. Always be looking forward.

How Do You Avoid Technical Debt?

I have worked with enough customers over the years to know that most people reading this blog have at least one legacy system hanging around. You know the one. It’s that system that if you look at it sideways, it acts up. That’s the one (or ones) you need a plan for sooner rather than later.

Being honest, tech debt is hard to avoid 100% of the time but you need to try. Be proactive, not reactive. Know when things like SQL Server, Windows Server, and other third party software go out of support. There are many nuances to dealing with technical debt, which also includes ensuring that all staff have training and their skills are modernized. Technical debt is a people issue, too.
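One proactive habit is keeping an inventory of systems with their end-of-support dates and reviewing it regularly. Here is a minimal Python sketch of that idea. The system names, product names, and dates are placeholders invented for the example; always take real dates from each vendor's published lifecycle information.

```python
from datetime import date

# Hypothetical inventory. Replace the products and dates with values
# from each vendor's published support lifecycle.
INVENTORY = [
    {"system": "crew-scheduling", "product": "ExampleDB 2012", "end_of_support": date(2022, 7, 12)},
    {"system": "billing", "product": "ExampleOS 2012", "end_of_support": date(2023, 10, 10)},
    {"system": "reporting", "product": "ExampleDB 2019", "end_of_support": date(2027, 1, 12)},
]


def support_report(inventory, today, warning_days=365):
    """Return (system, status, days_left) rows, most urgent first."""
    rows = []
    for item in inventory:
        days_left = (item["end_of_support"] - today).days
        if days_left < 0:
            status = "out of support"
        elif days_left <= warning_days:
            status = "plan the upgrade now"
        else:
            status = "supported"
        rows.append((item["system"], status, days_left))
    return sorted(rows, key=lambda row: row[2])


if __name__ == "__main__":
    for system, status, days_left in support_report(INVENTORY, date.today()):
        print(f"{system}: {status} ({days_left} days)")
```

The one-year warning window is an arbitrary default. Pick a lead time that matches how long it really takes your organization to budget, test, and migrate.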

Know your core functionality and what you need to achieve. Getting lost in whiz-bang, fancy features and analytics does not amount to a hill of beans if your company’s core goals are not met. In the case of SWA, they cannot move people from Point A to Point B. Let’s not even get into the potential hit to their reputation and bottom line that comes along with a failure of this magnitude.

Don’t become the next headline. Planning for obsolescence as soon as a system is brought online is really the exercise that needs to happen. If you do not bake obsolescence in as a feature from day one, you may be the next SWA or worse; events like this can take the business out permanently, too. Unemployment is not the goal.

What are your thoughts? Have you been in similar situations and if so, how did you get past the issue(s)?