Clusters Do Not Replace a Proper Backup and Restore Strategy
I need to address a common misconception in this blog post. To paraphrase, this is what I have heard from many for the past 20+ years:
If I have a Windows Server Failover Cluster/Availability Group/Failover Cluster Instance, I do not need backups.
The Rackspace Ransomware Incident
A few weeks ago, Joey D’Antoni (Blog | Twitter) wrote an article for Redmond Magazine about the recent Rackspace ransomware incident. The bit from the article stuck out at me and the genesis of this blog post was the following quote from Josh Prewitt, Rackspace’s Chief Product Officer:
“The way the environment is architected is it takes advantage of the native clustering that’s built into Exchange. We’ve got multiple copies of everything, and Exchange is going to naturally distribute that out to other servers within the cluster. And so everything would have had at least three copies, depending on the datacenter that was in.”
System Design for Availability and Resiliency
Redundancy is one of the keys to availability/resiliency from a technology standpoint. In that aspect of his statement, Prewitt is 100% correct. Multiple copies of data is good. However, the whole solution must be properly architected end-to-end. The solution is based on quicksand if something is amiss underneath. Inside the very top layer, things may look and feel fine even when they are not. A good example is a set of virtual machines that inside are configured as an Always On Availabilty Group (AG). The AG may be functioning properly but the person responsible for its administration may have no idea the infrastructure underneath is crap. I refer to this as “the illusion of availability”.
A good example of the illusion of availability is that all of your “redundant” systems (physical or virtual) are connected to the same storage array. That storage array is now a single point of failure. This exact scenario is something I’ve seen many times in my life as a consultant. I do not know how Exchange is architected at Rackspace so I cannot comment on how it is architected nor will I speculate.
An aspect that people do not often link to availability is security. What happened to Rackspace is not unique. Ransomware is infiltrating companies, governments (local or otherwise), and more on an increasingly frequent basis for the past few years. Ransomware is a topic for another time.
Not only does proper end-to-end system design matter, but also understanding what you are implementing is important.
Clustering Exchange
There are two forms of clustering in Windows Server: network load balancer (NLB) and one for availablity called a Windows Server Failover Cluster (WSFC). A WSFC is the underpinning of both of SQL Server’s Always On features, AGs and Failover Cluster Instances (FCIs), as well as Exchange’s database availability groups (DAGs). Do not confuse Exchange DAGs with AGs or its variant, distributed AGs. Unfortunately, some call distributed AGs a DAG as well (please don’t do that).
Features work how they are designed. An AG, DAG, or FCI may not be able to fail over to another node depending on various factors. This means these features will not solve all your issues nor initiate world peace.
From the quote I used above, it seems like Prewitt may not have done a deep dive into what a DAG may or may not protect. I cannot tell you how many times I’ve had conversations with customers over the years where they assumed how AGs and/or FCIs (and by association, WSFCs) worked and I had to set them straight. Were they happy? No. I cannot say this is 100% the case with the Rackspace incident, but it feels that way to me.
Always Plan For The Worst Case Scenario
Let me be clear: features like Exchange’s DAG or SQL Server’s AGs and FCIs are great. Implement one (or more) if it is the right fit for your availability/business continuity needs. Rackspace was not incorrect in deploying an Exchange DAG.
If the event is not catastrophic, an automatic or manual failover from a DAG, AG, or FCI will get you up and running quickly. If a catatastrophic event occurs, all bets are off. At that point, the only thing that may save you is good ol’ backup and restore.
When using third party providers, check their terms of service (TOS). A TOS is a legal, not a technical, document. Make sure the TOS meets your company’s needs. For example, Rackspace’s Mail Hosting Terms documents the mail service’s service level agreement (SLA). They do make provisions if maintenance will take more than 20 minutes. However, the onus is on the customer to know the longer maintenance window is happening by checking Rackspace’s site.
The Bottom Line
First and foremost, tested backups and a plan to restore them is the fundamental building block of any availability (and ransomware) strategy. This is true no matter what cool feature you implemented in the platforms you use. Ensure the backups are stored elsewhere, too.
Be able to measure backup and restore success. Do you have documented recovery time objectives (RTOs) and recovery point objectives (RPOs)? If not, start the process to document them today. You may need different RTOs and RPOs depending on th type of downtime event. These goals need to balance things such as staffing, skills, the needs of the business, cost, and more.
Customers put trust in companies where they are consuming services or products. Rackspace hopefully learned from this incident and will have a better Plan B.
Finally, Another aspect which I discussed in my previous post which covered the Southwest Airlines woes over the holidays in December 2022 is that Rackspace could potentially take hits here to both reputation and their bottom line. If customers had to migrate to other mail platforms because of this incident, will they come back?
Stay safe out there!