Building Automated AGs or FCIs for Production with Azure-based Methods Is Not a Good Idea For Now
I was working on some things this past weekend for a few upcoming projects, and one of those involves Azure and automation. Anyone who knows me is that I will praise when necessary, and call out when something is not quite right. Microsoft’s Azure, Windows Server, and SQL Server teams earned my “What Were They Thinking?” badge.
What Is This About, Allan?
Microsoft has published a few solutions to automagically build AG solutions for you (there are none for FCIs right now) up in Azure using IaaS VMs. The one below has been around for some time and is easily found in the Portal.
Let me digress for a moment and say how this solution in my opinion is not quite kosher for a production deployment:
- AlwaysOn is not the feature name. Always On Availablity Groups, or just Availability Groups is the name of the feature. Always On has had a space now going on nearly five years. I may have a few blog posts about this somewhere 😉 (here is one example)
- No customer I know is going to build separate Active Directory Domain Services (AD DS) servers just for an AG; they’re going to have existing ones that they will use.
- The template only allows you to select Developer or Enterprise Editions of SQL Server, not Standard. Since this is a two-node only configuration, I’m not sure why this was not updated for SQL Server 2016 and later.
- You cannot choose what kind of load balancer thet gets created for the AG listener.
- No load balancer is created for the WSFC.
- Some regions now have Availability Zones (AZs) which is better than Availability Sets (AS). The template has not been updated to reflect that.
- Cloud witness! ’nuff said. Building a FSW here is totally valid, but this is an Azure solution. This was built pre-WIndows Server 2016 which is when cloud witness was introduced.
I ran the template and it took 1 hour, 1 minute, and 52 seconds to complete.
In theory, this particular template is an ok(ish) solution if you want to kick the tires on AGs in a non-production way and see what they are all about without purely from an AG perspective. However, this solution they put together is VERY old (I think about five years at this point) and outdated, not to mention people generally deploy AGs after they have databases. If Microsoft wants people to use this, they should update it to reflect a more modern architecture and have the ability to use things like Standard Edition, AZs, and cloud witness.
More recently, Microsoft released a few new things they blogged about the past few months: Automate Always On availability group deployments with SQL Virtual Machine resource provider from December 2018 and Simplify Always On availability group deployments on Azure VM with SQL VM CLI from February 2019.
The workflows for the last two links are a bit … odd. It’s just so much easier to create the WSFC in guest and it solves the major problem I’m about to describe below which prompted this post. If I’m already in the guest, outside of needing to do any load balancer stuff, why would I do stuff in Azure? It’s not really easier and you probably already have PowerShell, T-SQL, or other scripts to do most of this. Some of this feels like a solution looking for a problem that doesn’t really exist. Choice is good but …
Sound off in the comments if you agree or disagree. I’m curious to see what people think.
The Real Problem
I had a good look at the Desired State Configuration (DSC) module for Windows Server Failover Clusters (WSFCs) which is called xFailOverCluster. This is mostly the heart of the matter. The latest version as of this blog post is 1.12.0.0. Specifically, I was seeing what it could and could not do, and there is one major chunk missing from it: validation. The big Azure template I complain about above also does not run validation. Why is this a bad thing?
Look at Microsoft KB327518 “The Microsoft Support Policy for Clustered Configurations of SQL Server with Windows Server” . That links to KBx “2775067” The Microsoft support policy for Windows Server 2012 or Windows Server 2012 R2 failover clusters”. That KB also applies to Windows Server 2016 and 2019. Focus on this line:
“The fully configured failover cluster passes all required failover cluster validation tests. To validate a failover cluster, run the Validate a Configuration Wizard in the Failover Cluster Manager snap-in, or run the Windows PowerShell cmdlet Test-Cluster.”
What does this mean? To have a supported WSFC-based configuration (doesn’t matter what you are running on it – could be something non-SQL Server), you need to pass validation. xFailOverCluster does not allow this to be run. You can create the WSFC, you just can’t validate it. The point from a support view is that the WSFC has to be vetted before you create it. Could you run it after? Sure, but you still have no proof you had a valid configuration to start with which is what matters. This is a crucial step for all AGs and FCIs, especially since AGs do not check this whereas the installation process for FCIs does.
If you look at MSFT_xCluster, you’ll see what I am saying is true. It builds the WSFC without a whiff of Test-Cluster. To be fair, this can be done in non-Azure environments, too, but Microsoft givs you warnings not to do that for good reason. I understand why Microsoft did it this way. There is currently no tool, parser, or cmdlet to examine the output of Test-Cluster results. This goes back to why building WSFCs is *very* hard to automate.
Knowing this, I would change all of this to build the AG (or FCI) VMs with the Failover Clustering feature enabled, then validate and build the WSFC inside similar to what is in the workflow for building the AG on your own. So it’s still a mix of automation and some minor human intervention.
MSFT_xCluster also has another issue in my mind in parsing the code: it seems like it only handles Active Directory Domain Services (AD DS)-based WSFCs. If you wanted to build a Workgroup Cluster variant of a WSFC that does not require AD DS, you are out of luck. This is acknowledged in that MS blog post from February I link above, and at least they call it out. Kudos.
We only support AD domain joined Windows Failover Cluster definition. The FQDN is a must have property and all AG replicas should already be joined to the AD domain before they are added to the cluster.
All of this feels a bit like a case of fire, ready, aim, or more specifically – deploy, understand supportability, automate.
Can You Still Automate AG Deployments Using What MS Provided?
If you are looking at non-production environments such as development and QA, use anything and everything I criticize above since supportability generally is not an issue there. You’re not deploying production systems in the truest sense (i.e. end user/customer facing), but keep in mind they are production systems for your developers and testers.
If you build the base Windows Server IaaS VMs and get through validation and want to automate beyond that, you’d have a fully supported solution if building the WSFC and AG portions are fully automated.
That said, if you know what you’re doing, building all of this yourself won’t take much more time and may even take less time – especially the WSFC piece. You can automate it yourself in different ways. Building a WSFC really does just work these days when it’s done right (kudos to the Windows Server dev team and the HA PMs). Do what works for you; if what Microsoft provides works for you, go nuts. Just know there’s more than one way to approach this problem.
The Bottom Line and What Microsoft Needs to Do
Automation has come a long way but we’re not there fully there yet for clustered and supported configurations of SQL Server running on Windows Server up in Azure or any of the public clouds for that matter. Here’s what needs to happen:
- Fix things so that Test-Cluster is run and the output is checked before building the WSFC and the AG.
- Should Microsoft deem it acceptable to support these automated methods already out there for production builds, they need to say that somewhere other than a blog post officially AND update KB2775067 accordingly that the validation requirement is waived. Otherwise there will be conflicting information out there which is bad for everyone including Microsoft. Microsoft needs to stop that nonsense right in its tracks.
- Update any templates and Wizards accordingly.
When and if these things happen, by all means, automate away in Azure even for production!
Need help with your availability solutions, especially if you are looking at any of the public clouds? Contact us today and we can kickstart your projects into high gear.