Blog

Building Automated AGs or FCIs for Production with Azure-based Methods Is Not a Good Idea For Now

By: on February 11, 2019 in Always On, AlwaysOn, Automation, Availability Groups, Azure, FCI, SQL Server, Template, Windows Server Failover Cluster | 1 Comment

I was working on some things this past weekend for a few upcoming projects, and one of those involves Azure and automation. Anyone who knows me is that I will praise when necessary, and call out when something is not quite right. Microsoft’s Azure, Windows Server, and SQL Server teams earned my “What Were They Thinking?” badge.

What Is This About, Allan?

Microsoft has published a few solutions to automagically build AG solutions for you (there are none for FCIs right now) up in Azure using IaaS VMs. The one below has been around for some time and is easily found in the Portal.

Figure 1. Azure Template for Creating a Full AG Solution

Let me digress for a moment and say how this solution in my opinion is not quite kosher for a production deployment:

  • AlwaysOn is not the feature name. Always On Availablity Groups, or just Availability Groups is the name of the feature. Always On has had a space now going on nearly five years. I may have a few blog posts about this somewhere 😉 (here is one example)
  • No customer I know is going to build separate Active Directory Domain Services (AD DS) servers just for an AG; they’re going to have existing ones that they will use.
  • The template only allows you to select Developer or Enterprise Editions of SQL Server, not Standard. Since this is a two-node only configuration, I’m not sure why this was not updated for SQL Server 2016 and later.
  • You cannot choose what kind of load balancer thet gets created for the AG listener.
  • No load balancer is created for the WSFC.
  • Some regions now have Availability Zones (AZs) which is better than Availability Sets (AS). The template has not been updated to reflect that.
  • Cloud witness! ’nuff said. Building a FSW here is totally valid, but this is an Azure solution. This was built pre-WIndows Server 2016 which is when cloud witness was introduced.

I ran the template and it took 1 hour, 1 minute, and 52 seconds to complete.

In theory, this particular template is an ok(ish) solution if you want to kick the tires on AGs in a non-production way and see what they are all about without purely from an AG perspective. However, this solution they put together is VERY old (I think about five years at this point) and outdated, not to mention people generally deploy AGs after they have databases. If Microsoft wants people to use this, they should update it to reflect a more modern architecture and have the ability to use things like Standard Edition, AZs, and cloud witness.

More recently, Microsoft released a few new things they blogged about the past few months: Automate Always On availability group deployments with SQL Virtual Machine resource provider from December 2018 and Simplify Always On availability group deployments on Azure VM with SQL VM CLI from February 2019.

The workflows for the last two links are a bit … odd. It’s just so much easier to create the WSFC in guest and it solves the major problem I’m about to describe below which prompted this post. If I’m already in the guest, outside of needing to do any load balancer stuff, why would I do stuff in Azure? It’s not really easier and you probably already have PowerShell, T-SQL, or other scripts to do most of this. Some of this feels like a solution looking for a problem that doesn’t really exist. Choice is good but …

Sound off in the comments if you agree or disagree. I’m curious to see what people think.

The Real Problem

I had a good look at the Desired State Configuration (DSC) module for Windows Server Failover Clusters (WSFCs) which is called xFailOverCluster. This is mostly the heart of the matter. The latest version as of this blog post is 1.12.0.0. Specifically, I was seeing what it could and could not do, and there is one major chunk missing from it: validation. The big Azure template I complain about above also does not run validation. Why is this a bad thing?

Look at Microsoft KB327518 “The Microsoft Support Policy for Clustered Configurations of SQL Server with Windows Server” . That links to KBx “2775067” The Microsoft support policy for Windows Server 2012 or Windows Server 2012 R2 failover clusters”. That KB also applies to Windows Server 2016 and 2019. Focus on this line:

“The fully configured failover cluster passes all required failover cluster validation tests. To validate a failover cluster, run the Validate a Configuration Wizard in the Failover Cluster Manager snap-in, or run the Windows PowerShell cmdlet Test-Cluster.”

What does this mean? To have a supported WSFC-based configuration (doesn’t matter what you are running on it – could be something non-SQL Server), you need to pass validation. xFailOverCluster does not allow this to be run. You can create the WSFC, you just can’t validate it. The point from a support view is that the WSFC has to be vetted before you create it. Could you run it after? Sure, but you still have no proof you had a valid configuration to start with which is what matters. This is a crucial step for all AGs and FCIs, especially since AGs do not check this whereas the installation process for FCIs does.

If you look at MSFT_xCluster, you’ll see what I am saying is true. It builds the WSFC without a whiff of Test-Cluster. To be fair, this can be done in non-Azure environments, too, but Microsoft givs you warnings not to do that for good reason. I understand why Microsoft did it this way. There is currently no tool, parser, or cmdlet to examine the output of Test-Cluster results. This goes back to why building WSFCs is *very* hard to automate.

Knowing this, I would change all of this to build the AG (or FCI) VMs with the Failover Clustering feature enabled, then validate and build the WSFC inside similar to what is in the workflow for building the AG on your own. So it’s still a mix of automation and some minor human intervention.

MSFT_xCluster also has another issue in my mind in parsing the code: it seems like it only handles Active Directory Domain Services (AD DS)-based WSFCs. If you wanted to build a Workgroup Cluster variant of a WSFC that does not require AD DS, you are out of luck. This is acknowledged in that MS blog post from February I link above, and at least they call it out. Kudos.

We only support AD domain joined Windows Failover Cluster definition. The FQDN is a must have property and all AG replicas should already be joined to the AD domain before they are added to the cluster.

All of this feels a bit like a case of fire, ready, aim, or more specifically – deploy, understand supportability, automate.

Can You Still Automate AG Deployments Using What MS Provided?

If you are looking at non-production environments such as development and QA, use anything and everything I criticize above since supportability generally is not an issue there. You’re not deploying production systems in the truest sense (i.e. end user/customer facing), but keep in mind they are production systems for your developers and testers.

If you build the base Windows Server IaaS VMs and get through validation and want to automate beyond that, you’d have a fully supported solution if building the WSFC and AG portions are fully automated.

That said, if you know what you’re doing, building all of this yourself won’t take much more time and may even take less time – especially the WSFC piece. You can automate it yourself in different ways. Building a WSFC really does just work these days when it’s done right (kudos to the Windows Server dev team and the HA PMs). Do what works for you; if what Microsoft provides works for you, go nuts. Just know there’s more than one way to approach this problem.

The Bottom Line and What Microsoft Needs to Do

Automation has come a long way but we’re not there fully there yet for clustered and supported configurations of SQL Server running on Windows Server up in Azure or any of the public clouds for that matter. Here’s what needs to happen:

  • Fix things so that Test-Cluster is run and the output is checked before building the WSFC and the AG.
  • Should Microsoft deem it acceptable to support these automated methods already out there for production builds, they need to say that somewhere other than a blog post officially AND update KB2775067 accordingly that the validation requirement is waived. Otherwise there will be conflicting information out there which is bad for everyone including Microsoft. Microsoft needs to stop that nonsense right in its tracks.
  • Update any templates and Wizards accordingly.

When and if these things happen, by all means, automate away in Azure even for production!

Need help with your availability solutions, especially if you are looking at any of the public clouds? Contact us today and we can kickstart your projects into high gear.

Another Day, Another Outage

By: on February 8, 2019 in Disaster Recovery, Downtime, Outage | No Comments

Disaster recovery is in the news this week for all the wrong reasons.

Wells Fargo

Stop me if you’ve heard this story before. A major company – in this case a financial institution – is having a technical outage for not only the first, but the second time in less than a week. Assuming you’re not working for Wells Fargo or one of their customers, chances are had a much better time these past few days. These are their recent tweets as of sometime in the afternoon Thursday, February 7. This post went live on Friday morning the 8th and Wells Fargo is still down – basically two days of downtime.

Figure 1. Wells Fargo acknowledging the problem

Ouch. The last tweet is not dissimilar to a British Airways outage back in 2017. This tweet from Wells Fargo claims it is not a cybersecurity attack. Right now that’s still not a lot of comfort to anyone affected.

Figure 2. It’s not a hack!

Their customers are not enjoying the outage, either. Three examples from Twitter:

 

Figure 3. Customer issue due to the outage

 

 

Figure 4. Day two …

 

Figure 5. More impact

Imagine the fallout: direct deposit from companies may not work, which means people do not get paid. People can’t access their money and do things like pay bills.For some, this could have a lifelong impact on things like credit ratings if you miss a loan, mortgage, or credit card payment. There is not just the business side of this incident.

There were of course, snarky replies about charging Wells Fargo for fees. I wanted to call out an honest-to-goodness impact to someone along with the possible longer term fallout to Wells Fargo themselves. I feel for the person who tweeted, but the reality is Chase or Citi could have an outage for some reason, too. I don’t think anyone is 100% infallible. I have no knowledge of what those other finanical institutions have in place to prevent an outage.

I’m doing what I do now largely because I lived through a series of outages similar to what Wells Fargo is most likely experiencing this week. Those outages happened over the course of about three months. They are painful for all involved. I worked many overnight shifts, a few 24+ hour days … well, you get the picture. I learned a lot during that timeframe, and one of the biggest lessons learned is that not only do you need to test your plans, but you have to be proactive by building disaster recovery into your solutions from day one. All companies, whether you are massive like Wells Fargo or a small shop has the same issues. The major difference is economy of scale.

Everyone has to answer this one question: how much does downtime cost your business – per minute, hour, day, week? That will guide your solution. Two days in a row of a bank being down is … costly.

There may be another impact, too – can Wells Fargo systems handle the load that will happen when the systems are online again? I bet it will be like a massive 9AM test. Stay tuned!

You don’t want to have a week like Wells Fargo. Take your local availablity as well as your disaster recovery strategy seriously. I’ve always found the following to be true: most do not want do put in place proper diaster recovey until their first major outage. Unfortunately at that point, it’s too late. Once the dust settles, they’ll suddenly buy into the religion of needing disaster recovery. Let me be clear – I do not know what the situation is over at Wells Fargo; zero inside information here. Did they have redundant data centers and the failover did not work? Were some systems redudnant but not others? Were there indications long before the downtime event that they missed? There are more questions than answers, and I’m sure we’ll find out in good time what happened. The truth always comes out.

A Different Kind of Mess – Quadriga

This week also saw a very different problem as it relates to finance: the death of Quadriga CEO Gerald Cotten. Why is his passing away impactful? When he died, apparently he was the only one who knew or had the password for the cryptocurrency vault. There are other alleged issues I won’t get into, but with his laptop encrypted (apparently his wife tried to have it cracked), literally no one whose money was in the vault can get it. The amount stored in there I’ve seen in various stories has been different, but it is well north of $100 million. The lesson learned here is that someone else always needs to know how to access systems and where keys are. Stuff happens – including death. There are real world impacts that can happen when systems cannot be accessed.

Watch Your Licenses

Since we are talking about outages, the Register published a story today about how a system would not come up after routine maintenance due to the software license expiring. I have been through this with a customer. We were in the middle of a data center (or centre, for you non-US folks) migration. We had to reboot a system and SQL Server would not come up. I looked in the SQL Server log. Lo and behold, someone had installed Evaluation Edition and never converted it to a real life. It also meant the system was never patched and never rebooted for a few years! Needless to say, there was no joy in mudville. That was a very different kind of outage.

The Bottom Line

If you do not want to be another disaster recovery statistic and prevent things like the above from happening, contact SQLHA today to figure out where you are and where you need to be.

 

SQLBits 2019 Is Only a Few Weeks Away

By: on February 7, 2019 in Conference, SQLbits, Training | No Comments

Hard to believe that SQLBits 2019 is only a few weeks away. I’m looking forward to speaking there again. It’s always an honor to be selected and Bits is one of my favorite conferences to attend if I can make it. This year, it’s in Manchester which is somewhere in the UK I’ve yet to visit, so I’m excited about that as well.

I’m currently finalizing the content and the lab for the Training Day I will be delivering on Thursday, February 28 – Modern SQL Server Availability Architectures. Hopefully the venue can support the lab, so we’ll see. That aspect is completely beyond my control, but what I have cooked up should hopefully be fun if we get to do it. You’ll need to bring your own laptop and make sure you run the test link that is linked in the description. Last I checked, seats were filling up quickly, so don’t miss out!

I’ll also be doing a session on Friday, March 1 –  Common Troubleshooting Techniques for AGs and FCIs at 14:25 (2:25 PM for those of you on my side of the pond).

If you haven’t registered already, what are you waiting for? If you have, see you there. Come up and say hello!

SQLHAU Update – More Online Dates, New Class Announcement, and Chicago Classes

By: on February 4, 2019 in SQLHAU, Training | No Comments

Happy post-Super Bowl Monday, everyone. Almost time for pitchers and catchers!

New Online Classes

Back in November, I announced our first set of live online training and classes which covered through June of this year. I’m pleased to announce the dates for the second half of 2019. These classes will be live and instructor-led. Our fully demand courses will debut later this year. The two new classes and dates are as follows:

Chicago Classes

In August, I’ll be heading back to the Microsoft Technology Center in Chicago to deliver two classes:

I’m particularly excited about Modernizing Your SQL Server Infrastructure Boot Camp. It’s something that has been on my mind for awhile. It goes without saying that it will have our signature labs. The full course description can be found here. It’s going to be a lot of fun to teach.

The Modernizing Yout SQL Server Infrastructure Boot Camp is $995 and the Always On Availability Groups Boot Camp is $1495. You can get a bundle of both for $2195, a savings of $295. That price includes food, and trust me, you will not go hungry. Space is limited – reserve your spot today!

Both classes currently have their biggest discount – 25%. Don’t miss out!

Still Time to Register for the 2019 First Half Classes

Don’t forget about the classes announced in November – three clases with Europe/UK times in June as well as the SQL Server Availability Solutions in A Cloudy, Virtual World in April with US times.

Want To Stay “In the Know” for SQLHAU’s Training?

If you want to find out about training dates, new classes, and more, sign up to get updates from SQLHAU. While you’re at it, you can also choose to see when new Mission Critical Moment videos are published, get information about upcoming SQLHA webinar, and of course, subscribe to the Mission Critical Update – our newsletter. The first issue will go out this week, and for training, there’s an exclusive offer that you will not find anywhere else and applies to all SQLHAU classes currently scheduled.

Read Only Filegroups and Availability Groups

By: on December 13, 2018 in Always On, AlwaysOn, Availability Groups, Read Only Filegroups, SQL Server | 2 Comments

A question came across my inbox this week which I decided to investigate: what happens if you have read only filegroups as part of your database, and you want to use Always On Availability Groups? Let’s find out.

First, I have a database with two filegroups: one read write (PRIMARY) and one read only (ROFG).

Figure 1. Filegroups

I also have a table (ROTable) created on ROFG.

Figure 2. Table on a read only filegroup

I then created an AG named ROFGAG. I selected automatic seeding (more on that in a second) to initialize my secondary replica. According to the Wizard, all is good … or is it?

Figure 3. Successful AG creation

I checked the status of the AG and it wasn’t synchronizing. Seeing as this is not my first rodeo, I started looking and lo and behold, the database was not joined to the AG even though the Wizard said everything was OK at the end of creation.

Figure 4. DB not joined to the AG

So there appears to be a bug in the AG creation where if you have a read only filegroup and select automatic seeding, it didn’t initiate the seeding for some reason. Looking in the SQL Server log on ALEX,it gave me the telltale permissions issue.

Figure 5. Seeding permission not granted

Granted, this is SQL Server 2016 where there’s the 3 minute “issue”, but that generally worked OK through the Wizard. I manually issued the ALTER AVAILABILITY GROUP ROFGAG GRANT CREATE ANY DATABASE and lo and behold …

Figure 6. Seeding initiated

You can also manually back up the database, copy it to the other server, and restore it to the instance that is the secondary replica. Either way, once the database was on the secondary replica, with seeding , it joined to the AG just fine. If you manually restore the database with NORECOVERY, the database has to be joined to the AG.

Figure 7. Join the DB to the AG

I inserted a few rows into a table on the read write filegroup, and everything is synchronized as expected. Keep in mind that on the secondary replica you have the same files as the primary and both filegroups. So having a read only filegroup works just fine with an AG.

Figure 8. AG status

I now deviated from the plan. What if you just want the read write filegroup on the secondary replica, and not everything? Does that work? To do this, I removed ALEX as a secondary replica and then deleted ROFGTestDB from that instance. On ALEX, I then restored ROFGTestDB, but just the read write filegroup. That means I wouldn’t get ROFG or its corresponding .ndf file.

Figure 9. Restoring just the read write filegroup

Since I restored the database, in the Wizard I just selected the Join only option for initial data synchronization. SQL Server thinks all is fine … or is it?

Figure 10. Adding the replica with no read only filegroup

Unfortunately, the AG is not synchronizing even though it joined supposedly with no issues. The AlwaysOn_health extended events trace shows the join, but no other information indicating a problem. The same holds true for the SQL Server log. I was inserting rows into a table, and that DDL executed fine, so the issue is apparently not on the primary.

Figure 11. All is not well

At this point, I removed the secondary replica from the AG. That completed successfully, but after doing that, ROFGTestDB on ALEX (where the restore without the read only filegroup happened) is showing as Suspect.

Figure 12. ROFGTestDB is suspect

The only difference is that before when things worked, all of the filegroups were restored. I deleted ROFGTestDB from ALEX, and then restored it from a backup WITH NORECOVERY with both the read write and read only filegroups.

Figure 13. ROFGTestDB restored

I then joined ALEX to ROFGAG, inserted a few rows, and everything works as expected.

Figure 14. AG working

It looks like AGs work with DBs that have read only filegroups, but you have to make sure you have all files and filegroups. If you restore the database with only the read write filegroup(s), it appears the AG will not synchronize. I plan on following up with Microsoft to see if this is expected behavior, but it is definitely interesting because it reports success when it really is failing. That also means that at least the way things work today, you can’t have a smaller footprint on a secondary replica – it’s all or nothing.

Hope this helps some of you!