AGs and WSFC OS Rolling Upgrades: What Works and What Doesn’t
It’s been a crazy summer so far. I haven’t had much time to blog but an interesting scenario came up recently that I wanted to share with everyone. I have also missed writing technical posts, and this is a good one to scratch the itch.
In Windows Server 2016, Microsoft added the ability to do a rolling OS upgrade with a Windows Server Failover Cluster (WSFC). This is Microsoft's official documentation. The SQL Server side of things is not well documented, especially for Always On Availability Groups (AGs), so I ran a few tests. The results are fairly interesting. All of the tests start out with SQL Server 2012 SP4. The goal is to get to Windows Server 2016 and some version of SQL Server, whether that ends up being the same version or a newer one.
A few things you should know
- A cluster OS rolling upgrade works/is supported N-1, so you can upgrade Windows Server 2012 R2 to 2016 or Windows Server 2016 to 2019, but not Windows Server 2012 R2 to 2019.
- Running validation will fail when you add a Windows Server 2016 node to a 2012 R2 WSFC.
- I strongly recommend adding the higher-version nodes from a node that is already running that higher version (in other words, kick off the join from Windows Server 2016 when adding 2016 nodes); a sketch of the node add follows this list. I saw some weird things when I tried to do it the other way around.
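For reference, the node add itself is only a couple of PowerShell lines. This is a minimal sketch, assuming the FailoverClusters module is available; UPGCLUSTER is a placeholder for your WSFC name, and the node names are the ones from my lab.

```powershell
# Minimal sketch - UPGCLUSTER is a placeholder for your WSFC name.
Import-Module FailoverClusters

# Run this from one of the Windows Server 2016 nodes (see the recommendation above).
# -NoStorage keeps the new nodes from trying to claim any clustered storage on join.
Add-ClusterNode -Name '2016UPGN1' -Cluster 'UPGCLUSTER' -NoStorage
Add-ClusterNode -Name '2016UPGN2' -Cluster 'UPGCLUSTER' -NoStorage
```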
One of the reasons this upgrade method is important: in virtualized or public cloud-based architectures especially, you're no longer tied to hardware for OS upgrades, so how do you upgrade clustered configurations of SQL Server? This method is one answer.
Strong Warning
The configuration for a cluster rolling upgrade allows mixed Windows Server versions to coexist in the same WSFC. This is NOT a deployment method. It is an upgrade method. DO NOT use it for normal operations. Unfortunately, Microsoft did not put a time limit on how long you can run in this condition, so you could be stupid and run a mixed Windows Server 2012 R2/2016 WSFC indefinitely. Fire, ready, aim. The WSFC knows about this state and logs a warning with an Event ID of 1548.
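If you want to check whether a cluster is flagging the mixed-mode condition, here is a minimal sketch. I'm assuming the 1548 warning shows up in the System log under the FailoverClustering provider, which is where I'd look first.

```powershell
# Minimal sketch - look for the mixed-OS-version warning (Event ID 1548) on a node.
# Assumes the event is written to the System log by the FailoverClustering provider.
Get-WinEvent -ComputerName '2016UPGN1' -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    Id           = 1548
} -MaxEvents 5 |
    Select-Object TimeCreated, Id, Message
```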
Also, what is shown below when adding replicas to lower versions of SQL Server is NOT a normal deployment method either. It's an upgrade tactic only. Don't be dumb, especially since it's a one-way street once the AG fails over to the newer version.
If it were me, I would do all of this in one weekend if I could. Circumstances may not always allow that, but don't drag it out for very long.
One very good reason not to stay this way: when it comes to the cluster functional level, unless all nodes are at the higher version, you're stuck with the older functionality and behaviors. A good example of a benefit you could not take advantage of is the cloud witness introduced in Windows Server 2016 if the cluster is still at functional level 8 (Windows Server 2012 R2). Level 9 is Windows Server 2016 and level 10 is Windows Server 2019.
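Checking and raising the level is quick. A minimal sketch, again assuming the FailoverClusters module and UPGCLUSTER as a placeholder cluster name:

```powershell
# Minimal sketch - check the current cluster functional level
# (8 = Windows Server 2012 R2, 9 = 2016, 10 = 2019).
(Get-Cluster -Name 'UPGCLUSTER').ClusterFunctionalLevel

# Once every node is on the newer OS, raise it. This is one-way, so do a dry run first.
Get-Cluster -Name 'UPGCLUSTER' | Update-ClusterFunctionalLevel -WhatIf
Get-Cluster -Name 'UPGCLUSTER' | Update-ClusterFunctionalLevel
```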
The Tests
Detailed steps are not here, just the overall process, to see what does and does not work. For a real production move there are things I would add to this list, but that isn't what I was going for here. I also did not necessarily use the latest and greatest updates; I was testing process, so it was pretty quick and dirty. There are also many more variants. You can try those yourself.
Test 1
2 x SQL Server 2012 SP4/Windows Server 2012 R2 nodes
2 x SQL Server 2012 SP4/Windows Server 2016 nodes
High level process:
- Add the Windows Server 2016 nodes to the 2012 R2 WSFC. The WSFC is mixed, and looks similar to the screen shots below.
- I joined the instances on the Windows Server 2016 nodes as replicas to UpgAG. This query output shows that everything has joined and seemingly looks okay. Or is it?
- I then set 2012R2UPGN1 and 2016UPGN1 to synchronous data movement since I’m trying to go to WS2016. That works fine.
- I should be able to manually initiate a failover from 2012R2UPGN1 to 2016UPGN1, right? Wrong. For whatever reason, SQL Server 2012 in this configuration cannot update the possible owners on the AG resource at the WSFC level.
Result – Does not work. I tried to get around this by manually dealing with possible owners, but the configuration was not having it. It looks like if you want to do a rolling cluster upgrade and stay on all SQL Server 2012, you’re out of luck.
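For the curious, this is roughly what poking at the possible owners looked like. A sketch only, assuming the FailoverClusters module and that the AG's cluster resource carries the AG name (UpgAG); it did not rescue Test 1.

```powershell
# Minimal sketch - inspect the possible owners of the AG's cluster resource.
Import-Module FailoverClusters

Get-ClusterResource -Name 'UpgAG' | Get-ClusterOwnerNode

# What "manually dealing with possible owners" amounts to - adding the 2016 nodes.
# In this all-SQL Server 2012 configuration it did not get the failover working.
Get-ClusterResource -Name 'UpgAG' |
    Set-ClusterOwnerNode -Owners '2012R2UPGN1', '2016UPGN1', '2016UPGN2'   # plus your second 2012 R2 node
```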
Test 2
2 x SQL Server 2012 SP4/Windows Server 2012 R2 nodes
2 x SQL Server 2012 SP4/Windows Server 2016 nodes
High level process:
- Add the Windows Server 2016 nodes to the 2012 R2 WSFC.
- Using the configuration from Test 1 that bombed, perform an in-place upgrade of SQL Server to 2016 on the two Windows Server 2016 nodes. Since everything at SQL Server 2012 failed, I wanted to see if this would make a difference.
- Set 2012R2UPGN1 and 2016UPGN1 to synchronous.
- I failed UpgAG over to 2016UPGN1 from 2012R2UPGN1 (the sync and failover commands are sketched after the result below). This was successful, and the SQL Server 2016 replicas can now synchronize with each other, but they can no longer synchronize with the 2012 ones. That makes sense: 2012 is downlevel, and upon failover the databases were upgraded to the 2016 format.
- Remove the SQL Server 2012 replicas from the AG.
- Evict the Windows Server 2012 R2 nodes from the WSFC.
- Upgrade the WSFC functional level.
- Party.
Result – Success! This mixed combo worked and I was left with both Windows Server and SQL Server upgraded with minimal downtime and little fuss.
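The sync-then-failover piece boils down to two statements. A minimal sketch, assuming the SqlServer module and default instances on the node names used here; the MODIFY runs on the current primary and the FAILOVER runs on the target secondary.

```powershell
# Minimal sketch - make the target replica synchronous, then fail over to it.
Import-Module SqlServer

# On the current primary (2012R2UPGN1): set the target replica to synchronous commit.
$syncQuery = @'
ALTER AVAILABILITY GROUP [UpgAG]
MODIFY REPLICA ON N'2016UPGN1'
WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);
'@
Invoke-Sqlcmd -ServerInstance '2012R2UPGN1' -Query $syncQuery

# Once the secondary shows SYNCHRONIZED, initiate the failover FROM the target (2016UPGN1).
Invoke-Sqlcmd -ServerInstance '2016UPGN1' -Query 'ALTER AVAILABILITY GROUP [UpgAG] FAILOVER;'
```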
Test 3
2 x SQL Server 2012 SP4/Windows Server 2012 R2 nodes
2 x SQL Server 2012 SP4/Windows Server 2016 nodes
High level process:
- Add the Windows Server 2016 nodes to the 2012 R2 WSFC
- Join the instances on the Windows Server 2016 nodes as replicas to UpgAG.
- This time upgrade all four instances in place to SQL Server 2016.
- Set both 2012R2UPGN1 and 2016UPGN1 to synchronous.
- Manually fail UpgAG from 2012R2UPGN1 to 2016UPGN1. This succeeds.
- Remove the 2012 replicas from the AG (the cleanup commands are sketched after the result below).
- Evict the 2012 R2 nodes from the WSFC.
- Upgrade the WSFC functional level.
- Drink a beverage of your choice.
Result – Similar to Test 2, this worked with none of the problems of the all-SQL Server 2012 configuration.
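The cleanup at the end of Tests 2 and 3 is also scriptable. A minimal sketch, same assumptions as above (SqlServer and FailoverClusters modules, UPGCLUSTER as a placeholder cluster name):

```powershell
# Minimal sketch - remove the old replicas, evict the old nodes, raise the functional level.
Import-Module SqlServer, FailoverClusters

# On the new primary: drop a SQL Server 2012 replica from the AG (repeat for the other one).
Invoke-Sqlcmd -ServerInstance '2016UPGN1' -Query @'
ALTER AVAILABILITY GROUP [UpgAG] REMOVE REPLICA ON N'2012R2UPGN1';
'@

# Evict a Windows Server 2012 R2 node from the WSFC (repeat for the other one).
Remove-ClusterNode -Name '2012R2UPGN1' -Cluster 'UPGCLUSTER' -Force

# Raise the cluster functional level - one-way, so only after all old nodes are gone.
Get-Cluster -Name 'UPGCLUSTER' | Update-ClusterFunctionalLevel
```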
Test 4
2 x SQL Server 2012 SP4/Windows Server 2012 R2 nodes
2 x SQL Server 2017 CU 15 + GDR/Windows Server 2016 nodes
High level process:
- Add the Windows Server 2016 nodes to the 2012 R2 WSFC
- When joining the SQL Server 2017 instances on the Windows Server 2016 nodes as replicas to UpgAG, there is a different behavior. Note the highlighted join state (a query to check it yourself is sketched after the result below). That is not good. Can it be changed?
- Setting things to synchronous does not seem to help.
- Changing ownership of UpgAG to 2016UPGN1 does not move the needle much … but it moves it a little. Note that 2016UPGN2 now has a status of JOINED_STANDALONE.
- Removing the 2012 replicas from the AG has no change on 2016UPGN1.
- Failing UpgAG over to 2016UPGN2, however, gets everything where it should be for a join state.
- Evict the 2012 R2 nodes from the WSFC.
- Upgrade the WSFC functional level.
Result – Success! This one was not straightforward, and I do not know why joining a SQL Server 2017 replica to a lower-version AG in this configuration gives a status of NOT_JOINED. Clearly that did not happen in Test 2 or Test 3. That is a question for the dev team and another day.
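If you want to watch the join state yourself rather than relying on a screenshot, it comes from sys.dm_hadr_availability_replica_cluster_states. A minimal sketch via the SqlServer module (default instance names assumed); run it against whichever replica you're curious about.

```powershell
# Minimal sketch - report join state and sync health for every replica in the AG.
Import-Module SqlServer

$query = @'
SELECT rcs.replica_server_name,
       rcs.join_state_desc,
       ars.role_desc,
       ars.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_cluster_states AS rcs
JOIN sys.availability_groups AS ag
  ON ag.group_id = rcs.group_id
LEFT JOIN sys.dm_hadr_availability_replica_states AS ars
  ON ars.replica_id = rcs.replica_id
 AND ars.group_id   = rcs.group_id
WHERE ag.name = N'UpgAG';
'@

Invoke-Sqlcmd -ServerInstance '2016UPGN2' -Query $query | Format-Table -AutoSize
```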
Final Thoughts
While three of the configurations worked in the end, one of those (Test 4) gave me a bit of heartburn, and the remaining one (Test 1) just didn't work. As always, you should plan upgrades and test them before doing them on your production boxes. If you're looking for help on upgrades or anything else SQL Server or infrastructure-related, reach out to us. We'd love to help you get to your destination!