Maintaining Quorum During Windows Patching and Updating
I’ve been working with a customer to deploy their new Windows Server 2008 R2 SP1-based failover cluster that will have both SQL Server 2008 R2 and SQL Server 2012 instances. One thing I always talk about in my classes, as well as in the presentations where I focus on patching, is this: you must maintain quorum during the patching and updating cycle. Quorum is one of the most important considerations in general for your clustered deployments (whether you’re implementing traditional failover clustering instances or availability groups), and quite frankly, I find there’s a real lack of knowledge about it out there. It’s so important it’s most likely getting a whole chapter in my upcoming book Mission Critical SQL Server 2012.
I have talked to customers that have had SQL Server outages because their Windows guys, when patching the underlying nodes of the Windows Server failover cluster (WSFC), rebooted too many at once. That left too few voters up, which brought the WSFC (and SQL Server) down when it wasn’t expected. This is NOT a scenario you want. While in theory it’s easy to control and keep track of when things reboot, this little gremlin in Figure 1 pops up after a little while reminding you it needs to be done:
That’s all well and good, but an unskilled administrator who happens to see that message when they log on to the server may think nothing of it and just click Restart now. Yikes! The reminder interval can be altered as shown in Figure 2, but you still need to worry about the reboot being done.
There’s another dark underbelly I’ve seen these dialog boxes associated with: automatic updating of Windows via Windows Update. That is a much scarier situation because it means your servers could be updating 24×7 on their own, and some of those updates may even force reboots. Check your settings! I still see production servers at client sites that have automatic updating enabled. If yours do, you need to have a serious discussion with your server admins as to why this is being done. Automatic updating is not in line with mission critical. Yes, it’s important to have a patch management strategy that keeps you up-to-date, but doing it smartly is the key. Not blindly.
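To make the "rebooted too many at once" problem concrete, here is a minimal Python sketch of the vote arithmetic. It assumes a simple majority model in which every node has exactly one vote, an optional witness (disk or file share) adds one more vote and stays online throughout, and vote counts never change while you patch. The function names are hypothetical and purely for illustration. This is not a cluster API, and it says nothing about where your SQL Server instances happen to be running at the time.

```python
def votes_needed(total_votes):
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def max_down_at_once(total_votes):
    """How many voters can be offline together while the majority holds."""
    return total_votes - votes_needed(total_votes)


def patch_batches(nodes, has_witness=False):
    """Split nodes into reboot batches that never drop below a majority.

    Assumes the witness (if any) stays online for the whole patch cycle.
    """
    total_votes = len(nodes) + (1 if has_witness else 0)
    batch_size = max_down_at_once(total_votes)
    if batch_size < 1:
        # For example, two nodes and no witness: losing either node loses quorum.
        raise ValueError("no node can be rebooted without losing quorum")
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]


if __name__ == "__main__":
    # Three voting nodes, no witness: only one may be down at a time.
    print(patch_batches(["node1", "node2", "node3"]))
    # Four nodes plus a witness: five votes, so two nodes may be down together.
    print(patch_batches(["node1", "node2", "node3", "node4"], has_witness=True))
```

The batch size here is the absolute ceiling, not a recommendation. Rebooting one node at a time and confirming it has rejoined before touching the next is the more conservative approach.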
We are currently working on moving our entire datacenter to another location. Because we have never shut everything down before, we recently did a “dry run” of shutting it completely down, waiting a few minutes, then bringing everything back online. We did it in a dependency order, both ways. When everything was back online, we had 2 clusters that had lost the quorum. One was easily recovered – my production SQL cluster. But our file/print cluster was not so lucky. We had to attach a new drive to be used for the quorum. It’s back online, but still having some problems. My question is, do you have a recommendation of how to do this so we don’t lose the quorum? Or at least minimize our risk? Thank you.
This is honestly more than a blog post answer but I’ll do my best here. Moving data centers is never fun. Both Ben and I have helped customers do it numerous times and it can bring out the best and the worst.
A big part of success is related to knowing your environment. Monitoring is another piece of the recipe. Quorum relies on your knowledge (i.e. “we’ve got 3 nodes, and we’re using Node Majority”), but getting notified when you are hitting a tipping point is crucial. That would certainly minimize your risk immensely because you could act before the “bad” happens.
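As a rough illustration of what "hitting a tipping point" means, the sketch below classifies a cluster by how many voters are currently online. It is a minimal sketch that assumes a plain majority rule with one vote per voter. The function name and the status labels are hypothetical, and the vote counts would have to come from whatever monitoring you already have.

```python
def quorum_status(total_votes, votes_online):
    """Classify quorum health under a simple majority rule.

    "ok"            : at least one more voter can be lost safely
    "tipping point" : exactly a majority is online; one more loss drops quorum
    "lost"          : fewer than a majority of voters are online
    """
    if not 0 <= votes_online <= total_votes:
        raise ValueError("votes_online must be between 0 and total_votes")
    needed = total_votes // 2 + 1  # strict majority
    if votes_online < needed:
        return "lost"
    if votes_online == needed:
        return "tipping point"
    return "ok"


if __name__ == "__main__":
    # The "3 nodes, Node Majority" case mentioned above.
    for online in (3, 2, 1):
        print(online, "of 3 online:", quorum_status(3, online))
```

An alert on "tipping point" is the notification that matters, because it fires while there is still time to act.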
Losing quorum should be preventable most of the time if you are keeping an eye out. Also, it has to be said: your admins need to know what they are doing and there needs to be good communication.
As you point out, it only takes once to realize how important quorum can be.
I do a lot of work with customers to come up with these strategies, and every one is different.
I have a problem with my Windows 2008 2-node cluster running SQL 2008 R2 – at first, SQL was only patched on the virtual node. This has been addressed (after failing over multiple times and reading the error log), but automated patching is still being used, and after any SAN reconfiguration or server patch, it starts failing over randomly. There was a file share which failed first, so that has been removed and the server hasn’t failed over lately, but I can’t help but think the root cause has not been determined. I suspect automated patching with Shavlik, but can find no references…any theories? (Also, the registry is crazy-weird.)
Without access to your systems, it’s hard to say exactly what is going on. I’m not sure what you mean by virtual node here. There is no such concept with WSFC or FCIs. That said, it sounds like you have a lot of stuff going on – automatic patching of Windows servers, storage updates, etc. It sounds like you haven’t tested this out and are seeing the results. Helping you diagnose this sounds like more of a consulting engagement than a blog post reply, unfortunately.