Anti-Virus + Windows Failover Clustering + SQL FCIs = Bad
Recently, I helped a customer plan and configure a three-node, multiple-instance failover cluster. The Windows failover cluster (WSFC) had four instances of SQL Server (3 x 2008 R2 with SP1 and 1 x 2012 with CU2). However, when they went to enable their anti-virus on the cluster nodes (Symantec if you must know) and set the exclusions I document in my book, things went south. Pretty much everything went offline. They were able to get nearly everything back up, but SQL Server Agent for one instance just wasn’t working. I just got off a troubleshooting call with them and I think we fixed that (I’ll know for sure later once they do a few other things). If you think this is an uncommon event, it’s not, as evidenced by this quick Twitter exchange I had:
With most customers, anti-virus is a standard piece of kit to install on all servers no matter what. The base rationale makes sense. I mean, who wouldn’t want to protect their servers from potentially bad stuff, right? Here’s where I lose the plot a little bit. First and foremost, your production servers are not desktops. Joe and Jane Administrator should not be surfing the ‘net from there. Assuming they aren’t, the risk of picking up things like malware is probably really small. If they are, you’ve got to fix your processes and lock things down. I’m not saying it can’t happen, but it shouldn’t. If you’re on the server and need to use the Internet to go read a support KB, that’s perfectly acceptable usage. But to check their Hotmail/gmail/match.com/watch porn/personal stuff? Hell no. Most admin work should be done from workstations with remote tools – you shouldn’t have to log onto or RDP into the server for most tasks. Logging on directly drives up the risk of something going wrong.
The other aspect of this is really file-based. Think about it – we get documents and download files from various places. That’s arguably how a lot of the whole malware thing started. A production SQL Server server (that always looks weird to me) should not be a file repository whether it is clustered or not. I will always say that in most cases, a production box with SQL Server on it should be dedicated to that purpose only. I know people want to maximize usage and you can create a clustered file share, but I’m assuming here that in most cases where you deploy FCIs, it’s on dedicated WSFCs just for SQL Server. If that is the case, why would you need anti-virus, especially if you’re monitoring to ensure people are not doing rogue things on the cluster? Again, it’s easy to minimize your risk exposure by having good processes. Imagine that.
If your company is still not convinced by these rational arguments, have a quick gander at these KB articles from Microsoft:
- KB309422 – How to choose antivirus software to run on computers that are running SQL Server
- KB250355 – Antivirus software that is not cluster-aware may cause problems with Cluster Services
Pay particular attention to this bit in 250355:
You can run antivirus software on a SQL Server cluster. However, you must make sure that the antivirus software is cluster-aware. Contact your antivirus software vendor about cluster-aware versions and interoperability.
Not all anti-virus programs are created equal. So make sure if you need to install anti-virus, you get the right thing. If what you have is not cluster-aware, caveat emptor.
Let me add to this cautionary tale: back in the SQL Server 2000/2005 days on Windows Server 2003, I was working with another company. We got things up and running on the cluster with no issue. I told the team responsible for anti-virus how to configure it for the clustered instances of SQL; not much more I could do. Right after we went into production, a day or two in, my databases were marked as Suspect. I did a bit of digging and the smoking gun wasn’t hard to find: right in the Event Log, it was clear as day – a virus scan kicked off at about 11AM and minutes later my databases decided to take a dirt nap, and I had to restore them from backup. They were permanently corrupted due to the scan kicking off. The admins swore up and down it wasn’t them (even though I had the proof). Things were fine for one day and I bet you can guess what happened the next day at the same time. I was livid and took this up the chain. The problem eventually got fixed, but it should have never happened to begin with. That was largely a political issue at the end of the day, but it also speaks to the fact that not everyone knows about clusters and SQL Server.
This is a major reason you set filters for things like the .mdf, .ndf, and .ldf files even on standalone SQL Servers. The problem is magnified on clusters: when an instance fails over to another node, that node takes ownership of the disks associated with the instance, and the anti-virus program is going to now see them and say, “Awesome! New files to scan.” That will not only hold up your instance starting, but see my previous paragraph – you may get more than you bargained for. Not only may you blow SLAs, you have the potential of hosing databases. Just what every DBA needs.
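If you are stuck with anti-virus, it can help to sanity-check an exclusion list against the files that actually live on an instance's disks. What follows is only a rough sketch in Python, not anything from a vendor's tooling: the function name `files_still_scanned` and the sample paths are made up for illustration, and the extension set contains only the three extensions mentioned above. Extension filters are also only one kind of exclusion, so treat this as a starting point and follow your vendor's and Microsoft's guidance for the full list.

```python
from pathlib import PureWindowsPath

# Only the extensions named in this post; this set is an assumption about
# what your exclusion policy contains and is almost certainly incomplete.
EXCLUDED_EXTENSIONS = {".mdf", ".ndf", ".ldf"}


def files_still_scanned(paths, excluded=EXCLUDED_EXTENSIONS):
    """Return the paths whose extension is NOT covered by the exclusion set.

    The comparison is case-insensitive because Windows file names are.
    """
    wanted = {ext.lower() for ext in excluded}
    return [p for p in paths if PureWindowsPath(p).suffix.lower() not in wanted]


if __name__ == "__main__":
    # Hypothetical file names on a clustered disk, purely for illustration.
    sample = [
        r"S:\Data\Sales.mdf",
        r"S:\Data\Sales_2.NDF",
        r"S:\Log\Sales_log.ldf",
        r"S:\Backup\Sales_full.bak",
    ]
    for path in files_still_scanned(sample):
        print("Not excluded, would be scanned:", path)
```

Running it over a file listing taken from each disk an instance owns gives a quick picture of what a scanner would still touch after a failover hands those disks to another node.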
UPDATE: Also check out Allen Kinsel’s (Twitter) blog post SQL Server 2008 & IPV6 vs Symantec. It shows where Symantec does something even more fun: endpoint protection. As my friend Nic Cain (blog | Twitter) pointed out to me, endpoint protection can be even worse than anti-virus on a clustered configuration – especially since IPv6 is enabled by default on Windows Server 2008 (and up) and you shouldn’t disable it. Allen’s post is probably why my customer’s cluster went south to begin with, and SQL Server Agent for the one instance was just a casualty.
My recommendation to you: DON’T PUT ANTI-VIRUS ON SERVERS WITH SQL SERVER INSTALLED ON THEM AND ESPECIALLY IF THEY ARE CLUSTERS. You’ll be much, much happier.