Bug – Combining Failover Clustering & Log Shipping When Programs Are Installed On Another Drive
A customer of mine contacted me this week about a problem they were having. They have a two-node failover cluster on Windows Server 2008 R2 with an instance of SQL Server 2008 SP1. They installed SQL Server’s program files to another drive, not the main system drive (C), and then configured log shipping. Everything worked fine until they tested failover. When the instance was failed over to the node that was not the first one installed, log shipping stopped working. When they failed it back, everything worked again. After a WebEx session, it looked like Setup didn’t put everything in the right place on the other node. Before I rushed to any conclusions, I needed to reproduce the problem to see whether something had simply gone wrong in their setup. Here’s what I did.
1. I created a two-node Windows Server 2008 R2 cluster with a slipstreamed SQL Server 2008 SP1 instance.
Here is where I installed the programs to:
and on the next dialog
So far, so good.
2. I configured log shipping to an instance on another server while the instance was on the original (first) node it was installed on. Everything worked great.
3. I failed the instance over to the other node, and log shipping failed.
4. I failed the instance back to the original node. Lo and behold, log shipping worked again.
So what happened?
If you look at the job step for the transaction log backup job, here’s what it is calling:
"Z:\Program Files\Microsoft SQL Server\100\Tools\Binn\sqllogship.exe" -Backup 6B81BF42-4AA8-4DE3-8349-5E54EE0C52ED -server KILROY
Here is what the programs look like on the first (original) install node for Drive Z. Note the shared tools directory with SqlLogShip.exe.
Here is what the programs look like on the second node (add node) install:
C drive showing the shared tools directory with SqlLogShip.exe.
Z drive showing no shared tools directory.
So it’s pretty clear to me that the Add Node operation is not putting the files in the same place even though the other node has the same drive structure, thus causing log shipping to stop working in a failover.
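If you want to confirm this on your own cluster, one quick check is whether the executable named in the job step actually exists on each node. The following is only a minimal Python sketch, not anything that ships with SQL Server: the helper names `extract_executable` and `executable_present` are hypothetical, and it assumes you have copied the job step command text by hand and that the path uses the usual backslash separators.

```python
import os
import re


def extract_executable(command: str) -> str:
    """Return the executable path from a job step command line."""
    command = command.strip()
    if not command:
        raise ValueError("empty command line")
    # A quoted path may contain spaces, so take everything between
    # the first pair of double quotes.
    quoted = re.match(r'"([^"]+)"', command)
    if quoted:
        return quoted.group(1)
    # Otherwise the executable is the first whitespace-delimited token.
    return command.split(None, 1)[0]


def executable_present(command: str) -> bool:
    """True if the job step's executable exists on the machine running this."""
    return os.path.isfile(extract_executable(command))


if __name__ == "__main__":
    # Job step text pasted from the log shipping backup job (assumed input).
    step = (
        r'"Z:\Program Files\Microsoft SQL Server\100\Tools\Binn\sqllogship.exe"'
        r" -Backup 6B81BF42-4AA8-4DE3-8349-5E54EE0C52ED -server KILROY"
    )
    path = extract_executable(step)
    status = "found" if executable_present(step) else "MISSING"
    print(f"{path}: {status}")
```

Run it locally on each node. If the path is reported as missing on the node added with Add Node, the job step will fail whenever the instance is running there.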
If you install everything to the system drive (such as C), everything works fine, so that is a workaround. But I know some of you like to put program files in places other than the system drive.
I have written this up over on the Connect site here, so if you want to try to get this fixed, vote for it!
The funny thing about this is that on the initial install I selected everything to go to the Z drive, but it still installed some files to C. Interesting.
I have now stumbled onto multiple issues with SQL Server 2008 SP1 on a Windows cluster when using the integrated install with Add Node, all due to the placement of the files when you choose a non-standard drive: http://connect.microsoft.com/SQLServer/feedback/details/491109/subsystem-powershell-could-not-be-loaded.
I then found your article, which is going to cause me grief when I deploy my DR with log shipping. Based on the research that you did and the text from the first article, the only solutions are:
1. Wait for the fix, which is supposed to be in SP2, due 3Q2010.
2. Roll the cluster to the second node installed, then log onto the initial node, do an uninstall and remove the files, then come back through and do Add Node to get the files onto C, and then fix the issue where everything still points to my non-standard drive.
3. Uninstall everything and follow the Advanced/Enterprise Solution.
The first article would have me perform the fix on the initial node, removing it and adding it back on C, but if I am reading your article correctly, only your initial install does not have the log shipping issue.
My summary on all this is either C: or Advanced Enterprise Solution … Do you agree?
1. In theory Cluster Prep should work since the prep part makes you specify the path. I haven’t tested that as a workaround.
2. Putting everything on the system drive does work. This is the most common configuration at most customers anyway. Is there a reason you did not do this in the first place?
3. As for rolling to the second node and that whole uninstall thing: that won’t work. It’s the Add Node (second node) process that’s broken, not the initial install.