By: Allan Hirt on March 12, 2019 in SQLHA, Webinar | No Comments
As we first announced in the inaugural issue of our newsletter, Mission Critical Update, we will be holding free 30-minute webinars every other month starting in April. The first one will be on Wednesday, April 24 at 11AM Eastern/8AM Pacific/3PM GMT. We will send out details on how to join as the webinar gets closer.
Here is the information about the webinar:
The Only Constant Is Change
Whether you are at the top of the corporate ladder or an administrator, everyone has to manage change. That change comes in many forms including:
- Applying patches to resolve issues and improve reliability as well as security updates to keep your systems and business data safe
- Upgrading/migrating to new major versions of Windows Server, Linux, and SQL Server
- Taking advantage of evolving hardware-based options such as hyperconverged solutions on premises and zones in the public cloud as well as new features to improve your systems and applications
- Planning and deploying an updated architecture for the solution to make everything work in harmony
Each of these areas is a moving target. Strong organizations embrace change with well thought out plans and make the most of it with the support of management. With all of that in place, the disruptions to the business are minimized. Want to be that type of organization? Sign up for this free 30 minute webinar from SQLHA® to hear mission critical experts Allan Hirt and Max Myrick talk about the effect of change and how to stay on top of it. You will also get a sample high level plan to see how all of this can be achieved.
Want to get notified of upcoming webinars? Don’t want to miss an issue of Mission Critical Update? Looking for the latest on training and need to know when a Mission Critical Moment is published? Subscribe today and choose what you would like to hear from us. Mission Critical Update #2 is going out later this week – don’t miss out!
By: Allan Hirt on March 8, 2019 in Conference, SQLbits, SQLHAU, Training | No Comments
Happy Friday. Things are finally calming down over here at SQLHA HQ. Last week I was in the UK at SQLBits 2019 where I delivered a Training Day on Thursday, February 28 and a regular session on Friday, March 1. I’ve always loved hopping across the pond to speak at Bits; it’s one of my favorite conferences. The atmosphere is always great, and the organizers do such a good job. It also doesn’t hurt that I love going to the UK. It was my first time in Manchester. Unfortunately I didn’t get much time to explore. I arrived in London on Monday night and took the train up to Manchester on Tuesday. I spent the next two days working (the inside of my hotel room was lovely), did the Training Day, my session, and headed back down to London before flying home.
The Bits venue was really lovely – it was in the old (and converted) Manchester Central train station. I’m a train guy, so it thrilled me to no end. You can see it in the picture later in this blog post. The Training Day went smoothly – I had nearly a full room. I never take it for granted that after all these years and conferences around the world, people still want to show up and hear me speak. So thank you one and all who came out to the Training Day which was not free. I know you have a lot of choice, and am honored that you selected me.
SQLBits this year made the decision to have 50 minute sessions. Anything under an hour can be … challenging, but I felt really good about my regular session “Common Troubleshooting Techniques for AGs and FCIs”. The Bits team already has the session uploaded, so feel free to view it at the link above. The audience was great, and I had a bunch of questions after. Luckily it was break time so people could ask me as I was tearing down. I also wanted to get off the stage quickly so the next presenters could get set up. I’m considerate like that!
The sign for the room
I hope to be able to go back in 2020, but if you ever get the chance, I highly recommend attending at least one SQLBits conference in your career.
Getting home was a bit of an adventure. There was snow back home in MA, and for the first time in a long time, I did not fly direct from London. I went via JFK. My flight from JFK to Boston was cancelled, so I wound up staying the night in New York City and taking Amtrak home. It was not that big of a deal, and as someone who is on the road a lot, par for the course. These things happen. ProTip: if this happens to you, be nice to agents at the airport or on the phone. They’re having a crappy/stressful day, too, since they’re dealing with everyone else who is being affected. You get more with honey than vinegar.
The Training Day featured an all new lab that I put together, and was the first time in a few years I’ve done a Training Day with a lab. You may be wondering why I sometimes have preconference sessions (nee Training Day in Bits-speak) without labs. I was one of the first (if not the first) to pioneer the use of labs on this scale in the SQL Server community, and it’s nice to see some others doing it now, but it’s never a slam dunk. Why?
First, before I even think of putting a lab together, I work with the conference organizers to see if the venue can support approximately up to 100 people doing labs. Since that many people banging on WiFi will consume a lot of bandwidth, it would be a miserable experience for all involved (including yours truly) if connectivity sucked.
Second is the cost involved. There’s a cost associated with each student not only for the backend lab stuff, but also to the conference organizers for the bandwidth. If it’s ultimately going to be too expensive, we won’t do it. Nobody is in the business of losing their shirts on these things. It’s much more economical now than the first time I did this at Bits and Summit five or six years ago.
Assuming the cost and the infrastructure is there, is it worth doing a lab? Let me say this: when others have said that labs were a lot of work and not worth it (some of whom have changed their minds since …), I did them. I’ve always believed in the power of hands on learning. I did labs for private and public classes when I had to bring (or send) an external hard drive with VMs and load them onto PCs with hours and hours of setup. That does not scale for so many reasons, not the least of which is the size of laptop needed to do my labs that way isn’t a low end spec. I think back to some of the classes I delivered in Australia where I arrived days earlier, had to load PCs up, test them, etc. Doing labs was, and still is, a big commitment.
For the past 5 or 6 (maybe 7?) years, my labs have been done through a browser. Everyone still gets their own set of virtual machines, but they no longer need to be loaded on everyone’s PC and use a hypervisor locally. Soon I’ll be offering Azure- and AWS-based labs in addition to VM-based ones for my classes and possibly precons … stay tuned!
There’s the whole instructional design part of this which is putting together the VMs (and getting them all to the right point …) as well as writing the lab manual. For one day of training, it’s hard to get a lab that works in 60 – 90 minutes knowing people are at different skill levels, but you want to do something that’s meaningful and not just “click-click-click you built something but you didn’t learn anything”. Needless to say, labs were and still are a big time investment on my end, and I feel they are worth the effort. It’s gratifying to see people loving them as they are working their way through.
Finally, with a one day class, you have to balance instructional content to put a lab in, so you need to design that part of the day even better since you’re giving up roughly 25% of your time to give that hands on experience.
I will say after all these years, it never gets old watching a ton of people all access VMs (across 100 people it’s easily 400 VMs or more) simultaneously. That’s a lot of horsepower, and also why I need to know well in advance if it’s going to happen because the folks running the backend need to ensure there’s enough horsepower reserved for the day. It’s not just “show up and run labs”.
Attendees doing labs at my SQLBits 2019 Training Day
If you want the best training and labs, sign up for one of the upcoming SQLHAU classes that are being delivered live online, or come see me in person during the Chicago dates in August. Use the code BLOG20 to get 20% off the April online class “SQL Server Availability Solutions in a Cloudy, Virtual World” which does have a lab. The discount is good through March 31. You can also subscribe to not only get our newsletter, but also get notified when we have new training or other training-related items. Sometimes we offer subscriber-only discounts 🙂
VMware vExpert 2019
I’m pleased to announce that yesterday I was re-awarded vExpert for 2019. Like people showing up for things like the SQLBits Training Day, I never assume that being renewed is automatic. Thank you, VMware!
By: Allan Hirt on February 11, 2019 in Always On, AlwaysOn, Automation, Availability Groups, Azure, FCI, SQL Server, Template, Windows Server Failover Cluster | 3 Comments
I was working on some things this past weekend for a few upcoming projects, and one of those involves Azure and automation. Anyone who knows me knows that I will praise when necessary, and call out when something is not quite right. Microsoft’s Azure, Windows Server, and SQL Server teams earned my “What Were They Thinking?” badge.
What Is This About, Allan?
Microsoft has published a few solutions to automagically build AG solutions for you (there are none for FCIs right now) up in Azure using IaaS VMs. The one below has been around for some time and is easily found in the Portal.
Figure 1. Azure Template for Creating a Full AG Solution
Let me digress for a moment and say how this solution in my opinion is not quite kosher for a production deployment:
- AlwaysOn is not the feature name. Always On Availability Groups, or just Availability Groups, is the name of the feature. Always On has had a space now going on nearly five years. I may have a few blog posts about this somewhere 😉 (here is one example)
- No customer I know is going to build separate Active Directory Domain Services (AD DS) servers just for an AG; they’re going to have existing ones that they will use.
- The template only allows you to select Developer or Enterprise Editions of SQL Server, not Standard. Since this is a two-node only configuration, I’m not sure why this was not updated for SQL Server 2016 and later.
- You cannot choose what kind of load balancer that gets created for the AG listener.
- No load balancer is created for the WSFC.
- Some regions now have Availability Zones (AZs) which is better than Availability Sets (AS). The template has not been updated to reflect that.
- Cloud witness! ’nuff said. Building a FSW here is totally valid, but this is an Azure solution. This was built pre-Windows Server 2016 which is when cloud witness was introduced.
I ran the template and it took 1 hour, 1 minute, and 52 seconds to complete.
In theory, this particular template is an ok(ish) solution if you want to kick the tires on AGs in a non-production way and see what they are all about purely from an AG perspective. However, this solution they put together is VERY old (I think about five years at this point) and outdated, not to mention people generally deploy AGs after they have databases. If Microsoft wants people to use this, they should update it to reflect a more modern architecture and have the ability to use things like Standard Edition, AZs, and cloud witness.
More recently, Microsoft released a few new things they blogged about the past few months: Automate Always On availability group deployments with SQL Virtual Machine resource provider from December 2018 and Simplify Always On availability group deployments on Azure VM with SQL VM CLI from February 2019.
The workflows for the last two links are a bit … odd. It’s just so much easier to create the WSFC in guest and it solves the major problem I’m about to describe below which prompted this post. If I’m already in the guest, outside of needing to do any load balancer stuff, why would I do stuff in Azure? It’s not really easier and you probably already have PowerShell, T-SQL, or other scripts to do most of this. Some of this feels like a solution looking for a problem that doesn’t really exist. Choice is good but …
Sound off in the comments if you agree or disagree. I’m curious to see what people think.
The Real Problem
I had a good look at the Desired State Configuration (DSC) module for Windows Server Failover Clusters (WSFCs) which is called xFailOverCluster. This is mostly the heart of the matter. The latest version as of this blog post is 188.8.131.52. Specifically, I was seeing what it could and could not do, and there is one major chunk missing from it: validation. The big Azure template I complain about above also does not run validation. Why is this a bad thing?
Look at Microsoft KB327518 “The Microsoft Support Policy for Clustered Configurations of SQL Server with Windows Server”. That links to KB2775067 “The Microsoft support policy for Windows Server 2012 or Windows Server 2012 R2 failover clusters”. That KB also applies to Windows Server 2016 and 2019. Focus on this line:
“The fully configured failover cluster passes all required failover cluster validation tests. To validate a failover cluster, run the Validate a Configuration Wizard in the Failover Cluster Manager snap-in, or run the Windows PowerShell cmdlet Test-Cluster.”
What does this mean? To have a supported WSFC-based configuration (doesn’t matter what you are running on it – could be something non-SQL Server), you need to pass validation. xFailOverCluster does not allow this to be run. You can create the WSFC, you just can’t validate it. The point from a support view is that the WSFC has to be vetted before you create it. Could you run it after? Sure, but you still have no proof you had a valid configuration to start with which is what matters. This is a crucial step for all AGs and FCIs, especially since AGs do not check this whereas the installation process for FCIs does.
If you look at MSFT_xCluster, you’ll see what I am saying is true. It builds the WSFC without a whiff of Test-Cluster. To be fair, this can be done in non-Azure environments, too, but Microsoft gives you warnings not to do that for good reason. I understand why Microsoft did it this way. There is currently no tool, parser, or cmdlet to examine the output of Test-Cluster results. This goes back to why building WSFCs is *very* hard to automate.
Knowing this, I would change all of this to build the AG (or FCI) VMs with the Failover Clustering feature enabled, then validate and build the WSFC inside similar to what is in the workflow for building the AG on your own. So it’s still a mix of automation and some minor human intervention.
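The ordering described above (validate, have a person sign off, and only then build) can be sketched as a small gate. This is a minimal illustration in Python, not anything Microsoft ships: the cluster name, node names, and the `ClusterPlan` helper are hypothetical, and the code only composes the command lines for the real `Test-Cluster` and `New-Cluster` cmdlets. It does not parse the validation report, since that is exactly the piece that has no tooling today, and a real Azure build would likely need more `New-Cluster` parameters than are shown here.

```python
from dataclasses import dataclass


@dataclass
class ClusterPlan:
    """Hypothetical description of the WSFC to build (all values are illustrative)."""
    name: str
    nodes: list


def validation_command(plan: ClusterPlan) -> str:
    """Compose the Test-Cluster call that must be run before the WSFC is created."""
    return "Test-Cluster -Node " + ",".join(plan.nodes)


def build_command(plan: ClusterPlan, report_reviewed: bool) -> str:
    """Return the New-Cluster call only once a person has signed off on validation.

    Nothing here reads the validation report; the sign-off is the manual step.
    """
    if not report_reviewed:
        raise RuntimeError("Validation report not reviewed; refusing to build the WSFC")
    return "New-Cluster -Name " + plan.name + " -Node " + ",".join(plan.nodes)


# Assumed example values, not taken from any real environment.
plan = ClusterPlan(name="AGCLUSTER", nodes=["SQLNODE1", "SQLNODE2"])
print(validation_command(plan))
```

The point of the gate is that the build command simply is not available until someone has looked at the validation results, which keeps the human step from being skipped by accident.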
MSFT_xCluster also has another issue in my mind in parsing the code: it seems like it only handles Active Directory Domain Services (AD DS)-based WSFCs. If you wanted to build a Workgroup Cluster variant of a WSFC that does not require AD DS, you are out of luck. This is acknowledged in that MS blog post from February I link above, and at least they call it out. Kudos.
We only support AD domain joined Windows Failover Cluster definition. The FQDN is a must have property and all AG replicas should already be joined to the AD domain before they are added to the cluster.
All of this feels a bit like a case of fire, ready, aim, or more specifically – deploy, understand supportability, automate.
Can You Still Automate AG Deployments Using What MS Provided?
If you are looking at non-production environments such as development and QA, use anything and everything I criticize above since supportability generally is not an issue there. You’re not deploying production systems in the truest sense (i.e. end user/customer facing), but keep in mind they are production systems for your developers and testers.
If you build the base Windows Server IaaS VMs, get through validation, and want to automate beyond that, you’d have a fully supported solution even with the WSFC and AG portions fully automated.
That said, if you know what you’re doing, building all of this yourself won’t take much more time and may even take less time – especially the WSFC piece. You can automate it yourself in different ways. Building a WSFC really does just work these days when it’s done right (kudos to the Windows Server dev team and the HA PMs). Do what works for you; if what Microsoft provides works for you, go nuts. Just know there’s more than one way to approach this problem.
The Bottom Line and What Microsoft Needs to Do
Automation has come a long way but we’re not fully there yet for clustered and supported configurations of SQL Server running on Windows Server up in Azure or any of the public clouds for that matter. Here’s what needs to happen:
- Fix things so that Test-Cluster is run and the output is checked before building the WSFC and the AG.
- Should Microsoft deem it acceptable to support these automated methods already out there for production builds, they need to say so officially somewhere other than a blog post AND update KB2775067 accordingly to state that the validation requirement is waived. Otherwise there will be conflicting information out there which is bad for everyone including Microsoft. Microsoft needs to stop that nonsense right in its tracks.
- Update any templates and Wizards accordingly.
When and if these things happen, by all means, automate away in Azure even for production!
Need help with your availability solutions, especially if you are looking at any of the public clouds? Contact us today and we can kickstart your projects into high gear.
By: Allan Hirt on February 8, 2019 in Disaster Recovery, Downtime, Outage | No Comments
Disaster recovery is in the news this week for all the wrong reasons.
Stop me if you’ve heard this story before. A major company – in this case a financial institution – is having a technical outage for not only the first, but the second time in less than a week. Assuming you’re not working for Wells Fargo or one of their customers, chances are you had a much better time these past few days. These are their recent tweets as of sometime in the afternoon Thursday, February 7. This post went live on Friday morning the 8th and Wells Fargo is still down – basically two days of downtime.
Figure 1. Wells Fargo acknowledging the problem
Ouch. The last tweet is not dissimilar to a British Airways outage back in 2017. This tweet from Wells Fargo claims it is not a cybersecurity attack. Right now that’s still not a lot of comfort to anyone affected.
Figure 2. It’s not a hack!
Their customers are not enjoying the outage, either. Three examples from Twitter:
Figure 3. Customer issue due to the outage
Figure 4. Day two …
Figure 5. More impact
Imagine the fallout: direct deposit from companies may not work, which means people do not get paid. People can’t access their money and do things like pay bills. For some, this could have a lifelong impact on things like credit ratings if you miss a loan, mortgage, or credit card payment. There is not just the business side of this incident.
There were, of course, snarky replies about charging Wells Fargo for fees. I wanted to call out an honest-to-goodness impact to someone along with the possible longer term fallout to Wells Fargo themselves. I feel for the person who tweeted, but the reality is Chase or Citi could have an outage for some reason, too. I don’t think anyone is 100% infallible. I have no knowledge of what those other financial institutions have in place to prevent an outage.
I’m doing what I do now largely because I lived through a series of outages similar to what Wells Fargo is most likely experiencing this week. Those outages happened over the course of about three months. They are painful for all involved. I worked many overnight shifts, a few 24+ hour days … well, you get the picture. I learned a lot during that timeframe, and one of the biggest lessons learned is that not only do you need to test your plans, but you have to be proactive by building disaster recovery into your solutions from day one. All companies, whether massive like Wells Fargo or a small shop, have the same issues. The major difference is economy of scale.
Everyone has to answer this one question: how much does downtime cost your business – per minute, hour, day, week? That will guide your solution. Two days in a row of a bank being down is … costly.
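To make that question concrete, here is a small back-of-the-envelope sketch in Python. The per-minute figure is a made-up placeholder, not a number from Wells Fargo or any real business; plug in your own.

```python
MINUTES_PER_HOUR = 60
MINUTES_PER_DAY = 24 * MINUTES_PER_HOUR
MINUTES_PER_WEEK = 7 * MINUTES_PER_DAY


def downtime_cost(cost_per_minute: float, minutes_down: float) -> float:
    """Total cost of an outage, assuming a flat cost for every minute down."""
    return cost_per_minute * minutes_down


# Hypothetical rate: what one minute of downtime costs the business.
rate = 1_000.0

for label, minutes in [
    ("minute", 1),
    ("hour", MINUTES_PER_HOUR),
    ("day", MINUTES_PER_DAY),
    ("week", MINUTES_PER_WEEK),
]:
    print(f"1 {label} down: ${downtime_cost(rate, minutes):,.0f}")
```

A flat rate is a simplification, since real costs tend to climb the longer an outage lasts, but even this rough number is usually enough to start the conversation about what a solution is worth.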
There may be another impact, too – can Wells Fargo systems handle the load that will happen when the systems are online again? I bet it will be like a massive 9AM test. Stay tuned!
You don’t want to have a week like Wells Fargo. Take your local availability as well as your disaster recovery strategy seriously. I’ve always found the following to be true: most do not want to put in place proper disaster recovery until their first major outage. Unfortunately at that point, it’s too late. Once the dust settles, they’ll suddenly buy into the religion of needing disaster recovery. Let me be clear – I do not know what the situation is over at Wells Fargo; zero inside information here. Did they have redundant data centers and the failover did not work? Were some systems redundant but not others? Were there indications long before the downtime event that they missed? There are more questions than answers, and I’m sure we’ll find out in good time what happened. The truth always comes out.
A Different Kind of Mess – Quadriga
This week also saw a very different problem as it relates to finance: the death of Quadriga CEO Gerald Cotten. Why is his passing away impactful? When he died, apparently he was the only one who knew or had the password for the cryptocurrency vault. There are other alleged issues I won’t get into, but with his laptop encrypted (apparently his wife tried to have it cracked), literally no one whose money was in the vault can get it. The amount stored in there I’ve seen in various stories has been different, but it is well north of $100 million. The lesson learned here is that someone else always needs to know how to access systems and where keys are. Stuff happens – including death. There are real world impacts that can happen when systems cannot be accessed.
Watch Your Licenses
Since we are talking about outages, the Register published a story today about how a system would not come up after routine maintenance due to the software license expiring. I have been through this with a customer. We were in the middle of a data center (or centre, for you non-US folks) migration. We had to reboot a system and SQL Server would not come up. I looked in the SQL Server log. Lo and behold, someone had installed Evaluation Edition and never converted it to a real license. It also meant the system was never patched and never rebooted for a few years! Needless to say, there was no joy in Mudville. That was a very different kind of outage.
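A check along these lines would have flagged that server long before the reboot. This is a minimal sketch in Python under a few assumptions: the edition string and install date are hypothetical inputs you would collect yourself, the check simply looks for the word “Evaluation” in the edition string, and the 30 day warning window is an arbitrary choice. The 180 day figure is the length of the SQL Server Evaluation Edition trial.

```python
from datetime import date, timedelta

EVALUATION_PERIOD_DAYS = 180  # length of the SQL Server Evaluation Edition trial
WARNING_WINDOW_DAYS = 30      # assumed threshold; pick whatever suits your process


def evaluation_days_left(installed_on: date, today: date) -> int:
    """Days until the evaluation period ends (negative once it has expired)."""
    expires_on = installed_on + timedelta(days=EVALUATION_PERIOD_DAYS)
    return (expires_on - today).days


def needs_attention(edition: str, installed_on: date, today: date) -> bool:
    """Flag evaluation installs that have expired or are about to."""
    if "Evaluation" not in edition:
        return False
    return evaluation_days_left(installed_on, today) <= WARNING_WINDOW_DAYS


# Hypothetical inventory entry: an evaluation install that is years old.
print(needs_attention("Enterprise Evaluation Edition (64-bit)",
                      date(2016, 3, 1), date(2019, 2, 8)))
```

Run against an inventory of instances on a schedule, something this simple turns a surprise outage into a line item on a to-do list.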
The Bottom Line
If you do not want to be another disaster recovery statistic and want to prevent things like the above from happening, contact SQLHA today to figure out where you are and where you need to be.
By: Allan Hirt on February 7, 2019 in Conference, SQLbits, Training | No Comments
Hard to believe that SQLBits 2019 is only a few weeks away. I’m looking forward to speaking there again. It’s always an honor to be selected and Bits is one of my favorite conferences to attend if I can make it. This year, it’s in Manchester which is somewhere in the UK I’ve yet to visit, so I’m excited about that as well.
I’m currently finalizing the content and the lab for the Training Day I will be delivering on Thursday, February 28 – Modern SQL Server Availability Architectures. Hopefully the venue can support the lab, so we’ll see. That aspect is completely beyond my control, but what I have cooked up should hopefully be fun if we get to do it. You’ll need to bring your own laptop and make sure you run the test link that is linked in the description. Last I checked, seats were filling up quickly, so don’t miss out!
I’ll also be doing a session on Friday, March 1 – Common Troubleshooting Techniques for AGs and FCIs at 14:25 (2:25 PM for those of you on my side of the pond).
If you haven’t registered already, what are you waiting for? If you have, see you there. Come up and say hello!