By: Allan Hirt on April 19, 2017 in Availability Groups, SQL Server 2017
Well, it’s official – SQL Server v.Next, which has been in preview, is going to be SQL Server 2017. This was announced at the DataAmp event. While we do not have a release date, it’ll be sooner rather than later. CTP 2.0 was just released today, so go and play with it.
There are two improvements for AGs in CTP 2.0 over CTP 1.4 for listeners on Linux:
- The listener will now be persisted post-failover (I have yet to test this … will do so soon)
- Read-only routing should be working in this release
In other words, the listener should work properly now in CTP 2.0.
Also, there is a new CLUSTER_TYPE called External. I’ll blog more about this another time, but this is a follow-on to the AGs that do not need an underlying WSFC or Pacemaker cluster that I blogged about here.
That’s it – not a long post, but all good stuff.
By: Allan Hirt on April 11, 2017 in Always On, Availability Groups, High Availability, SQLHAU
Happy Tuesday. I just wanted to make sure everyone took advantage of the great discount SQLHA is offering on our two upcoming classes in Chicago this August:
The Boot Camp features our world famous labs which everyone raves about. There is no better way to learn than with hands on experience that is custom designed for this course in addition to the instructional content.
Through this Saturday, April 15, both classes are on sale for 25% off the list price. After that, there is still a discount, but this is an Early Bird special. This means that SQL Server Fundamentals would be $521.25 (list price: $695) and the Boot Camp $1646.25 (list price: $2195).
For an even bigger discount, sign up for both classes and you will get 30% off the total price, which means that for four days of SQL Server availability training, you’ll pay $2023 – less than the list price of just the Always On Availability Groups Boot Camp.
Click on the links above for more information and how to register.
By: Allan Hirt on February 22, 2017 in Always On, Availability Groups, Linux, Pacemaker, SQL Server V.Next, Windows Server Failover Cluster
UPDATED 2/22/17 in the afternoon
With a lot of the focus seemingly going to the Linux version of SQL Server v.Next (including the inclusion of Always On Availability Groups [AGs] in the recently released CTP 1.3), I don’t think a lot of love is being showered on the Windows side. There are a few enhancements for SQL Server availability, and this is the first of some upcoming blog posts.
A quick history lesson: through SQL Server 2016, we have three main variants of AGs:
- “Regular” AGs (i.e. the ones deployed using an underlying Windows Server failover cluster [WSFC] requiring Active Directory [AD]; SQL Server 2012+)
- AGs that can be deployed without AD, but using a WSFC and certificates (SQL Server 2016+ with Windows Server 2016+)
- Distributed AGs (SQL Server 2016+)
SQL Server v.Next (download the bits here) adds another variant which is, to a degree, a side effect of how things can be deployed in Linux: AGs with no underlying cluster. In the case of a Windows Server-based install, this means that there could be no WSFC, and for Linux, currently no Pacemaker. Go ahead – let that sink in. Clusterless AGs, which is the dream for many folks (but as expected, there’s no free lunch which I will discuss later). I’ve known about this feature since November 2016, but for obvious reasons, couldn’t say anything. Now that it’s officially in CTP 1.3, I can talk about it publicly.
Shall we have a look?
I set up two standalone Windows Server 2016 servers (vNextN1, vNextN2). Neither is connected to a domain. Figure 1 shows the info for vNextN1 (for all pictures, click to make bigger).
Figure 1. Standalone server not domain joined
Using Configuration Manager, I enabled the AG feature. Prior to v.Next, you could not continue at this point since there was no WSFC; you would be blocked and get an error. However, in v.Next, you will get what is seen in Figure 2. It still indicates that there is no WSFC, but it happily allows you to enable the AG feature.
Figure 2. Enabling the AG feature in v.Next
After enabling and restarting the instance, you can clearly see in Figure 3 that the AG feature is enabled. We’re not in Kansas anymore, Toto.
Figure 3. AG feature is enabled, but no WSFC
Just to prove that there is no magic, if you look in Windows Server, the underlying feature needed for a WSFC is not enabled which means a WSFC cannot be configured. This is seen in Figure 4.
Figure 4. Failover clustering feature is not installed – no WSFC!
In SQL Server, configuring this is done via T-SQL and is similar to how it is done for AD-less AGs with Workgroup WSFCs in SQL Server 2016/Windows Server 2016. In other words, you’re using certificates. In addition to certificates, there is a new clause/option (CLUSTER_TYPE) in the CREATE and ALTER AVAILABILITY GROUP T-SQL. The documentation shows a Linux example of the CLUSTER_TYPE syntax, but I adapted the syntax I’ve been using for the AD-less AGs with certificates, since it is basically the same. I did not use seeding (you can if you want); I manually restored the database AGDB1. I created an AG called AGNOCLUSTER. This can be seen in Figures 5 and 6.
Figure 5. vNextN1 as the primary replica of AGNOCLUSTER
Figure 6. vNextN2 as the secondary replica of AGNOCLUSTER
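For reference, a clusterless AG like AGNOCLUSTER might be created with something along these lines. This is a sketch only, based on CTP 1.3 behavior: the endpoint port and the exact WITH options are assumptions on my part, and the certificate-based mirroring endpoints must already exist on both instances, just as with the AD-less AGs in SQL Server 2016.

```sql
-- Sketch, not gospel; port 5022 and the replica options are assumptions.
-- Certificates and mirroring endpoints are assumed to be configured already.
CREATE AVAILABILITY GROUP [AGNOCLUSTER]
WITH (CLUSTER_TYPE = NONE)           -- the new option: no WSFC, no Pacemaker
FOR DATABASE [AGDB1]
REPLICA ON
    N'vNextN1' WITH (
        ENDPOINT_URL = N'TCP://vNextN1:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE = MANUAL,
        SEEDING_MODE = MANUAL,       -- I restored AGDB1 by hand
        SECONDARY_ROLE (ALLOW_CONNECTIONS = ALL)),
    N'vNextN2' WITH (
        ENDPOINT_URL = N'TCP://vNextN2:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE = MANUAL,
        SEEDING_MODE = MANUAL,
        SECONDARY_ROLE (ALLOW_CONNECTIONS = ALL));
```

On the secondary, the join also carries the new option, and the manually restored database still has to be added:

```sql
-- Run on vNextN2 after restoring AGDB1 WITH NORECOVERY
ALTER AVAILABILITY GROUP [AGNOCLUSTER] JOIN WITH (CLUSTER_TYPE = NONE);
ALTER DATABASE [AGDB1] SET HADR AVAILABILITY GROUP = [AGNOCLUSTER];
```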
To support this new functionality, there are new columns in the DMV sys.availability_groups – cluster_type and cluster_type_desc. Both can be seen in Figure 7. You will also get an entry in sys.availability_groups_cluster with this new cluster_type (also a new column there).
Figure 7. New columns in sys.availability_groups
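A quick way to see the new columns for yourself is a simple query against the DMV:

```sql
-- cluster_type / cluster_type_desc are new in v.Next (CTP 1.3+)
SELECT name, cluster_type, cluster_type_desc
FROM sys.availability_groups;
```

On a clusterless AG like the one above, cluster_type_desc comes back as NONE rather than the WSFC you would see on a traditional deployment.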
So what are the major restrictions and gotchas?
- This new AG variant is NOT considered a valid high availability or disaster recovery configuration without an underlying cluster (WSFC for Windows Server or, currently, Pacemaker on Linux). It is meant more for read-only scenarios, which makes it more of an Enterprise Edition play than a Standard Edition one. I cannot stress enough that this is NOT a real HA configuration.
- UPDATE – A major reason this is not a real HA configuration is that there is no way to guarantee zero data loss without you first pausing the primary and ensuring that the secondary replica is in sync (or replicas, as the case may be).
- Having said #1, you can do a manual failover from a primary to a secondary. That is true even in the case of an underlying server failure, which is exactly why this is not a true availability configuration: there is no cluster, and no mechanism (i.e. sp_server_diagnostics feeding one) to detect and handle the failure automatically.
- Since there’s no underlying cluster, you can’t have a Listener. This should be painfully obvious. This also means that you will connect directly to any secondary replica for reading. This makes it possibly less interesting for read only scenarios, but again, did you expect you’d get everything?
- Since there is no Standard Edition version of the CTP, it is unknown if this will work with Standard Edition in SQL Server v.Next. I would assume it will, but we’ll see when v.Next is released.
- UPDATE – This also can most likely be used for migration scenarios (arguably it will be the #1 use), which I will talk about in a Windows/Linux cross platform blog post soon.
- UPDATE – This is not the replacement for database mirroring (DBM). That replacement is/was basic AGs in Standard Edition in SQL Server 2016, even though they require a WSFC. You get so much more that you really should stop using DBM. It’s been deprecated since SQL Server 2012 and they could pull it at any time (and I’m hoping it’s gone in v.Next).
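To put the manual failover point into practice: since there is no cluster to coordinate, the sequence is on you. A hedged sketch, with the demote/promote steps being my reading of the CTP-era documentation for CLUSTER_TYPE = NONE (verify against the docs for whatever build you are on):

```sql
-- 1. On the primary: stop new work and confirm the target secondary is
--    SYNCHRONIZED first; nothing here guarantees zero data loss for you.
ALTER AVAILABILITY GROUP [AGNOCLUSTER] SET (ROLE = SECONDARY);  -- demote

-- 2. On the target secondary: promote it. Despite the scary name, this is
--    the supported failover path when there is no cluster underneath.
ALTER AVAILABILITY GROUP [AGNOCLUSTER] FORCE_FAILOVER_ALLOW_DATA_LOSS;
```

If the secondary was not fully synchronized when you ran step 2, you lost data. That, in a nutshell, is why this variant is not HA.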
Keep in mind this is how things are now, but I don’t see much, if anything, changing whenever RTM occurs.
I won’t really be talking about this configuration this weekend at SQL Saturday Boston or at SQL Saturday Chicago in March, but I will be talking about it a little bit at both the SQL Server User Group in London on March 29th and at the SQL Server User Group in Dublin on April 4. I will be covering this configuration in more detail as part of my Training Day at SQL Bits 16 – “Modern SQL Server Availability and Storage Solutions” (sign up for it today – seats going fast!). It will also be part of my upcoming new SQLHAU course Always On Availability Groups Boot Camp coming up in August, and incorporated into the 4-day Mission Critical SQL Server class (next scheduled delivery is in December).
By: Allan Hirt on February 14, 2017 in Administration, Business Continuity, Data Loss, Disaster Recovery, High Availability, Mission Critical
I’ve been in the availability game a long time. It seems like at least once a week there is a story about a fairly major outage somewhere in the world due to software, hardware, public cloud failures, human error/incompetence, DDoS, hacking, or ransomware. In the dark ages of IT, before Twitter, Facebook, or any of the other social media platforms existed, you only sometimes heard about these events. Today, we hear about them in real time. Unfortunately, they can become PR nightmares, too. Below is a sample of some recent events, all of which happened after January 1, 2017:
The costs associated with these problems also seem to be increasing. Computer problems are no longer just small “glitches”; there are real consequences, whether they are man-made outages, systems going down due to some type of hardware or software failure, or something else entirely. The elephant in the proverbial room is the notion that this is all caused by “the (public) cloud”. <insert picture of old man shaking his fist> This is not true in many cases, but let’s be clear: public cloud providers have had their missteps, too. Technology is only as good as the humans implementing it and the processes around them. Technology won’t save you from stupidity.
Let’s examine some of these very public failures.
The ATO has had some high profile outages over the last year, a lot of them storage related. Their latest run-in with trouble was just a few days ago. This outage didn’t have data loss associated with it, but if you look at the ATO’s statement of February 8th, it’s pretty clear they are unhappy with their vendor. They’ve even hired an additional firm to come in and get to the bottom of things. In other words: in addition to all of the inconvenience and impact to end users and taxpayers, they’re spending more money to figure out what went wrong and hopefully fix the problem.
We’re database folks. Storage is fundamental to us in three ways: capacity (i.e. having enough space), performance (helps get those results back quicker, among other things), and availability (no storage, no database). You need to know where you are in relation to all of those for your database solutions – especially the mission critical ones. You also need to have a good relationship with whoever is responsible for your storage. I’ve been involved with and heard too many horror stories around storage outages and the mess they caused, some of which could have been prevented with good communication.
Ah, GitLab. What started out as something well intentioned (putting servers up in a staging environment to test reducing load), became the perfect storm. Before I go any further, let me say up front I applaud their transparency. How many would admit this?
Trying to restore the replication process, an engineer proceeds to wipe the PostgreSQL database directory, errantly thinking they were doing so on the secondary. Unfortunately this process was executed on the primary instead. The engineer terminated the process a second or two after noticing their mistake, but at this point around 300 GB of data had already been removed.
Hoping they could restore the database the engineers involved went to look for the database backups, and asked for help on Slack. Unfortunately the process of both finding and using backups failed completely.
Read the “Broken Recovery Procedures” section of that postmortem. It is very telling. Things went nuclear and recovery was messy, but in my opinion, it was something that could have been avoided. Failures of this magnitude are almost always rooted in poor processes – it’s something I see time and time again. Kudos to GitLab for owning it and committing to fixing them, but assumptions made along the way (and you know what they say about assumptions …) helped seal their fate. This cautionary tale highlights the importance of making backups and, ultimately, of restores.
Wait, There Are Limitations?
A few of these tales of woe are related to not knowing the platforms used.
I can’t say whose fault it was (the developers? the person who purchased said product? there could be lots of culprits …), but look at Instapaper. Check out the Root Cause section of the post mortem: they went down because they hit, and then exceeded, the 2TB file size limit that existed in an older version of their underlying platform. That is something anyone who touched that solution should have known about and had specific monitoring in place for, so that when things got close, it could be mitigated. I call a bit of shenanigans on this statement, but applaud Brian for taking full responsibility at the end:
Without knowledge of the pre-April 2014 file size limit, it was difficult to foresee and prevent this issue.
It’s your job to ask the right questions when you take over (hence the accountability part). Now, to be fair, it’s a legitimate gripe assuming what is said is true and RDS does not alert you. Shame on Amazon if that is the case, but situations like that are the perfect storm of being blind to a problem that is a ticking time bomb.
Similarly, Code.org suffered the same fate. I’m not a developer by nature, but even I know that 4 billion lines of code is not a lot, especially when you have a shared model. Their own webpage alludes to over 20 billion lines of code written on the platform. Their issue is that they were using a 32-bit index which had a max of 4 billion rows of coding activity, and had no idea they were hitting their limit. I would bet that some dev along the way said, “Hey, 4 billion seems like a lot. We’ll never hit that!” Part of the fix was switching to a 64-bit index which holds more (18 quintillion rows), but famous last words …
On the plus side, this new table will be able to store student coding information for millions of years.
A Very Real World Example
In the US right now, there is the chance of a major catastrophe in Oroville, CA because of the Oroville Dam. There is nothing more serious than possible loss of life and major impact on human lives. I was reading an article “Alarms raised years ago about risks of Oroville Dam’s spillways” on the San Francisco Chronicle site, and like a lot of other things, it appears that this all could have been avoided. Of course, with things of this nature, there’s a political aspect and a bit of finger pointing (as there can be in businesses, too), but here is a quote I want to highlight:
Bill Croyle, the agency’s acting director, said Monday, “This was a new, never-happened-before event.”
It only takes once. I’ve seen this time and time again at customers for nearly twenty years. Nothing is a problem … until it’s a problem. No source control and you can’t roll back an application? No change management and updates kill your deployments and you have to rebuild from bare metal? You bet I’ve seen those scenarios and customers implemented source control and change management after.
Let me be crystal clear: hundreds of thousands of people and animals being displaced is very different than losing a few documents or some data. I have no real evidence they did not do the right repairs at some point (I’m not an Oroville Dam expert, nor have I studied the reports), and yes, there is always something that you do not know that can take you down, but statements like that look bad.
Pay Now or Pay Later – Don’t Be The Headline
Outages are costly, and fixing them takes more time, money, and possibly more downtime. Here are five tips on how to avoid being put in these situations and giving yourself a potential resume generating event:
1. Backups are the most important task you can do as an administrator. At the end of the day, that may be all you have when your fancy platform’s features fail (for any number of reasons, including implementing them incorrectly). More important than generating backups is testing them. You do not have a good backup without a successful restore. With very large sets of data (not just SQL Server databases – data can mean much more, such as files associated with metadata that is in a RDBMS), finding ways to restore is not trivial due to the costs (storage, time, etc.). However, when you are down and possibly closed for good, was the risk worth it only to find out you have nothing? No.
2. Technical debt will kill you. Having systems that are long out of support (and probably on life support) and/or on old hardware is a recipe for disaster. It is definitely a matter of when, not if, they will fail. Old hardware will malfunction. If you’re going to have to search eBay for parts to resuscitate a server, you’re doing it wrong. Similarly, for the most mission critical systems, planned obsolescence every few years (usually 3 – 5 in most companies) needs to be in the roadmap.
3. Have the right processes in place. Do things like test disaster recovery plans. How do you know your plans work if you do not run them? I have a lot to say here but that’s a topic for another day.
4. Assess risk and mitigate as best as possible. Don’t bury your head in the sand and act surprised when something happens. (“Golly gee, we’ve never seen this before!”)
5. Making decisions based on bad information and advice will hurt you down the road. I can’t tell you how many times Max and I have come in after the fact to clean up somebody else’s mess. I don’t relish that or get any glee from it. My goal is to just fix the problem, but if you have the wrong architecture or technology/feature implemented, it will cause you years of pain and a lot of money.
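To make tip #1 concrete, here is what “test your backups” looks like in plain T-SQL. The paths and logical file names are examples of mine, not anything from the incidents above:

```sql
-- Back up with checksums so page-level corruption is caught at backup time.
-- Path and database name are placeholders.
BACKUP DATABASE [YourDB]
TO DISK = N'D:\Backups\YourDB.bak'
WITH CHECKSUM, INIT;

-- VERIFYONLY is a quick sanity check, NOT a substitute for a real restore.
RESTORE VERIFYONLY
FROM DISK = N'D:\Backups\YourDB.bak'
WITH CHECKSUM;

-- The only true test: restore it somewhere (e.g. a test instance) and check it.
-- Logical file names here are assumptions; get yours from RESTORE FILELISTONLY.
RESTORE DATABASE [YourDB_Test]
FROM DISK = N'D:\Backups\YourDB.bak'
WITH MOVE N'YourDB'     TO N'D:\Test\YourDB.mdf',
     MOVE N'YourDB_log' TO N'D:\Test\YourDB_log.ldf',
     RECOVERY;
```

If that last RESTORE has never been run against your backups, you do not actually know you have a backup.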
Are you struggling with any of the above? Have you had or do you currently have availability issues? Contact us today so SQLHA can help you avoid becoming the next headline.