Outages In An Increasingly Connected World
I’ve been in the availability game a long time. It seems like at least once a week there is a story about a fairly major outage somewhere in the world due to software, hardware, public cloud failures, human error/incompetence, DDoS, hacking, or ransomware. In the dark ages of IT, when Twitter, Facebook, and the other social media platforms did not exist, you only sometimes heard about these events. Today, we hear about them in real time. Unfortunately, they can also become PR nightmares. Below is a sample of some recent events, all of which happened after January 1, 2017:
- Instapaper (The Register’s story; full postmortem)
- Delta Air Lines (Washington Post’s story)
- Australian Tax Office (The Register’s story; what the ATO has to say)
- Licking County, Ohio (Newark Advocate story)
- BlueMix/SoftLayer (The Register’s story)
- Lloyds Bank (one of The Register’s stories)
- GitLab (their own postmortem)
- Code.org (what they had to say)
The costs associated with these problems also seem to be increasing. Computer problems are no longer just small “glitches”. There are real consequences, whether these are man-made outages, systems going down because of some type of hardware or software failure, or something else completely. The elephant in the proverbial room is the belief that this is all caused by “the (public) cloud”. <insert picture of old man shaking his fist> This is not true in many cases, but let’s be clear: public cloud providers have had their missteps, too. Technology is only as good as the humans implementing it and the processes around them. Technology won’t save you from stupidity.
Let’s examine some of these very public failures.
The ATO has had some high-profile outages over the last year, a lot of them storage related. Their latest run-in with trouble was just a few days ago. This outage didn’t have data loss associated with it, but if you look at the ATO’s statement of February 8th, it’s pretty clear they are unhappy with their vendor. They’ve even hired an additional firm to come in and get to the bottom of things. In other words: in addition to all of the inconvenience and impact to end users and taxpayers, they’re spending more money to figure out what went wrong and hopefully fix the problem.
We’re database folks. Storage is fundamental to us in three ways: capacity (i.e. having enough space), performance (helps get those results back quicker, among other things), and availability (no storage, no database). You need to know where you are in relation to all of those for your database solutions – especially the mission critical ones. You also need to have a good relationship with whoever is responsible for your storage. I’ve been involved with and heard too many horror stories around storage outages and the mess they caused, some of which could have been prevented with good communication.
Ah, GitLab. What started out as something well intentioned (putting servers up in a staging environment to test reducing load) became the perfect storm. Before I go any further, let me say up front I applaud their transparency. How many would admit this?
Trying to restore the replication process, an engineer proceeds to wipe the PostgreSQL database directory, errantly thinking they were doing so on the secondary. Unfortunately this process was executed on the primary instead. The engineer terminated the process a second or two after noticing their mistake, but at this point around 300 GB of data had already been removed.
Hoping they could restore the database the engineers involved went to look for the database backups, and asked for help on Slack. Unfortunately the process of both finding and using backups failed completely.
Read the “Broken Recovery Procedures” section of that postmortem. It is very telling. Things went nuclear and recovery was messy, but in my opinion, it was something that could have been avoided. Most failures of this magnitude are rooted in poor processes – it’s something I see time and time again. Kudos to GitLab for owning it and committing to fixing those processes, but assumptions made along the way (and you know what they say about assumptions …) helped seal their fate. This cautionary tale highlights the importance of making backups and, ultimately, of being able to restore them.
Wait, There Are Limitations?
A few of these tales of woe are related to not knowing the platforms used.
I can’t say whose fault it was (developers? person who purchased said product? There could be lots of culprits …), but look at Instapaper. Check out the Root Cause section of the postmortem: they went down because they hit, and then exceeded, the 2TB file size limit that existed in an older version of their underlying platform. That is something anyone who touched that solution should have known, and there should have been specific monitoring in place so that when things got close, it could be mitigated. I call a bit of shenanigans on this statement, but applaud Brian for taking full responsibility at the end:
Without knowledge of the pre-April 2014 file size limit, it was difficult to foresee and prevent this issue.
It’s your job to ask the right questions when you take over (hence the accountability part). Now, to be fair, it’s a legitimate gripe assuming what is said is true and RDS does not alert you. Shame on Amazon if that is the case, but situations like that are the perfect storm of being blind to a problem that is a ticking time bomb.
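The kind of monitoring I mean does not need to be fancy. Here is a minimal Python sketch of a threshold check against a known hard limit. It is not what Instapaper or RDS actually runs; the function name, the 70% and 90% thresholds, and the idea that you already collect the current size in bytes are all assumptions for illustration.

```python
from dataclasses import dataclass

TB = 1024 ** 4  # bytes in one (binary) terabyte


@dataclass
class LimitCheck:
    used_bytes: int
    limit_bytes: int
    fraction_used: float
    status: str  # "ok", "warn", or "critical"


def check_against_limit(used_bytes, limit_bytes, warn_at=0.70, critical_at=0.90):
    """Compare current usage with a hard platform limit.

    The thresholds are hypothetical defaults; choose values that leave
    enough lead time to migrate or archive before the limit is reached.
    """
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    fraction = used_bytes / limit_bytes
    if fraction >= critical_at:
        status = "critical"
    elif fraction >= warn_at:
        status = "warn"
    else:
        status = "ok"
    return LimitCheck(used_bytes, limit_bytes, fraction, status)


if __name__ == "__main__":
    # Example: a 1.5 TB file measured against a 2 TB file size limit.
    result = check_against_limit(int(1.5 * TB), 2 * TB)
    print(f"{result.fraction_used:.0%} of limit used: {result.status}")
```

The hard part is not the code. It is knowing the limit exists in the first place and wiring a check like this into whatever alerting you already have.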
Code.org suffered a similar fate. I’m not a developer by nature, but even I know that 4 billion lines of code is not a lot, especially when you have a shared model. Their own webpage alludes to over 20 billion lines of code written on the platform. Their issue is that they were using a 32-bit index, which had a max of 4 billion rows of coding activity, and they had no idea they were hitting that limit. I would bet that some dev along the way said, “Hey, 4 billion seems like a lot. We’ll never hit that!” Part of the fix was switching to a 64-bit index, which holds far more (18 quintillion rows), but famous last words …
On the plus side, this new table will be able to store student coding information for millions of years.
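The arithmetic behind those numbers is easy to check, and so is the headroom on any integer key. The Python sketch below assumes an unsigned key, which matches the 4 billion figure (a signed 32-bit key tops out at roughly 2.1 billion). The function names and the growth rate in the example are hypothetical, not Code.org’s actual numbers.

```python
UNSIGNED_32_MAX = 2**32 - 1  # 4,294,967,295 (about 4 billion)
SIGNED_32_MAX = 2**31 - 1    # 2,147,483,647 (about 2.1 billion)
UNSIGNED_64_MAX = 2**64 - 1  # 18,446,744,073,709,551,615 (about 18 quintillion)


def key_headroom(current_max_id, key_max):
    """Return (ids remaining, fraction of the key space already used)."""
    return key_max - current_max_id, current_max_id / key_max


def days_until_exhausted(current_max_id, key_max, rows_per_day):
    """Rough runway estimate, assuming a constant insert rate."""
    if rows_per_day <= 0:
        raise ValueError("rows_per_day must be positive")
    return (key_max - current_max_id) / rows_per_day


if __name__ == "__main__":
    # Hypothetical table: highest ID is 4.0 billion, growing 5 million rows a day.
    remaining, used = key_headroom(4_000_000_000, UNSIGNED_32_MAX)
    print(f"{remaining:,} IDs left ({used:.1%} of the key space used)")
    days = days_until_exhausted(4_000_000_000, UNSIGNED_32_MAX, 5_000_000)
    print(f"About {days:.0f} days of runway at the current rate")
```

Run something like this against the largest tables on a schedule, and “we had no idea” stops being a possible answer.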
A Very Real World Example
In the US right now, there is the chance of a major catastrophe in Oroville, CA because of the Oroville Dam. There is nothing more serious than possible loss of life and major impact on human lives. I was reading an article, “Alarms raised years ago about risks of Oroville Dam’s spillways”, on the San Francisco Chronicle site, and like a lot of other things, it appears that there is a chance this all could have been avoided. Of course, with things of this nature, there’s a political aspect and a bit of finger pointing (as there can be in businesses, too), but here is a quote I want to highlight:
Bill Croyle, the agency’s acting director, said Monday, “This was a new, never-happened-before event.”
It only takes once. I’ve seen this time and time again at customers for nearly twenty years. Nothing is a problem … until it’s a problem. No source control and you can’t roll back an application? No change management and updates kill your deployments and you have to rebuild from bare metal? You bet I’ve seen those scenarios, and those customers implemented source control and change management afterward.
Let me be crystal clear: hundreds of thousands of people and animals being displaced is very different than losing a few documents or some data. I have no real evidence they did not do the right repairs at some point (I’m not an Oroville Dam expert, nor have I studied the reports), and yes, there is always something that you do not know that can take you down, but statements like that look bad.
Pay Now or Pay Later – Don’t Be The Headline
Outages are costly, and fixing them after the fact takes more time, more money, and possibly more downtime. Here are five tips on how to avoid being put in these situations and giving yourself a potential resume generating event:
1. Backups are the most important task you can do as an administrator. At the end of the day, that may be all you have when your fancy platform’s features fail (for any number of reasons, including implementing them incorrectly). More important than generating backups is testing them. You do not have a good backup without a successful restore. With very large sets of data (not just SQL Server databases – data can mean much more, such as files associated with metadata that is in an RDBMS), finding ways to test restores is not trivial due to the costs (storage, time, etc.). However, when you are down and possibly closed for good, was it worth the risk only to find out you have nothing? No.
2. Technical debt will kill you. Having systems that are long out of support (and probably on life support) and/or on old hardware is a recipe for disaster. It is definitely a matter of when, not if, they will fail. Old hardware will malfunction. If you’re going to have to search eBay for parts to resuscitate a server, you’re doing it wrong. Similarly, for the most mission critical systems, planned obsolescence every few years (usually 3 – 5 in most companies) needs to be in the roadmap.
3. Have the right processes in place. Do things like test disaster recovery plans. How do you know your plans work if you do not run them? I have a lot to say here but that’s a topic for another day.
4. Assess risk and mitigate as best as possible. Don’t bury your head in the sand and act surprised when something happens. (“Golly gee, we’ve never seen this before!”)
5. Making decisions based on bad information and advice will hurt you down the road. I can’t tell you how many times Max and I have come in after the fact and cleaned up somebody else’s mess. I don’t relish that or get any glee from it. My goal is to just fix the problem, but if you have the wrong architecture or technology/feature implemented, it will cause you years of pain and cost you a lot of money.
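To make the first tip concrete, here is a minimal sketch of “no good backup without a successful restore”. It uses Python’s built-in sqlite3 module purely as a small, self-contained stand-in; it is not a SQL Server procedure, and the function names and file paths are hypothetical. The shape is what matters: take the backup, open the copy as if you were restoring it, and verify it before you call the job a success.

```python
import sqlite3


def table_row_counts(conn):
    """Return {table name: row count} for every user table."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    )]
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


def backup_and_verify(source_path, backup_path):
    """Back up a database, then reopen the copy and check it.

    Returns True only if the copy passes an integrity check and has the
    same per-table row counts as the source. Assumes nothing writes to
    the source while the check is running.
    """
    src = sqlite3.connect(source_path)
    try:
        dst = sqlite3.connect(backup_path)
        try:
            src.backup(dst)  # online copy of the whole database
        finally:
            dst.close()

        # Open the backup fresh, the way a real restore would.
        restored = sqlite3.connect(backup_path)
        try:
            integrity_ok = (
                restored.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
            )
            counts_match = table_row_counts(src) == table_row_counts(restored)
        finally:
            restored.close()
    finally:
        src.close()
    return integrity_ok and counts_match
```

A row count comparison is a weak check on its own. For a mission critical system, restore onto a separate server and run real application queries against the result.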
Are you struggling with any of the above? Have you had or do you currently have availability issues? Contact us today so SQLHA can help you avoid becoming the next headline.