I know, I know: the week is technically up. But I got caught up in customer work. So I’ve got another post or two still coming. This one I felt compelled to finish first due to some recent events, most notably the Amazon EC2 cloud outage earlier this week that as of now is still affecting some websites. If you look at that article, some known sites were affected (such as Foursquare).
Know What You’re Protecting … and Its Cost
I often talk to technically focused people such as DBAs or SAN administrators who really have no idea what the applications or databases they are hosting and administering are for. On the surface, it shouldn’t matter – right? Wrong.
Whether you’re new to a company and stepping in as a DBA/IT guy to administer some systems or part of the team planning the architecture and infrastructure for a new application or solution, you have to know a bunch of things both technical and non-technical. Of course, there’s the standard stuff like capacity and your overall availability goals, service level agreements, and recovery objectives. However, none of that matters nor does what you put in place if it’s wrong for what you’re trying to protect.
For example: if you are deploying a system for a hospital and the information in it is needed 24×7 to make life-or-death decisions (for example, being able to see if a patient is allergic to any medicines), that’s a whole different ballgame than a 9 to 5 shop that has an accounting application that only needs to be up and running in that timeframe. Death is not a consequence you want on your head (nor would a hospital; someone would get sued if it really is wrongful death).
I find many companies want to have either “one size fits all” solutions or totally “packaged” offerings for everything. Need to deploy? Use <insert thing here>. This approach will work for most of your deployments, but realize there may be some that don’t fit the mold and no matter how much you try to shoehorn the thing in, it’s really a case of square peg, round hole. The hospital one is a good example of that.
Even if it isn’t loss of life, attach some monetary or other value (such as productivity) to downtime. If the solution costs less than the outage would, it should be a no-brainer to implement barring any technical issues in your organization. Let me say this: I know cost is an issue – especially in the economy we’re having now. I’m not ignoring it or saying it doesn’t matter. It does. I’m just pointing out that it cannot and should not be the only driving factor.
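To make that comparison concrete, here is a rough sketch of the arithmetic in Python. Every number in it is made up for illustration (the hours of downtime, the cost per hour, the solution price, and the share of downtime the solution is expected to prevent), and the function names are hypothetical. Plug in your own figures.

```python
def annual_outage_cost(hours_down_per_year, cost_per_hour):
    """Estimated yearly cost of downtime: hours lost times what an hour costs you."""
    return hours_down_per_year * cost_per_hour


def solution_pays_for_itself(solution_cost_per_year, hours_down_per_year,
                             cost_per_hour, expected_reduction):
    """Compare the yearly cost of a solution with the outage cost it should avoid.

    expected_reduction is an assumption: the fraction (0.0 to 1.0) of downtime
    you believe the solution will prevent.
    Returns (is_cheaper_than_outage, avoided_cost).
    """
    avoided = annual_outage_cost(hours_down_per_year, cost_per_hour) * expected_reduction
    return avoided > solution_cost_per_year, avoided


# Illustrative numbers only: 20 hours down per year at 10,000 per hour,
# and a 60,000-per-year solution expected to prevent 75% of that downtime.
worth_it, avoided = solution_pays_for_itself(
    solution_cost_per_year=60_000,
    hours_down_per_year=20,
    cost_per_hour=10_000,
    expected_reduction=0.75,
)
print(worth_it, avoided)
```

This only covers the part you can put a price on. It says nothing about consequences like the hospital example, which is exactly why cost cannot be the only input.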
Who Do You Trust?
Control is an issue when it comes to availability. When you own and control all the pieces of the puzzle such as the servers and data center, you live and die by your own sword.
Having said that, we’re in a day and age where it’s not always cost-effective to build and maintain your own data center. Many companies use a hosting facility, and oftentimes the people there are responsible for a good portion of the maintenance. The cloud is just another abstraction of this if you really think about it. However, with the cloud you are arguably giving up even more control over many aspects of deployment and administration since there are more layers of abstraction. Some cloud providers even do all the patching. (Disclaimer: I’m not anti-cloud. But I don’t feel it’s quite ready for prime-time, mission-critical applications for MOST customers. It may work for some, and I will say what I did with virtualization a few years ago: watch this space as it will affect you in one way or another at some point.)
The key thing to remember when you outsource (and I don’t necessarily mean to some center halfway across the world, although what I’m going to say applies there, too) is that by placing your web server/database/whatever there, it becomes an extension of you. By that, I mean wherever it’s hosted has to meet your documented requirements as if you were controlling everything in your own data center. When you outsource any of the work or responsibility, you need to have a very comfortable level of trust that those responsible for meeting the availability YOU agreed to can follow through (in addition to ensuring things like security and performance; I’m just focusing on uptime and/or the lack of downtime). If they can’t, it’s your neck on the line.
Years ago I was working onsite at a customer with many others during a disaster they were experiencing. Their servers were at a location that some people had to fly to in order to physically work on the servers. No big deal, right? If only that were true. When they got there in the evening, the place was closed up tighter than a drum – no 24×7 access. That was not the time to find that out. So they came back in the morning only to find they could go in one at a time. Again, not a surprise you want to discover in a server-down situation. This example demonstrates what happens when you make assumptions and don’t ask the right questions of your service providers.
Another tip: make sure the company you are outsourcing to supports your core business hours whether they are 24×7 or 9 to 5. It doesn’t matter if they are next door or three continents away: if you can’t contact them and expect a response that meets your SLAs, what’s the point? I can remember transitioning some things to a remote support center half a world away on one engagement, and whoever set that up never really told them they needed to work our hours. Huh? So let me get this straight: we were in the USA, but couldn’t get any support until after 8 PM? I don’t think so.
Plan B – Have One
A lot of my customers over the years never really put disaster recovery in place. They put some primary form of availability along with some backups, and that was that. Disaster recovery is an additional expense (and insurance policy) that often gets cut due to things like time and budget. I always say it’s a matter of when, not if, something will happen. Hardware failure, natural disaster, user fat fingering data – whatever happens, there will be some scenario that can and will take you down. You may not have been able to plan for it, and your primary form of availability may not cover it. Welcome to disaster recovery.
You can’t just have redundant systems in one location, or possibly even in one geographic area. If an earthquake hits Los Angeles and both of your data centers are there, do the math on that. Lose them both, possibly lose everything. Again, know what you’re protecting against and deal with it properly.
Even if your Plan B is just to have backups, do that. It’s better than nothing. For the really paranoid, you may even want to have Plans C, D, E, F, or more. No, I’m not kidding. It’s a lot of work, but if it saves your company’s proverbial bacon, who looks like a genius or hero? You.
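If it helps to picture Plans A through whatever as something more concrete, here is a minimal sketch in Python of trying each plan in order until one works. It is only an illustration of the idea, not a real failover implementation: the plan names and the functions `primary_site`, `secondary_site`, and `restore_from_backup` are hypothetical stand-ins, and the failure is simulated.

```python
from typing import Callable, Sequence, Tuple

# A plan is a label plus a function that either returns a result or raises.
Plan = Tuple[str, Callable[[], str]]


def run_with_fallbacks(plans: Sequence[Plan]) -> Tuple[str, str]:
    """Try each plan in order; return (plan name, result) for the first that works."""
    errors = []
    for name, action in plans:
        try:
            return name, action()
        except Exception as exc:  # a failed plan just means we move to the next one
            errors.append(f"{name}: {exc}")
    # Nothing worked: this is the "Plan A or bust" outcome you want to avoid.
    raise RuntimeError("all plans failed: " + "; ".join(errors))


# Hypothetical plans, for illustration only.
def primary_site() -> str:
    raise ConnectionError("primary data center unreachable")  # simulated outage


def secondary_site() -> str:
    return "serving from the secondary location"


def restore_from_backup() -> str:
    return "restored from last night's backup"


used, result = run_with_fallbacks([
    ("Plan A", primary_site),
    ("Plan B", secondary_site),
    ("Plan C", restore_from_backup),
])
print(used, "-", result)
```

The ordering is the design decision: the further down the list you go, the slower and more painful the recovery usually is, but any entry beats an empty list.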
Get To The Point, Will Ya?
So what inspired this post other than the SQLU HA/DR week? A Tweet I read pointed me to this forum posting for Amazon Web Services EC2. I’ll just repost the first post:
Life of our patients is at stake – I am desperately asking you to contact
Posted on: Apr 22, 2011 11:20 PM
Sorry, I could not get through in any other way
We are a monitoring company and are monitoring hundreds of cardiac patients at home.
We were unable to see their ECG signals since 21st of April
Could you please contact us?
Our account number is: xxxx-xxxx-xxxx
Our servers IDs:
Or please let me know how can I contact you more ditectly.
Oy vey! Where do I begin?
First and foremost: YOU NEVER WANT TO BE IN THIS GUY’S SHOES. EVER. Got it? Good. The guy (and I do feel bad for him and his customers; I truly hope no one died) took a proverbial beating in that thread and it was kinda justified. If you know that the outcome of you losing the ability to monitor could be someone’s death, have some sort of redundant provider and/or backup solution. Even if that solution is picking up the phone to check on those needing monitoring. Do something. I don’t even want to think about the legal implications of all of this in the unfortunate event someone did pass away (and I hope no one did) or, as some brought up in that thread, about compliance with whatever laws apply in your municipality/state/country/zone/whatever.
I’m not absolving Amazon completely here: they promised a service level the customer relied on and didn’t deliver. As a DBA or IT person, don’t do that either. It’s better to underpromise and overdeliver than the opposite. However, as I said above, even with Amazon’s promises, we’re talking life and death. We’re not talking about losing some fanboy’s music site for his favorite band for a few days. I assume Amazon was trying to get things back online, but lack of communication only makes things worse. If you want a good relationship with your customers, tell them the truth no matter how much it hurts. You may even take a ding or ten. We all have at some point in our careers.
One quick thing I’d like to add here: if you are in that guy’s shoes, there’s a very real chance you may need to update your resume sooner rather than later with a blunder like this. I’m not unsympathetic towards that original poster at all; quite the contrary. But it’s major oversights like that which will get you fired.
This example underscores most of the points I made in this blog post:
1. Know why you’re implementing something. Possible death is a whole different level of planning that needs appropriate backup plans. You won’t think of every scenario (see: the Japanese nuclear plant situation), but if you cover the most common ones, you won’t be like this guy. They had no Plan B (or C or D). It was Plan A or bust. It may have seemed like a perfect idea to stick this monitoring application in the cloud, but boy did it bite them in the rump when it failed to deliver the uptime promised. Bet they get some redundancy after this.
2. Communication is key. If you don’t want the mob showing up at your doorstep with pitchforks and torches, be concise, be accurate, and don’t put people off. Bad news is better than no news for hours or days on end.
3. Ask the right questions of your hosting/services/cloud providers if you are going to take that leap of faith and not do the work yourself. If they do not meet your standards, even if what you’re looking at from a cost perspective is better, it is most likely NOT the right solution for you. Cost cannot be the only factor. Again, I only point you to this poor soul from the forum posting: things look great on paper and may even seem like a bargain, but are they really?