It’s the day after Sandy had its way with the Northeast, and the results are not pretty. Part of the Boardwalk in Atlantic City? Destroyed. Property damage, especially in the coastal areas of New Jersey and the immediate New York City area? Lots of it. Flooding? Yes. I could go on, but you get the point. While there are still many people here in MA and New England without power, we thankfully did not see the brunt of the storm. Luckily, my parents and my sister (and her husband) are OK in New Jersey. Neither household is right near Atlantic City, and while there’s some damage around where they live, all things considered, life should hopefully return to some level of normal within a few days.
The pictures of the flooding in lower Manhattan and the surrounding area are riveting and devastating at the same time (look at LaGuardia Airport). It’s hard not to watch. I was in NYC just a few months ago (including some of the affected areas), so it’s incredible to see. What does bother me, though, are the yahoos who thought they would be OK. What part of mandatory evacuation do you not get? Before you call me unsympathetic, I fully understand that there are people who have no ability (due to money or other reasons) to just go. I’m not talking about them. I’m talking about the yahoos you saw on Nightline or in the news who stayed in their apartment buildings to ride things out. Irene wasn’t enough for you? In this war, Mother Nature will always win. Period. Darwin in full effect.
I focused more on the human aspect in yesterday’s blog post, and at this point all I can say is that my thoughts and prayers go out to those affected. Today I’m going to focus on business continuity in the face of a catastrophic event like Sandy.
If you watched any of the coverage, you saw that a good portion of Manhattan lost power at some point last night (and still doesn’t have it). Some of the outage was intentional, a preventative measure. You want to talk about a tough decision? That’s a disaster recovery plan in action. It’s like those movies – sacrifice one to save many. The hope was to minimize the effect of the flooding. We’ll see in time if that was the right call. But that brings other things into focus, such as the flooded data center that affected sites like Huffington Post and Gawker (not a commentary on either site, just stating facts), or the failing backup generator at NYU Langone Medical Center that forced more than 200 patients to be evacuated. More on that in a bit…
We’ve seen a lot recently about how cloud outages (*cough* EC2, and more than once *cough*) have brought many folks down, but things like generator failures and floods are a whole different level of downtime hell. In the case of the flooded data center, the flooding kept the backup generators from kicking in because the fuel could not be pumped to them (generators need fuel to run, in case you didn’t know). Oops. It’s like I say all the time – you can only account for so much. You’re not going to anticipate every “what if” scenario. Even backup plans fail, and you still need to account for that in some way.
This brings up a whole other dimension: crumbling, old, outdated infrastructure. There’s a lot of it out there in the world (roads, bridges, etc.), and not just in computing. The focus here will be NYC, where most of the infrastructure is old (some of it 100 or so years old). See this article, and specifically the third paragraph; the whole article is a good read. NYC learned a lot (unfortunately) from 9/11 and Irene, and I think as a country we’re more aware after things like Katrina, but NYC is still a mess today and will be for a few days. You can’t fight Mother Nature, but you can protect yourself better.
Just look at the tunnels (car and subway) that were flooded. That one picture of the water rushing into the station through the elevator shaft was amazing. It’s going to be days – if not weeks – before public transit service is fully operational in the greater NYC area. Once all of the saltwater is pumped out, everything needs to be inspected, repaired, and possibly replaced. Then it needs to be tested (ah, my old friend testing), not only to ensure that things work, but also that they are safe. I believe something like 8 million people use the MTA/PATH/LIRR/etc. every day.
Think about your current deployments and architectures, as well as your data centers. How much of them is built on older stuff that is no longer supported and that you’re keeping alive with spit and chewing gum? I would venture to say that everyone has some of that, even those who are fairly up to date on most things. There’s something in your infrastructure you’re worried about. You’re lying if you say otherwise. Governor Cuomo has the right idea – things are changing, so something needs to be done. You can’t rely on what worked 10 (or 100) years ago when the parameters change. Why do I bring this up? Well, according to Bloomberg, the NYU hospital’s board of directors knew that the generators were not up to snuff even before Sandy.
Here’s where I get angry. We’re not talking about processing orders for widgets. For some patients, losing electricity is literally a matter of life and death. This is why, when I work with customers, we need to understand the real effect of downtime. Architecting to avoid death is WAY different from architecting to avoid a missed transaction. Federal and other regulations for transactions in certain industries aside, death is no joke.
I see this all the time with customers, and I hear it when I speak to people at conferences, too. I’m not singling anyone out. In my opinion, what happened at NYU is NOT an acceptable risk. You need to mitigate that. Use all the excuses you want – and money is usually #1, and I get that doing things right is not cheap – but you can’t have your cake and eat it, too, here.
We all know of old SQL Server 2000 and Windows deployments that are still around because of those apps in your data center that “just can’t be migrated/upgraded.” I call malarkey. At some point, you need to replace them. Sometimes the problem is money: going to a new version isn’t free (software cost, cost to migrate, new hardware, training on the platform). Other times no one knows much about the system, so they believe it’s best to leave it alone. Still other times the software is no longer made and no one has spent time looking for a viable replacement. There are sometimes valid reasons not to retire a system, but at some point everything needs a plan to move forward. Otherwise you’ll fall so far behind that you’ll never catch up. Old is great until it breaks. Then it’s a nightmare.
Case in point: I was recently talking to someone who told me about a company that literally bought multiple systems and configured them identically. When one died, they just put another into service. I believe they are down to two, and they’re only now starting to look for a new solution. If I’m not mistaken, this is a SQL Server 6.5-based application. This stuff exists, folks… and it may be in your own backyard.
Now is a great time to assess where you are in terms of disaster recovery. Let us help you.