The Tube: A Paradigm for Capacity Management, Performance, Availability, and Disaster Recovery
Howdy everyone! I’m back in the good ol’ US of A after my little jaunt in Europe and trying to get caught up with the e-mail and work that piled up. I posted a bit about Germany in my last blog post, but I have to say that when I got back to the UK, SQLBits X was wonderful. Dare I say it was better than SQLBits VIII in Brighton last year. Thanks to everyone who came out for my Training Day and to my one session last Friday. I hope to attend and speak at another SQLBits next year.
Anyway, while I was in the UK I was up one night going over my presentation and lo and behold, I saw a show about the Tube (a.k.a. the London Underground). It was one episode of an ongoing series on BBC Two. As a train enthusiast, I was enthralled. I’ve been into trains since I was a little kid. The show is based around what happens behind the scenes. I then went online and saw I could watch previous episodes. By the time I checked out of my hotel room on Sunday I had seen all six available. It’s a fascinating look at all the things that need to be done to ensure that millions of people can continue to be transported every day. One episode in particular – number 4 – stuck out at me.
However, in spite of my own mind, which wanted to enjoy it only at a surface level, I started thinking about the day job and how similar it was. First and foremost, the Tube is in the middle of a 10 billion pound upgrade to improve service, and it will take years to complete. All of the work needs to be done while still servicing customers and, at the same time, looking to the future and increasing capacity. Of course there are going to be inconveniences and closures at times on parts of the system. This is something we need to account for in IT, too.
Currently I’m finishing up the HA chapter of the SQL Server 2012 Upgrade Guide, which will hopefully be out soon, and it was like kismet watching this. At some point you’re just going to incur downtime and the business is going to have to deal with it. You need to carefully schedule these outages, but they still may be an inconvenience. Seeing the Underground close lines on weekends to replace sections of track, and then stress out only hours before having to open on a Monday morning so people can get to work, is just like what we have to do. Any number of things can happen between the time you start and finish to slow you down or derail you.
Capacity management is one of those things very few do well out there, be it sizing for new systems or understanding where you are today so you can manage what you have right now. The Tube has to account for the number of riders every day, and each episode gave specific numbers on the strain on the system. I forget what they were referring to exactly (a station/line/whole system), but at one point they were estimating 4,000,000 riders by 2016, and they hit that in 2011. So even planning and estimating doesn’t help sometimes. All throughout the episodes they were giving statistics on riders and how the numbers have increased in recent years. It’s the same for us – acquire a company, hire more people, whatever – the more people that use a system, the higher the load. EVERYTHING needs to work right to be in balance or else it could all go south very quickly. Consider these two quotes from Episode 4:
“The Tube, in a sense, generates its own traffic. As soon as you upgrade something, as soon as you put in another couple of trains per hour, you find that the capacity is taken up … The more you expand, the more people use it.”
“As soon as you fix this congestion point, there’s another one along the line somewhere else to fix. So it’s a never ending task. What we have to sometimes do is deliver on the impossible.”
Do these quotes sound familiar to you? They should. It’s what you live every day. The first quote addresses what I was talking about above. The second quote is not unlike that traditional bottleneck triangle we always talk about with computer systems (I/O and disk, CPU, and memory). Fix one, and some other will appear. This is why you’re constantly monitoring and dealing with your systems. But of course you never need a DBA or sysadmin – SQL Server manages itself, right?
What was fascinating to watch was the central control around the number of trains in the system on any given line and how a backup or delay in one place causes a downstream effect. Also, if they put too many trains in play to handle demand and there is a backup, they may not have enough stations to accommodate them. I forget which line they were showing, but they had 31 trains in play but capacity for 27 at stations. Things stack up. This speaks to what I talk about with multiple clustered instances of SQL Server on a Windows failover cluster – you need to think of the failover condition. Could one node run all instances? And if not, which ones won’t run?
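To make that failover question a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the instance names, the memory figures, the priority scheme, and the `place_on_survivor` helper are all hypothetical, and memory is used as the only resource. A real sizing exercise would also look at CPU, I/O, and how your cluster is actually configured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    memory_gb: int   # memory the instance needs to run acceptably (assumed figure)
    priority: int    # lower number = more important to the business


def place_on_survivor(instances, node_memory_gb, reserved_gb=4):
    """Pretend every instance fails over to one node.

    Returns two lists of names: the instances that fit on the surviving
    node and the ones that do not. reserved_gb is memory held back for
    the operating system (a hypothetical value).
    """
    budget = node_memory_gb - reserved_gb
    fits, left_out = [], []
    # Place the most important instances first.
    for inst in sorted(instances, key=lambda i: i.priority):
        if inst.memory_gb <= budget:
            fits.append(inst.name)
            budget -= inst.memory_gb
        else:
            left_out.append(inst.name)
    return fits, left_out


# Hypothetical three-instance cluster where each node has 96 GB of memory.
instances = [
    Instance("SALES", 48, 1),
    Instance("HR", 16, 2),
    Instance("REPORTING", 64, 3),
]
fits, left_out = place_on_survivor(instances, node_memory_gb=96)
print("Runs on the surviving node:", fits)
print("Will not fit:", left_out)
```

With these made-up numbers the surviving node can hold SALES and HR, but REPORTING has nowhere to go. It is the same problem as 31 trains and room for 27: the math works on a normal day and falls apart the moment something backs up.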
Speaking of not needing a DBA, it was very interesting to see the daily maintenance they do on the trains and stations just to keep the system functioning – again, not unlike the tasks we need to account for. From picking up trash and cleaning windows to removing dead animals and fixing trains themselves (and everything in between, including fatalities), it ALL needs to be done. It’s not a matter of if, but when these things need to be done. You can’t ignore them or else you’ll have to take a bigger outage to fix it all.
Last, but certainly not least, testing loomed large. They were rolling out new trains on the Victoria line and elsewhere. They had two major problems after introducing them into the system:
- The new Victoria line trains have sensitive edges to ensure that no one gets caught in the door, but the settings were found to be too sensitive (we’re talking the difference of millimeters). That caused a lot of delays, which had the downstream effect I was talking about. If the doors can’t shut right at one station, the whole line gets backed up. The example shown was at the Seven Sisters station, but the problem manifested itself at Oxford Circus, where they had to shut the gates to keep anyone else from entering the station or getting onto the platform, causing unhappy riders in addition to the backups and delays. No one is happy there – including those responsible for the Tube. This one problem caused nearly 25% of all delays that one week.
- There was a software glitch on another set of new trains on another line that had been identified but still needed to be fixed.
The upper management from the Tube got on the phone with the builders (also in the UK) and the meeting was like some of those I had and continue to have with customers. I always say production should never be a guinea pig (especially when it comes to things like patching and updating; this happens often in clustered environments). Clearly the Tube management was not happy and this one quote stuck out when the manufacturer was talking about figuring out the problems, and the management retorted:
“I know that’s what we’re doing but we’re doing it in service!”
AMEN! That conversation led to the conclusion that things need to be better tested. Testing is crucial, and something I espouse all the time in anything I write or do. It’s one of the most important things you can do before going into production to ensure minimal-to-no downtime. ’nuff said.
The funny thing is to see the customers of the Tube saying things like “I bet I could do this in an hour”. I’m sure your end users and the business think that what you do is simple – just hit a button or install a widget and you get the /faster, /morecapacity, or /betteravailability switches. What they don’t see, for example, is the complexity of completely removing miles of track over a weekend, replacing them, and then having things working on Monday like nothing happened. What they do on the Underground and what we do is not always a walk in the park.
This series is HIGHLY recommended to watch if you’re in the UK (especially Episode 4). Look beyond the Tube and see its meaning for your daily job. Unfortunately we can’t watch it outside of the UK which is a real shame.