Today’s network outage with Delta has led to hundreds of canceled flights and thousands of displaced passengers. Systems are mostly back online now after taking a break for 6-ish hours and planes and people are moving again. Beyond the normal advice in this sort of situation – be nice to front-line crew, it isn’t their fault; call an international res line to try to sneak to the head of the queue, etc. – I have three other observations.
It wasn’t a power failure
I have no doubt that the power to Delta’s data center failed at some point; both the company and Georgia Power acknowledge that. But calling that the cause of the outage rather than a catalyst, a contributing factor or the onset event simply belies the reality of network and data center design. Yes, the power failed. And then so did many other systems and/or processes. Those systems were the real issue here. Data centers aren’t designed with a single power source. The best ones aren’t designed with a single point of entry for power into the building. Maybe it was the power transfer switch that failed or a generator didn’t come online or any of a number of other things. But it was not “a power failure” that caused the outage. That’s misdirection from someone trying to oversimplify. And also probably trying to save their job.
Watch for Unaccompanied Minors
In its afternoon update around 1:30p EDT Delta mentioned several “things customers should know” about the outage. At the end of the list was this item about unaccompanied minors:
5. Unaccompanied Minors that have not yet begun their travel today will not be accepted for their flights. These customers will be able to rebook without fees for a later date.
This is the company’s way of saying it still does not have full confidence in its ability to operate the schedule, so it is not willing to take on the liability of caring for kids, too. I applaud the airline for not taking that risk; that’s smart business at this point. But it is also something everyone else should watch as an indicator of the next steps in the recovery. When that restriction is lifted it means the company is far more confident in its operations.
Notes from @Delta about today's outage recovery. The last one is most significant IMO. #AvGeek pic.twitter.com/YvePzUP3AK
— Seth Miller (@WandrMe) August 8, 2016
This may be the first time Delta has to pay out to corporate customers based on its Operational Performance Commitment program. Launched a year ago, the program requires Delta to pay up if it has more domestic mainline flights late and canceled than both American Airlines and United Airlines in a given month. A day like today will certainly hurt on that front, and depending on how long the recovery stretches into the week, the impact could be significant. Then again, it is only for domestic mainline, and AA’s numbers haven’t been all that great even without systems issues, so odds are Delta will skate by here anyway unless things are miserable into Thursday or so.
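The payout trigger described above is simple to express: Delta owes only if its monthly count of late-plus-canceled domestic mainline flights exceeds both competitors’ counts. A minimal sketch (the function name and the single combined “bad flights” metric are my own simplification; the program’s actual accounting rules are not public in this post):

```python
def owes_payout(delta_bad: int, american_bad: int, united_bad: int) -> bool:
    """Return True when Delta's monthly late-plus-canceled domestic
    mainline flight count exceeds BOTH American's and United's --
    the trigger described for the Operational Performance Commitment."""
    return delta_bad > american_bad and delta_bad > united_bad
```

Note that a tie with either carrier would not trigger a payout under a strict “more than both” reading, which is part of why a single bad day may not be enough on its own.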
Recovery counts. At that, DL usually excels. I wish them well “from across the field”.
I wonder if flights to and from EU countries will generate a lot of payments based on the EU compensation requirements.
I cannot see a reason why this would not qualify under EU rules.
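For reference, the compensation amounts under Regulation (EC) No 261/2004 are banded by flight distance. A simplified sketch (this ignores the 50% reduction allowed for modest delays and the “extraordinary circumstances” defence; whether a carrier’s own IT outage qualifies as extraordinary is the open question the comments raise, though regulators have generally not treated internal failures that way):

```python
def eu261_compensation(distance_km: float, intra_eu: bool) -> int:
    """Banded compensation amounts (in euros) from Regulation (EC)
    No 261/2004, Article 7 -- simplified, per the caveats above."""
    if distance_km <= 1500:
        return 250
    if intra_eu or distance_km <= 3500:
        return 400  # all intra-Community flights over 1500 km cap here
    return 600
```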
A few thoughts, and these are based on my experience working at Worldspan 16 years ago (we shared a datacenter building with Delta Technologies). The power situation in that building was robust, even back then, so it’s clear the power outage must have damaged critical systems rather than simply cutting the feed. The diesel generator plant outside the building was enormous. Power entered the building at multiple points. DL and Worldspan handled a staggering amount of data, but I’m really confused why, in this age of redundant systems/datacenters/data, there wasn’t coverage for this.
Something really bad happened, let’s hope DL fesses up about what the actual problem was.
Yeah, a world-class datacenter will have multiple power inputs as well as multiple fiber lines for data I/O. It will also have inline uninterruptible power supply batteries to hold things up while the generators spin up. Even a dot-com startup ought to have a robust enough production infrastructure to have a hot-standby failover DC in a not-at-all-nearby location and should be able to migrate traffic to the backup in a handful of minutes.
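The “migrate traffic in minutes” pattern usually boils down to a probe-and-switch loop: health-check the primary site and only redirect traffic to the standby after several consecutive failures, to avoid flapping on a transient blip. A minimal sketch of one tick of that loop (the threshold and function names are illustrative assumptions, not any airline’s actual design):

```python
FAIL_THRESHOLD = 3  # consecutive failed probes before failing over

def choose_active(probe_primary, consecutive_failures: int):
    """One tick of a minimal failover loop: probe the primary DC and
    decide where traffic should go. Returns (site, updated_fail_count).
    A single failed probe does NOT trigger failover; only a streak does."""
    if probe_primary():
        return "primary", 0  # healthy probe resets the streak
    consecutive_failures += 1
    if consecutive_failures >= FAIL_THRESHOLD:
        return "standby", consecutive_failures
    return "primary", consecutive_failures
```

In practice the “switch” step would repoint DNS or a global load balancer at the standby site; the hard part, as the later comments note, is keeping the standby’s data in sync so it can actually take the load.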
Between this and the WN issues earlier, I’m contemplating offering myself up to airlines as a consultant 😉
Comparing to a startup is not necessarily fair these days given that most of those run on third-party hardware, too. And even EC2 has had some notorious outages in the past year or two.
Legacy systems are harder to mirror and manage in that sense. This is not easy stuff. But blaming it on a “power failure” is a joke of misdirection IMO.
Completely agree. Ancient systems are hard to replicate but even just based on their setup 16 years ago they should have been able to handle a power outage let alone with all the improvements they ‘should’ have made these past years. Something much worse/moronic happened.
I worked for a company a few years ago that was housed in a fairly new data center in Seattle. The Emergency Power Off buttons were scattered throughout the facility but were not covered. A contractor leaned on one, powered down the facility, then realized he’d done something bad and pulled the plunger button back out…and then didn’t fess up. Needless to say it took a while to bring systems back online, and they quickly put hard plastic bubbles over the buttons… Whatever DL’s issue was, it should not have taken that long to come back online.
This seems to be a rather common issue in the airline industry. It has happened to AS, DL, UA, AA, WN and many others, with some getting hit multiple times, whether it is a 30-minute blip or an outage that drags on for several hours. I am not an IT expert, but from an outsider perspective the most perplexing part for me is why they (or SABRE/Travelport) do not have a secondary standby DC in a different hub city.
My company owns 4 DCs across the US, and in each DC we have two full sets of PROD servers (A and B, also color coded as Green and Blue), each with its own cooling system, dedicated generator, water tank, data lines, separate air locks, etc. Everything in each DC is mirrored (identical servers, identical setup, identical applications, and so forth), so if there is an isolated incident in one of the PROD server rooms they can switch to the B servers in a matter of minutes. If it is a location-specific issue or possible severe inclement weather at our primary DC then it is a relatively easy, near-automatic re-routing to one of the other three DC sites to maintain network resiliency.
Granted 4 DCs with 8 sets of fully equipped PROD servers would be overkill for the airlines, but having at least a full set of standby servers at a different location seems to be a sensible course of action.
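The A/B mirroring described above is classic active-passive failover: the mirror only takes over if it is itself healthy, otherwise you stay put and escalate. A toy model of that switchover logic (the class, set names and health-reporting mechanics are illustrative assumptions, not the commenter’s actual system):

```python
class ProdCluster:
    """Toy active-passive model of mirrored A/B ("Green"/"Blue") PROD
    server sets: traffic moves to the mirror only when the active set
    fails AND the mirror is still healthy."""

    def __init__(self):
        self.active = "A"  # Green set serves traffic initially
        self.healthy = {"A": True, "B": True}

    def _mirror(self) -> str:
        return "B" if self.active == "A" else "A"

    def report_failure(self, server_set: str) -> None:
        """Mark a server set unhealthy; fail over if it was active
        and the mirror can take the load."""
        self.healthy[server_set] = False
        if server_set == self.active and self.healthy[self._mirror()]:
            self.active = self._mirror()  # switch in minutes, not hours
```

If both sets in a DC go down, this model simply has nowhere local to switch, which is where the cross-DC re-routing in the comment above would have to take over.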
Thanks for the analysis of what happened with Delta today; really appreciate the last point re minors. Will be a good way to assess future IRROPS if that caution appears.