Could software failures have played a part in the Deepwater Horizon disaster?
Is enough attention paid to independent testing in a world where we’re increasingly dependent on software-controlled technology?
As always happens, the news media are losing interest in a story that obsessed them only a few weeks ago. The catastrophic effects of the BP oil spill in the Gulf of Mexico will be felt for a long time, perhaps for more than a decade, but the media have moved on. Maybe the long-term effects aren’t as obvious and dramatic as a flock of oil-sodden sea birds struggling pathetically to survive in their ruined habitat. But they are felt by the proprietors of, and workers in, a devastated tourist industry. They are felt by pensioners whose investments shrink as billions are diverted from what would have been profits into reparations and damages. And they are felt by all of us, because prices will rise as a result of a diminished appetite for deepwater drilling.
Although the media may be moving on, returning now and again whenever something stirs in the aftermath, more responsible interested parties will spend a long time and a lot of effort trying to figure out what caused the Deepwater Horizon explosion in April 2010: an explosion that, lest we forget, not only caused an environmental disaster but also claimed the lives of eleven people. Perhaps, despite their best efforts, investigators will never be able to tell us what happened, in which case we’ll simply have to be satisfied with speculation, educated guesswork, or whatever the culprits would have us believe (pace BP’s Accident Investigation Report1).
Informed speculation has started already, and it has struck a chord with Critical Software, a company that specialises in ensuring that software in safety-critical applications doesn’t fail. How many people know what’s involved in drilling the sea bed for oil? Far from being a purely mechanical process, it depends on a host of software-intensive control systems. Indeed, though it’s not widely appreciated, most of the sophisticated technology that shapes our lives depends on a lot of software. Sometimes software failures are a mere inconvenience. So you had to restart your PC? Big deal. But what if the pilot’s ‘glass cockpit’ packs up in the middle of your holiday flight? That gives a whole new meaning to the ‘blue screen of death’!
In the case of Deepwater Horizon, it’s clear from the Transocean interim report to the Waxman committee that control system software is falling under suspicion.2 Reports have already surfaced in the Houston Chronicle3 that “display screens at the primary workstation used to operate drill controls on the Deepwater Horizon, called the A-chair, had locked up more than once before the deadly accident.” Given the amount of embedded software in oil-rig systems, and the dozens of operations that are carried out under software control, it’s no wonder that software is getting the third degree.
Software is relatively easy to write. Reliable, safety-critical software isn’t much harder to write than the common or garden variety that’s powering the browser that you’re probably using to read this, but it requires a disciplined, standardised approach. To really get close to perfection, it requires independent testing so that the developers’ assumptions, and even egos, are not allowed to stand in the way of the quest for those last few elusive bugs.
Independent testing of something that’s already been tested in the normal way, by its developer, is undeniably an extra expense. It’s not a prohibitive expense, though, just the one most likely to be cut when money’s tight and financial control is wielded by those who don’t really understand the true value of what they’re cutting. Critical Software’s experience is that cost pressures are all too often allowed to bear on the safety-critical part of the software development process. Do we skip physical safety checks on trains and boats and ’planes? Not likely! So how is it OK to let finance directors and others of their ilk cut corners on the more abstract and less tangible factors in the safety equation?
The initial theme of this blog was the Deepwater Horizon incident, and it has to be said that the involvement of software errors as one of the causes is still only a matter for speculation. However, in case anyone doubts that software errors have the potential to cause disasters involving loss of lives, here are some proven examples involving mayhem out of all proportion to either the cost of the software or the cost of testing it properly.
Perhaps the biggest example is one that, thankfully, only nearly caused Armageddon: the 1983 false alert that was the direct result of a software bug in the Soviet early warning system4. The system reported that the US had launched five ICBMs. Thankfully, the duty officer at the early warning station had a “…funny feeling in my gut”, reasoning that if the US were really attacking it would launch far more than five missiles, and so he didn’t escalate the alert through the command chain. The trigger for this near-apocalyptic false alarm was traced to a fault in software that was supposed to filter out false positives caused by cloud-top reflections in satellite images.
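The duty officer’s gut feeling amounted to a plausibility cross-check that the software lacked. Here is a purely illustrative sketch of that idea in Python — none of the names, sensor types or thresholds come from the actual Soviet system; it simply shows the principle of refusing to escalate on a single uncorroborated sensor:

```python
# Illustrative only -- NOT the Soviet system's actual logic.
# Principle: never escalate a launch warning on one sensor type alone;
# require independent corroboration, as the duty officer did by hand.

from dataclasses import dataclass

@dataclass
class Detection:
    source: str        # hypothetical sensor type, e.g. "satellite-ir"
    track_count: int   # number of apparent missile tracks reported

def should_escalate(detections, min_sources=2):
    """Escalate only if at least `min_sources` independent sensor
    types agree that launches are in progress."""
    sources = {d.source for d in detections if d.track_count > 0}
    return len(sources) >= min_sources

# A lone satellite return (like the 1983 cloud-top reflection) is held:
assert not should_escalate([Detection("satellite-ir", 5)])
# The same tracks, corroborated by ground radar, would escalate:
assert should_escalate([Detection("satellite-ir", 5),
                        Detection("ground-radar", 5)])
```

The design point is simply that a single sensor channel, however sophisticated, is a single point of failure; independent corroboration is the software analogue of the independent testing argued for above.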
Each of us can only die once, and so at a personal level it matters little whether we go alone or with the rest of humanity. Here’s a reminder of a slightly less dramatic, but still lethal, episode that occurred more recently than the Cold War5. On February 25, 1991 an Iraqi Scud missile evaded the Patriot anti-missile defences and hit the American Army barracks in Dhahran. The incoming missile was not detected because of a software flaw that prevented accurate real-time tracking: a fixed-point arithmetic error made the system’s calculation of the current time inaccurate, the internal clock having drifted by roughly a third of a second since booting. The missile was too fast, and the system had been in continuous use far too long (over a hundred hours instead of the planned-for fourteen). The software was patched and the system recommissioned a day later, but the missile strike left 28 dead and around 100 wounded.
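The arithmetic behind that clock drift is easy to reproduce. A minimal Python sketch, based on the widely reported account of the bug (0.1 seconds stored in a fixed-point register with its binary expansion truncated, losing about 0.000000095 seconds per tick; the Scud closing speed of roughly 1,676 m/s is likewise an approximation from public accounts):

```python
# Reconstruction of the Patriot clock-drift arithmetic. The binary
# expansion of 0.1 is infinite (0.000110011001100...), so truncating
# it to a fixed number of bits loses a tiny amount on every tick.

FRACTION_BITS = 23                  # truncation point in this reconstruction
stored = int(0.1 * 2**FRACTION_BITS) / 2**FRACTION_BITS
error_per_tick = 0.1 - stored       # ~9.5e-8 s lost every 0.1 s tick

hours_up = 100                      # continuous operation before the strike
ticks = hours_up * 3600 * 10        # one tick per tenth of a second
drift = ticks * error_per_tick      # accumulated clock error, ~0.34 s

scud_speed = 1676                   # m/s, approximate Scud velocity
range_error = drift * scud_speed    # how far the track is displaced

print(round(drift, 2), round(range_error))   # ~0.34 s, ~575 m
```

A third of a second sounds negligible until it is multiplied by the speed of the target: the tracking window ends up displaced by over half a kilometre, and the system concludes there is no missile to engage.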
Until independent testing, by truly qualified testers, is recognised as sacrosanct within safety-critical developments, we’ll continue to have aircraft falling out of the sky, runaway cars, space-launch disasters and, yes, oil-rig disasters.
As we consider Deepwater Horizon, and reflect on numerous tragic and near-tragic disasters that have involved inadequate attention to software reliability, we find ourselves at the dawn of a new era of nuclear power generation. It’s time to start changing attitudes now.
1. “Deepwater Horizon – Accident Investigation Report,” executive summary, BP, 8 September 2010; http://www.bp.com/liveassets/bp_internet/globalbp/globalbp_uk_english/incident_response/STAGING/local_assets/downloads_pdfs/Deepwater_Horizon_Accident_Investigation_Report_Executive_summary.pdf
2. “Deepwater Horizon Incident—Internal Investigation,” draft report, Transocean, 8 June 2010, p. 15; http://energycommerce.house.gov/documents/20100614/Transocean.DWH.Internal.Investigation.Update.Interim.
3. B. Clanton, “Drilling Rig Had Equipment Issues, Witnesses Say—Irregular Procedures also Noted at Hearing,” Houston Chronicle, 19 July 2010; www.chron.com/disp/story.mpl/business/7115524.html.
4. “The Man Who Saved the World by Doing… Nothing,” Wired, September 2007; http://www.wired.com/science/discoveries/news/2007/09/dayintech_0926.
5. “Patriot-Scud Missile Tracking Error,” Institute for Mathematics and its Applications, University of Minnesota; http://www.ima.umn.edu/~arnold/disasters/patriot.html.