Problems Before the Real Problem: The First Lessons of Apollo 13

Setting the stage

Between 1968 and 1972, NASA sent nine Apollo missions to the Moon. Reaching the Moon required a decade of work by 400,000 people, billions of dollars and an incalculable number of moving parts. Of the nine missions to the Moon, eight were spectacular successes and one was a spectacular near-disaster.

Apollo 13 launched on April 11th, 1970, and was meant to be the first Apollo mission dedicated to exploring the Moon, after Apollo 11 had made the first landing and Apollo 12 had improved on it with a pinpoint landing.

On April 13th, a little over two days after launch, one of the two large oxygen tanks in the Service Module exploded, crippling the spacecraft on its way to the Moon. Over the following days, NASA’s engineers worked feverishly together with the astronauts to overcome seemingly insurmountable problems and bring the crew back to Earth - safe and sound.

   

When looking at Apollo 13, its problems and their solutions, what stands out is not how much the astronauts, engineers, and managers improvised to solve unexpected problems but rather the reverse - how readily their existing procedures could be adapted to the unexpected.

Aptly, Apollo 13’s command module was named Odyssey, meaning a long voyage usually marked by many changes of fortune.

Even before Odyssey began its odyssey, it had an unusual start when it became the first mission where the flight crew was disrupted just before launch.

Three's a team

Flying a spacecraft is a complicated activity; so many things happen simultaneously that there are more buttons to press and procedures to follow than a single person can deal with at any one time. While Mercury, the first American crewed spacecraft, had been simple enough for one astronaut to handle, Apollo was a much larger and more complex beast.

The three components of the Apollo spacecraft: the Command Module (CM) with the astronauts, the Service Module (SM) with the supplies and main engine for the flight to the Moon, and the Lunar Module (LM) for the landing itself. During the launch, the LM was shielded within the Saturn V rocket. (NASA)
 

Instead of expecting a single astronaut to control the spacecraft from beginning to end, the work was divided between three astronauts: the Commander, the Command Module Pilot and the Lunar Module Pilot. Note that no astronaut is a mere co-pilot ;)

Now, each astronaut was able to specialize in their specific part of the mission (while remaining competent in other parts too), but the astronauts were also able to support each other. After the lightning strike which crippled Apollo 12 at launch, the astronaut who flipped the “SCE to Aux” switch on the Command Module panel was Alan Bean, because he had the easiest access to the critical switch, despite being the Lunar Module pilot.

Taking the idea a step further, in addition to having the three astronauts support each other during the flight, NASA also designated a “backup astronaut” for each one. The backup astronaut underwent nearly the same amount of training as the astronaut designated to fly and was sent to represent him at planning meetings (always fun!). Like an understudy, the backup astronaut was available to replace the prime astronaut at a moment’s notice, but nothing short of a crippling injury would ever make an astronaut give up his flight. While a few astronauts had been forced to cancel their flights and allow their backups to fly, this had always been the result of serious conditions (Deke Slayton had heart arrhythmia and Michael Collins had spinal surgery).

 The crew of Apollo 13: Lovell, Swigert, Haise. (NASA)

In the case of the ill-fated Apollo 13, Command Module pilot Ken Mattingly had been inadvertently exposed to rubella (German measles) just before the flight and was removed from the crew for medical reasons. Despite the usually mild effects of the disease, no doctor was going to take a chance on some exotic and unexpected side effect while the astronauts were about to land on the Moon!

Just three days before the scheduled launch, flight commander James Lovell and Lunar Module pilot Fred Haise set out on a last-minute training regimen with backup Command Module pilot Jack Swigert. One of the few inaccuracies of the 1995 movie Apollo 13 was its suggestion that the backup astronaut was less capable than the prime astronauts. The purpose of the last-minute training was not to check whether Swigert knew how to fly the spacecraft (which he unquestionably did) but to see how the entire crew functioned together as a unit.

The last-minute changes in the Apollo 13 crew and the way the crew functioned together are examples of how the astronauts themselves were part of the resilience and reliability of the mission.

One must often trade cost for reliability. Training six astronauts instead of three takes more time, money and other resources, but having backups available means that you can recover when the unexpected occurs.

And now the machine

While the flesh-and-blood astronauts were the most critical component of the flight (the whole point of the flight was for a man, not a machine, to walk on the Moon), the entire 110 meter (363 foot) stack was built out of millions upon millions of highly reliable engines, pipes, connectors, switches, pumps, gauges, valves, computer chips and more.

 
The second stage of the Saturn V rocket. Note the five J-2 engines, which supply a total of 1,150,000 pounds of thrust. (NASA)

A few minutes into the flight, a failure in one of the second-stage engines set off violent vibrations, and the wayward engine was shut down seconds before the flight would have been aborted. As it happens, the engines had been designed with these types of failures in mind and could “pick up the slack”.

The remaining four healthy engines continued firing for longer than planned and made up for the defective engine.

Not even ten minutes into its flight, Apollo 13 had validated the engineering practices of building reliable components by having backups for everything and anything - both man and machine.

In the modern development of reliable software services, we use many patterns and techniques to achieve the reliability we require. While there are many similarities between the requirements of getting a man on the Moon and reaching your chosen website, software development leads to many abstractions that are not relevant for a flight in space. For example, if there’s a temporary failure between your phone or laptop and the server you’re trying to reach, the local application can “invisibly” retry the request until it succeeds or decides that the failure is critical. With any luck, you won’t notice anything beyond a brief delay in bringing up the screen. Other practices carry over directly: while your developers and engineers are (almost certainly) not flying in space, you still train backups for the on-call support rotation.
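To make the "invisible retry" idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not how any particular app or library does it: the names TransientError, call_with_retries and make_flaky_fetch are hypothetical, and the "network call" is a stand-in that fails a fixed number of times before succeeding. The retry loop waits a little longer after each failure (exponential backoff) and gives up after a set number of attempts, which is the point where the failure is treated as critical.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a dropped connection)."""


def call_with_retries(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying on TransientError with exponential backoff.

    After max_attempts failed tries the last error is re-raised, which is
    the point where the failure is treated as critical.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            # Wait base_delay, 2x, 4x, ... plus a little random jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))


def make_flaky_fetch(failures_before_success):
    """Build a stand-in for a network call that fails a fixed number of times."""
    state = {"calls": 0}

    def fetch():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientError("connection dropped")
        return "page contents"

    return fetch, state


if __name__ == "__main__":
    fetch, state = make_flaky_fetch(2)
    result = call_with_retries(fetch)
    print(result, "after", state["calls"], "calls")
```

From the caller's point of view, the two failed attempts never surface; the only visible effect is a slightly longer wait. Capping the number of attempts matters just as much as retrying, since a failure that never clears has to be reported rather than hidden.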

Of course, there are differences - Apollo had to survive with what it launched, while software systems can deploy fixes mid-flight.

Now, having overcome Rubella before the flight even began and a failed engine during launch, the Apollo 13 astronauts and the NASA engineers in Houston could relax and enjoy a routine flight to the Moon, couldn’t they?

What else could go wrong?

Watch the movie or stay tuned for more articles to find out! 
