If Neil Armstrong Were Your Engineer, You Wouldn’t Need Alerts
Apollo 11 didn’t lack insight — but it only succeeded when the right action was chosen.
“That’s one small step for [a] man, one giant leap for mankind” — most people are familiar with these words, but how many know that Neil Armstrong was only seconds away from instead having to say, “We didn’t land on the Moon because the computer kept rebooting”?
| Apollo 11’s Eagle Lunar Module, in flight |
Minutes before the planned moment of the first landing, Apollo 11 was skimming the lunar surface at over 1,300 kilometers per hour. The astronauts Neil Armstrong and Buzz Aldrin were concentrating on flying their strange-looking lunar lander and making sure they weren’t running out of fuel. The last thing they needed was the computer flashing esoteric error messages on its display.
But at a critical juncture, the primitive dashboard of their onboard computer started raising alarms. Not one alarm, but many.
While the astronauts continued flying, the engineers at Houston Mission Control rushed to interpret the errors — the computer was rebooting mid-flight. And not once, but repeatedly.
| The “high resolution” DSKY display the astronauts used to navigate, here showing the 1202 alarm. |
1201! 1202!
With the astronauts low on fuel and perilously close to the surface of the moon, a decision needed to be made quickly. How to act? Use any of several abort options, change the flight plan, or continue flying with a misbehaving computer and risk a crash?
Fortunately, a rigorous training and operations regimen had prepared them for nearly every eventuality. The engineers had thick books, with descriptions of every possible behaviour or combination of reactions in the Apollo systems. Today, we’d call them Runbooks and use AI to search them. In 1969 they used a combination of index cards, documentation, and human memory to match the cryptic error code to the explanation. In a nutshell, the computer was saying “I’ve run out of resources, I’m rebooting and starting over!” again and again, every few seconds.
The 1201 and 1202 were “hidden” edge cases. They were documented in a manual but never expected during a live descent. There was no recovery script. It was the ultimate edge case occurring at the worst possible moment.
1201! 1202!
The astronauts tersely requested information, and the flight controllers needed to find an answer quickly — could the Lunar Module be trusted? So close to the Moon, the margin between landing successfully and crashing was razor thin.
1202! 1201!
Steve Bales, the guidance officer responsible for the flight computer, applied a simple analytical framework. The trigger to abort was not merely “the computer is behaving abnormally” but “the Lunar Module is going to crash”. Not merely an alert, but a risk materializing. He checked with his colleagues, each responsible for a different aspect of the spacecraft’s systems, and collated from them what today we would call the most important Key Performance Indicators (KPIs), or “Golden Signals”.
Is the Lunar Module flying at the right speed, direction, and angle? Is it in the right place, at the right time? Can the astronauts control it? Can the engineers on the ground keep in contact with the spacecraft? Does it have enough fuel to land?
While juggling the information, Bales used a secret weapon: a "cheat sheet" created during simulation training by another young engineer named Jack Garman. Together, they realized that as long as the alarms were intermittent and the "Golden Signals" of the mission — altitude, velocity, and fuel — remained nominal, the system was still succeeding.
| Bales’s “cheat sheet” with the cryptic error messages highlighted (NASA) |
Despite the reboots continuing to flash alerts, all the answers to his questions were “Yes”.
The astronauts got the answer they needed — “You are go for landing!”.
By shouting "Go!" over the loop, Bales performed the ultimate act of modern observability: he filtered out the noise to focus on the outcome. He turned a potentially literal "Crash" into a historic success because he knew which signals mattered.
The rest, as they say, is history.
Fast forward to 2026.
We may not be landing on the Moon, but we’re dealing with the same problems. Our Golden Signals do not reflect our readiness to land on the Moon but the capability of our systems to serve our customers. We typically define these critical Golden Signals as Latency (how long the system takes to respond), Traffic (how much demand the system is receiving), Saturation (how “full” the system’s resources, queues, or containers are), and Errors (the rate of failed requests).
However, we rarely have all these signals available at our fingertips. We need to collect, collate, aggregate, and interpret information from across the observability, optimization, security, and operations systems in our stack. Once we’ve extracted insights, we need to decide what to do with them: choose and execute the actions that will resolve our problems and return our systems to proper behaviour.
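Bales’s go/no-go reasoning maps surprisingly well onto this Golden Signals model. As a minimal sketch — with entirely hypothetical names and thresholds (real values would come from your own SLOs) — the decision “ignore the noisy alarms as long as the signals that matter are nominal” might look like this in Python:

```python
from dataclasses import dataclass

@dataclass
class GoldenSignals:
    latency_ms: float   # how long requests take to complete
    traffic_rps: float  # requests per second the system is receiving
    saturation: float   # fraction of capacity in use (0.0 to 1.0)
    error_rate: float   # fraction of requests that fail (0.0 to 1.0)

# Hypothetical SLO thresholds, purely for illustration.
THRESHOLDS = {
    "latency_ms": 500.0,
    "saturation": 0.9,
    "error_rate": 0.01,
}

def go_no_go(signals: GoldenSignals) -> tuple[bool, list[str]]:
    """Return (go, breached): 'go' only if no golden signal breaches its SLO."""
    breached = []
    if signals.latency_ms > THRESHOLDS["latency_ms"]:
        breached.append("latency")
    if signals.saturation > THRESHOLDS["saturation"]:
        breached.append("saturation")
    if signals.error_rate > THRESHOLDS["error_rate"]:
        breached.append("errors")
    return (not breached, breached)

# Like Bales: alarms may keep flashing, but if the golden signals
# are nominal, the system is still succeeding -- "You are go!".
nominal = GoldenSignals(latency_ms=120, traffic_rps=850,
                        saturation=0.6, error_rate=0.002)
print(go_no_go(nominal))  # (True, [])
```

In practice the thresholds would be derived from service-level objectives and the verdict would feed an alerting or automation pipeline, but the core idea is the same: abort decisions hang on outcomes, not on individual noisy alerts.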
| NASA Mission Control – so many engineers, so many signals, so many insights & possible actions |
Armstrong, Aldrin, Bales, and all the other engineers on call during that historic event made the right choice, turned their insights into the correct actions, based on their planning, experience, and expertise. They turned an emergency into “business as usual” on the way to the Moon.
But here is the reality of 2026: Your stack is millions of times more complex than the Apollo computers. Can you afford to wait for a 'Neil Armstrong' to intervene during an incident?
| Apollo had fewer signals but clearer decisions. Modern systems generate more data — but not always better outcomes. |
This is where modern solutions such as the newly announced IBM Concert platform come into play — think of Concert as your digital Mission Control.
Concert constantly examines your environment and keeps up with changes. It correlates the various anomalies in your system (whether logs, events, or metrics, from applications, cloud services, physical infrastructure, or anywhere else) with the possible solutions (fully automated, human-in-the-loop, or manual) and recommends the next best action to take, at the right time, by the right person.
Concert generates insights, and translates them into actions – Who, What, When, How.
Now, if Neil Armstrong and Buzz Aldrin were in control of your operations or Steve Bales were your SRE, you might not need IBM Concert… but they’re not.
That means you need something just as critical: the ability to turn insight into action — instantly.
| 1960s Earthrise by Apollo 8 and 2020s Earthset by Artemis II |
In 1969, Armstrong & Aldrin had Bales. Bales had Garman and a single sheet of paper.
In 2026, you have the IBM Concert platform.
Apollo 11 proves that the real problem in IT operations isn’t visibility — it’s knowing what to do next.
The views expressed in this article are mine and do not necessarily represent the official position of my employer.