Don't believe "software never fails"
The claim "software never fails" is cavalier and dangerous. It creates a mindset that itself leads to software failures.
On March 26, 2024, a container ship collided with one of the piers of the Francis Scott Key Bridge in Maryland. Within 30 seconds, the entire bridge had collapsed into the Patapsco River, killing six people [1].
The collapsed Francis Scott Key Bridge
I'm going to make a bold claim: the bridge didn't fail.
That's right. Six people dead. $15M economic loss per day. The bridge didn't fail.
After all, it did exactly what it was designed to do. The bridge was doomed the moment its blueprints were approved, before a single truss was machined, before a single pier was erected. From the moment of impact, its collapse was mechanically inevitable. A computer simulation could have easily predicted it.
How does my claim make you feel? How do you think the families of the dead would feel? They'd be ready to throw me off that bridge! Of course the bridge failed. Look at it!
Savor your righteous indignation, because that is exactly what I felt when I read the post Software Never Fails [2].
It claims that:
The brief reason we can say software never fails is that software always does precisely what we told it to do. The outcome may be undesirable, and it may be unexpected, but the software itself never deviates from the instructions it was compiled into.
The post offers a definition of failure:
A failure is the event that occurs when a component first does what we designed it to, then something happens, and then the component no longer does what it was designed to do.
If we are to consistently apply this claim and definition, then the Francis Scott Key Bridge didn't fail. It did precisely what it was designed to do.
Software developers: Now imagine this kind of thinking applied to your codebase, or the medical device your grandmother relies on.
The failure of "software never fails"
The syllogism presented in Software Never Fails goes:
- Premise 1: Failure means some component changed from working to non-working.
- Premise 2: A program's compiled instructions never change.
- Conclusion: "Software is design-stuff. It doesn't fail ever."
Both premises are false, and so is the conclusion.
A faulty definition
Premise 1 is false by the very definition of failure it relies on.
The article cherry-picks a definition from hardware reliability engineering which says that failure is a state transition from working to non-working: a capacitor dries out, a bearing wears down. While this definition is suitable for hardware, it doesn't apply to programs. If a program "never deviates from the instructions it was compiled into", how can it ever make that transition? Software is definitionally excluded.
In When the Machine Metaphor Breaks Down, I compared and contrasted software failures to hardware failures. I provided precise definitions of failure that apply to both:
- Sins of commission: The thing did something it was required not to.
- Sins of omission: The thing didn't do something it was required to.
- Latent faults: Bugs that are present but not causing observable failures.
- Unmet expectations: The user was surprised the thing did or didn't do something.
- Compromised safety: A hazard or loss occurred and the thing was involved, even if it performed as required and expected.
I also showed that hardware and software fail for different causes:
- Hardware fails as its internal physical condition deteriorates.
- A program fails as its environment changes, invalidating its assumptions and preconditions (see the sketch after this list).
- A codebase fails as its design integrity dissolves and defect count rises.
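To make the second point concrete, here is a minimal, hypothetical sketch: a program whose instructions never change, yet which starts failing the day its environment does. The upstream service, field name, and date format are invented for illustration.

```python
from datetime import datetime

def parse_ship_date(payload: dict) -> datetime:
    # Assumption baked in at design time: the upstream service
    # always sends dates as "YYYY-MM-DD".
    return datetime.strptime(payload["ship_date"], "%Y-%m-%d")

# Day 1: the assumption holds, so the unchanged program "works".
print(parse_ship_date({"ship_date": "2024-03-26"}))

# Day 400: the upstream team switches to ISO 8601 timestamps.
# Not one instruction in this program changed, yet every call now fails.
try:
    parse_ship_date({"ship_date": "2024-03-26T01:28:00Z"})
except ValueError as error:
    print(f"Failure without a single changed instruction: {error}")
```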
Software is surprising
Premise 2 is also false.
The claim "software never changes" wants us to believe that programs are deterministic, and if deterministic, then predictable, and therefore incapable of surprising us with failure.
Deterministic means the same inputs produce the same outputs.
But programs are effectively nondeterministic. Yes, it's narrowly correct that compiled bytecode doesn't spontaneously mutate, but the argument conflates instructions with execution. A program's behavior depends not only on its bytecode but on its entire input space. Proving a program correct requires proving it correct for all possible inputs and states. As Dijkstra wrote, tests can show the presence of bugs, but never show their absence.
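As a hypothetical illustration of Dijkstra's point: the function below is perfectly deterministic and its tests pass, yet a bug waits in the part of the input space we never exercised. The function and its tests are invented for this example.

```python
def average_latency(samples_ms: list[float]) -> float:
    # Deterministic: the same input always produces the same output.
    return sum(samples_ms) / len(samples_ms)

# The tests we happened to write all pass...
assert average_latency([10.0, 20.0, 30.0]) == 20.0
assert average_latency([5.0]) == 5.0

# ...but they only demonstrate correct behavior for the inputs we tried,
# not the absence of bugs. The first quiet hour in production, when no
# samples arrive, the same unchanged instructions divide by zero.
try:
    average_latency([])
except ZeroDivisionError:
    print("Deterministic, tested, and still surprising.")
```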
The danger of believing "software never fails"
Design and implementation conflated
In building architecture, the distinction between design and construction is clear. Design is clay models and blueprints. Construction is pouring concrete and milling steel. Many structural failures happened because the construction did not follow the design [3].
Software also has distinct design and implementation activities. Sometimes this is easy to see: it's common industry practice for a software architect to hand over a design specification to another team, perhaps even an offshore one. The distinction is harder to observe if the same person does the designing and the implementing.
When the Therac-25 radiation therapy machine killed patients in the 1980s, investigators found that the software implementation failed to properly sequence safety interlocks that the design had specified. The design was sound; the implementation was not. If we say "software is pure design," we have no vocabulary to disambiguate these failure modes, and we have no framework for preventing them.
Implementation errors are mitigated with verification, i.e. asking the question, "did we build the thing correctly?" Design errors are mitigated with validation, i.e. asking "are we building the correct thing?"
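A hypothetical sketch of that distinction: the test below verifies that the implementation matches its specification, but only validation can reveal that the specification itself is wrong. The pricing rule and function name are invented for illustration.

```python
def monthly_price_cents(annual_price_cents: int) -> int:
    # The spec (the design) says: "divide the annual price by 12 and truncate."
    return annual_price_cents // 12

# Verification: did we build the thing correctly?
# Yes -- the implementation matches its spec, and this test proves it.
assert monthly_price_cents(1000) == 83

# Validation: are we building the correct thing?
# No test of the code can answer that. If the business actually needed to
# round up (84 cents) so it never undercharges, the spec itself is the
# error, and a perfectly verified implementation still fails its users.
```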
In A Design is a Mold for Code, I delve into the distinction between design and code.
Presbyopic causal chains
The benevolent intent of the phrase "software never fails" is to encourage us to look beyond proximal causes of failure and seek distal causes, but it goes too far. A high-quality report on a commercial airliner crash will include both: the plane had a problem, and a long causal chain of lax training, spotty maintenance, regulatory capture, and eroded safety culture from the very top allowed the problem to cascade into a catastrophe [4].
Presbyopia is the inability of the eye to focus clearly on close objects [5], and I'm concerned that people who believe "software never fails" will look for distant causes of failure while allowing the immediate ones to blur into invisibility.
What is a causal chain? Many of us have had or heard a conversation with a young child that went something like this:
"Why can't I play on the grass?" "Because it's wet, sweetie" "Why?" "Because I turned the valve on." "Why?" "Because the grass is turning brown" "Why?" "Because we live in New Mexico" "Why?" "Because your great-grandfather moved here in the '40s" "Why?" "Because the Wright Brothers invented the airplane. This made the Japanese attack on Pearl Harbor possible, which brought the USA into World War 2, which resulted in the founding of Los Alamos in 1943.
As you can see, the more removed a cause is from its effect, the less plausible its association and the less actionable any mitigation or prevention. A proximal cause of my grass being wet is that I turned the valve on. A distal cause is that the Wright Brothers invented the airplane. Removing the middle links from the chain of causality reveals the absurdity: "You can't play outside because airplanes exist."
If we believe a claim like "software never fails", we sever the proximal links in the causal chain. We'll cast a wide net looking for monsters in other castles but not the ones under our beds.
Fix: The 4-part filter for relevant causes
Use this framework to discover relevant distal causes of failure without ignoring proximal causes; a sketch of applying it follows the list. When engineers analyze why a failure happened or could happen, they privilege causes that:
- are proximal, physically or temporally. My hand turned the valve, and a few seconds later the grass got wet. The Pearl Harbor attack and invention of the airplane happened long ago and far away.
- exhibit agency. I chose to turn the valve. I didn't choose where to found Los Alamos.
- are controllable. Turning a valve is an event I can control (separately from choosing to). There's no action I can take about historical wars.
- are normal and expected. Turning a valve is a common event. Wars aren't.
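Here is one hypothetical way to apply the filter during a postmortem, as a simple checklist over candidate causes. The `Cause` structure and the example causes are invented for illustration, not a prescribed tool.

```python
from dataclasses import dataclass

@dataclass
class Cause:
    description: str
    proximal: bool      # close to the failure in space and time
    agency: bool        # someone chose to do (or not do) it
    controllable: bool  # we could act on it next time
    normal: bool        # a routine, expected kind of event

    def relevance(self) -> int:
        # Crude score: one point per criterion the cause satisfies.
        return sum([self.proximal, self.agency, self.controllable, self.normal])

causes = [
    Cause("I turned the irrigation valve on", True, True, True, True),
    Cause("The Wright Brothers invented the airplane", False, False, False, False),
]

# Rank candidate causes so the proximal, actionable ones stay in focus,
# while the distal ones are still recorded.
for cause in sorted(causes, key=Cause.relevance, reverse=True):
    print(f"{cause.relevance()}/4  {cause.description}")
```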
Latent defects ignored
Bad metaphors breed bad decisions. What might actually happen if a software team internalizes the idea that "software never fails"?
If a team thinks software never fails, they might not allow themselves to believe their software could contain latent defects. They might think, "It works today, so it will always work." In their false confidence, they might skip testing, breeze through code review, dismiss bug reports, or skip error handling. This may let them ship quickly at first, only to get bogged down fixing bugs later. In the meantime, hopefully those bugs don't cause harm to anyone.
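As a hypothetical example of the kind of latent defect such a team waves through: the cache below works in every demo and review, but the missing eviction is a fault that stays silent until scale turns it into a failure. The function and cache are invented for illustration.

```python
from typing import Callable

# A latent defect: nothing here misbehaves today.
_user_cache: dict[int, dict] = {}

def get_user(user_id: int, fetch: Callable[[int], dict]) -> dict:
    # Looks fine in review and passes every test that uses a handful of users.
    if user_id not in _user_cache:
        _user_cache[user_id] = fetch(user_id)
    return _user_cache[user_id]

# The fault: entries are never evicted. With 50 users it is invisible;
# with 50 million it exhausts memory. "It works today" says nothing about
# the defect already sitting in the code, waiting for its trigger.
print(get_user(1, lambda uid: {"id": uid, "name": "example"}))
```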
Fix: Use better metaphors
In It's Not Tech Debt. It's Tech Risk, I show that the metaphors we use shape how seriously we take software problems.
Eroded alignment with nonexperts
The phrase "software never fails" communicates the wrong thing to non-practitioners. When a politician or mid-level manager hears "X never fails", they hear, "We never have to think about that." They don't share your nuanced understanding. This mindset could lead to poor business decisions, for example when leadership says "ship it" or downsizes the QA team. More broadly, it could lead to public harm when an elected or appointed official makes a misguided public policy decision.
Fix: Use different words
Let your word choices build, not erode, alignment between you and decision makers. For guidance on translating technical risks into terms leadership will prioritize, see my post Tech Risk is Business Risk.
A professional structural engineer would never say "bridges never fail". Instead of propagating incorrect absolute statements like "software never fails", society should train and recognize professional software engineers. My post It's Time to License Software Engineering argues the details.
What to do instead
Change your mindset
When analyzing software failures:
- Use precise definitions of failure that are appropriate for software.
- Distinguish between design and implementation. Remember validation and verification.
- Use my 4-part filter for relevant causes.
- Use metaphors that discourage hubris and language that communicates the right things to leaders.
Concrete actions
- Pretend your next commit message or PR description will be printed on the front page of the New York Times. How will your word choices come across to your manager, CEO, elected government officials, or to the public?
- In your next code review this week, identify one assumption about the environment that could become invalid. Are there places where the code assumes "this will always work"?
- Bookmark this page. The next time a postmortem goes off the rails, share it.
- Start a discussion. In Slack, Teams, or another collaboration tool, ask, "How do we distinguish design errors from implementation errors?"
Summary
Saying a program can't fail because its deployed bytecode doesn't change is like saying the Francis Scott Key Bridge didn't fail because its steel molecules obeyed the laws of physics. One might object, "the bridge was destroyed by an external force, not internal failure!", and this is exactly my point about software. Programs eventually experience external forces that their designers didn't anticipate. If we excuse the bridge because it wasn't designed to withstand a collision with a modern container ship, then we must equally excuse software that fails when its environment changes. The question is how designs can anticipate and accommodate these stresses.
References
1. Francis Scott Key Bridge Collapse
2. Software Never Fails
3. Hyatt Regency walkway collapse
4. Trial by Fire: The crash of Aeroflot flight 1492
5. Presbyopia
Related reading
- When the Machine Metaphor Breaks Down. How software failures differ from hardware failures
- A Design is a Mold for Code. Why design precedes and shapes code, not the reverse
- It's Not Tech Debt. It's Tech Risk. How the metaphors we use shape how seriously we take software problems
- Tech Risk is Business Risk. Translating technical risks into terms leadership will prioritize
- It's Time to License Software Engineering. Why society should train and recognize professional software engineers